8 Statistics Functions

6.9

8 Statistics Functions

package: math-lib

This module exports functions that compute statistics, meaning summary values for collections of samples, and functions for managing sequences of weighted or unweighted samples.

Most of the functions that compute statistics accept a sequence of nonnegative reals that correspond one-to-one with sample values. These are used as weights; equivalently counts, pseudocounts or unnormalized probabilities. While this makes it easy to work with weighted samples, it introduces some subtleties in bias correction. In particular, central moments must be computed without bias correction by default. See Expected Values for a discussion.

8.1 Expected Values

8.2 Running Expected Values

8.3 Correlation

8.4 Counting and Binning

8.5 Order Statistics

8.6 Simulations

8.1 Expected Values

Functions documented in this section that compute higher central moments, such as variance, stddev and skewness, can optionally apply bias correction to their estimates. For example, when variance is given the argument #:bias #t, it multiplies the result by (/ n (- n 1)), where n is the number of samples.

The meaning of “bias correction” becomes less clear with weighted samples, however. Often, the weights represent counts, so when moment-estimating functions receive #:bias #t, they interpret it as “use the sum of ws for n.” In the following example, the sample 4 is first counted twice and then given weight 2; therefore n = 5 in both cases:

> (variance '(1 2 3 4 4) #:bias #t)
- : Real [more precisely: Nonnegative-Real]
17/10
> (variance '(1 2 3 4) '(1 1 1 2) #:bias #t)
- : Real [more precisely: Nonnegative-Real]
17/10

However, sample weights often do not represent counts. For these cases, the #:bias keyword can be followed by a real-valued pseudocount, which is used for n:

> (variance '(1 2 3 4) '(1/2 1/2 1/2 1) #:bias 5)
- : Real [more precisely: Nonnegative-Real]
17/10

Because the magnitude of the bias correction for weighted samples cannot be known without user guidance, in all cases, the bias argument defaults to #f.

procedure
(mean xs [ws]) → Real
xs : (Sequenceof Real)
ws : (U #f (Sequenceof Real)) = #f

When ws is #f (the default), returns the sample mean of the values in xs. Otherwise, returns the weighted sample mean of the values in xs with corresponding weights ws.

Examples:

> (mean '(1 2 3 4 5))
- : Real
3
> (mean '(1 2 3 4 5) '(1 1 1 1 10.0))
- : Real
4.285714285714286
> (define d (normal-dist))
> (mean (sample d 10000))
- : Real
0.003660765502416554
> (define arr (array-strict (build-array #(5 1000) (λ (_) (sample d)))))
> (array-map mean (array->list-array arr 1))
- : (Array Real)
(array
#[0.015065941536448331
   0.0947494841513088
   0.02799517497862471
   -0.005390713068443041
   -0.008329513486680888])

procedure
(variance xs [ws #:bias bias]) → Nonnegative-Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f
procedure
(stddev xs [ws #:bias bias]) → Nonnegative-Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f
procedure
(skewness xs [ws #:bias bias]) → Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f
procedure
(kurtosis xs [ws #:bias bias]) → Nonnegative-Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f

If ws is #f, these compute the sample variance, standard deviation, skewness and excess kurtosis the samples in xs. If ws is not #f, they compute weighted variations of the same.

Examples:

> (stddev '(1 2 3 4 5))
- : Real [more precisely: Nonnegative-Real]
1.4142135623730951
> (stddev '(1 2 3 4 5) '(1 1 1 1 10))
- : Real [more precisely: Nonnegative-Real]
1.2777531299998799

See Expected Values for the meaning of the bias keyword argument.

procedure
(variance/mean m xs [ws #:bias bias]) → Nonnegative-Real
  m : Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f
procedure
(stddev/mean m xs [ws #:bias bias]) → Nonnegative-Real
  m : Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f
procedure
(skewness/mean m xs [ws #:bias bias]) → Real
  m : Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f
procedure
(kurtosis/mean m xs [ws #:bias bias]) → Nonnegative-Real
  m : Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f

Like variance, stddev, skewness and kurtosis, but computed using known mean m.

8.2 Running Expected Values

The statistics object allows computing the sample minimum, maximum, count, mean, variance, skewness, and excess kurtosis of a sequence of samples in O(1) space.

To use it, start with empty-statistics, then use update-statistics to obtain a new statistics object with updated values. Use statistics-min, statistics-mean, and similar functions to get the current estimates.

Example:

> (let* ([s  empty-statistics]
         [s  (update-statistics s 1)]
         [s  (update-statistics s 2)]
         [s  (update-statistics s 3)]
         [s  (update-statistics s 4 2)])
    (values (statistics-mean s)
            (statistics-stddev s #:bias #t)))
- : (values Flonum Flonum) [more precisely: (Values Flonum Nonnegative-Flonum)]
2.8
1.3038404810405297

struct
(struct statistics (min max count))
  min : Flonum
  max : Flonum
  count : Nonnegative-Flonum

Represents running statistics.

The min and max fields are the minimum and maximum value observed so far, and the count field is the total weight of the samples (which is the number of samples if all samples are unweighted). The remaining, hidden fields are used to compute moments, and their number and meaning may change in future releases.

value
empty-statistics : statistics

The empty statistics object.

Examples:

> (statistics-min empty-statistics)
- : Flonum
+inf.0
> (statistics-max empty-statistics)
- : Flonum
-inf.0
> (statistics-range empty-statistics)
- : Flonum [more precisely: Nonnegative-Flonum]
+nan.0
> (statistics-count empty-statistics)
- : Flonum [more precisely: Nonnegative-Flonum]
0.0
> (statistics-mean empty-statistics)
- : Flonum
+nan.0
> (statistics-variance empty-statistics)
- : Flonum [more precisely: Nonnegative-Flonum]
+nan.0
> (statistics-skewness empty-statistics)
- : Flonum
+nan.0
> (statistics-kurtosis empty-statistics)
- : Flonum [more precisely: Nonnegative-Flonum]
+nan.0

procedure
(update-statistics s x [w]) → statistics
  s : statistics
  x : Real
  w : Real = 1.0

Returns a new statistics object that includes x in the computed statistics. If w is given, x is weighted by w in the moment computations.

procedure
(update-statistics* s xs [ws]) → statistics
  s : statistics
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f

Like update-statistics, but includes all of xs, possibly weighted by corresponding elements in ws, in the returned statistics object.

Examples:

> (define s (update-statistics* empty-statistics '(1 2 3 4) '(1 1 1 2)))
> (statistics-mean s)
- : Flonum
2.8
> (statistics-stddev s #:bias #t)
- : Flonum [more precisely: Nonnegative-Flonum]
1.3038404810405297

This function uses O(1) space regardless of the length of xs.

procedure
(statistics-range s) → Nonnegative-Flonum
  s : statistics
procedure
(statistics-mean s) → Flonum
  s : statistics
procedure
(statistics-variance s [#:bias bias]) → Nonnegative-Flonum
  s : statistics
  bias : (U #t #f Real) = #f
procedure
(statistics-stddev s [#:bias bias]) → Nonnegative-Flonum
  s : statistics
  bias : (U #t #f Real) = #f
procedure
(statistics-skewness s [#:bias bias]) → Flonum
  s : statistics
  bias : (U #t #f Real) = #f
procedure
(statistics-kurtosis s [#:bias bias]) → Nonnegative-Flonum
  s : statistics
  bias : (U #t #f Real) = #f

Compute the range, mean, variance, standard deviation, skewness, and excess kurtosis of the observations summarized in s.

See Expected Values for the meaning of the bias keyword argument.

8.3 Correlation

procedure
(covariance xs ys [ws #:bias bias]) → Real
  xs : (Sequenceof Real)
  ys : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f
procedure
(correlation xs ys [ws #:bias bias]) → Real
  xs : (Sequenceof Real)
  ys : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f

Compute the sample covariance and correlation of xs and ys, optionally weighted by ws.

Examples:

> (define xs (sample (normal-dist) 10000))
> (define ys (map (λ: ([x : Real]) (sample (normal-dist x))) xs))
> (correlation xs ys)
- : Real
0.7079561916936102

Removing the correlation using importance weights:

> (define ws (map (λ: ([x : Real] [y : Real])
                    (/ (pdf (normal-dist) y)
                       (pdf (normal-dist x) y)))
                  xs ys))
> (correlation xs ys (ann ws (Sequenceof Real)))
- : Real
0.08030523396049839

See Expected Values for the meaning of the bias keyword argument.

procedure
(covariance/means mx my xs ys [ws #:bias bias]) → Real
  mx : Real
  my : Real
  xs : (Sequenceof Real)
  ys : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f
procedure
(correlation/means mx my xs ys [ws #:bias bias]) → Real
  mx : Real
  my : Real
  xs : (Sequenceof Real)
  ys : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f

Like covariance and correlation, but computed using known means mx and my.

8.4 Counting and Binning

procedure
(samples->hash xs) → (HashTable A Positive-Integer)
  xs : (Sequenceof A)
(samples->hash xs ws) → (HashTable A Nonnegative-Real)
  xs : (Sequenceof A)
  ws : (U #f (Sequenceof Real))

Returns a hash table mapping each unique element in xs (under equal?) to its count, or, if ws is not #f, to its total weight.

Examples:

> (samples->hash '(1 2 3 4 4))
- : (HashTable Integer Positive-Integer)
'#hash((4 . 2) (3 . 1) (2 . 1) (1 . 1))
> (samples->hash '(1 1 2 3 4) '(1/2 1/2 1 1 2))
- : (HashTable Integer Nonnegative-Real)
'#hash((4 . 2) (3 . 1) (2 . 1) (1 . 1))

procedure
(count-samples xs)
→ (Values (Listof A) (Listof Positive-Integer))
  xs : (Sequenceof A)
(count-samples xs ws)
→ (Values (Listof A) (Listof Nonnegative-Real))
  xs : (Sequenceof A)
  ws : (U #f (Sequenceof Real))

Like samples->hash, but returns two lists. The elements in the returned (Listof A) are in order of first appearance in xs.

Examples:

> (count-samples '(1 2 3 4 4))
- : (values (Listof Positive-Byte) (Listof Positive-Integer))
'(1 2 3 4)
'(1 1 1 2)
> (count-samples '(1 1 2 3 4) '(1/2 1/2 1 1 2))
- : (values (Listof Positive-Byte) (Listof Nonnegative-Real))
'(1 2 3 4)
'(1 1 1 2)

struct
(struct sample-bin (min max values weights))
  min : B
  max : B
  values : (Listof A)
  weights : (U #f (Listof Nonnegative-Real))

Represents a bin, or a group of samples within an interval in a total order. The values and bounds have a different type to allow bin-samples/key to group elements based on a function of their values.

procedure
(bin-samples bounds lte? xs ws) → (Listof (sample-bin A A))
  bounds : (Sequenceof A)
  lte? : (A A -> Any)
  xs : (Sequenceof A)
  ws : (U #f (Sequenceof Real))

Similar to (sort xs lte?), but additionally groups samples into bins. The bins’ bounds are sorted before binning xs.

If n = (length bounds), then bin-samples returns at least (- n 1) bins, one for each pair of adjacent (sorted) bounds. If some values in xs are less than the smallest bound, they are grouped into a single bin in front. If some are greater than the largest bound, they are grouped into a single bin at the end.

Examples:

> (bin-samples '() <= '(0 1 2 3 4 5 6))
- : (Listof (sample-bin Byte Byte))
(list (sample-bin 0 6 '(0 1 2 3 4 5 6) #f))
> (bin-samples '(3) <= '(0 1 2 3 4 5 6))
- : (Listof (sample-bin Byte Byte))
(list (sample-bin 0 3 '(0 1 2 3) #f) (sample-bin 3 6 '(4 5 6) #f))
> (bin-samples '(2 4) <= '(0 1 2 3 4 5 6))
- : (Listof (sample-bin Byte Byte))
(list
(sample-bin 0 2 '(0 1 2) #f)
(sample-bin 2 4 '(3 4) #f)
(sample-bin 4 6 '(5 6) #f))
> (bin-samples '(2 4) <=
'(0 1 2 3 4 5 6)
'(10 20 30 40 50 60 70))
- : (Listof (sample-bin Byte Byte))
(list
(sample-bin 0 2 '(0 1 2) '(10 20 30))
(sample-bin 2 4 '(3 4) '(40 50))
(sample-bin 4 6 '(5 6) '(60 70)))

If lte? is a less-than-or-equal relation, the bins represent half-open intervals (min, max] (except possibly the first, which may be closed). If lte? is a less-than relation, the bins represent half-open intervals [min, max) (except possibly the last, which may be closed). In either case, the sorts applied to bounds and xs are stable.

Because intervals used in probability measurements are normally open on the left, prefer to use less-than-or-equal relations for lte?.

If ws is #f, bin-samples returns bins with #f weights.

procedure
(bin-samples/key bounds lte? key xs ws) → (Listof (sample-bin A B))
  bounds : (Sequenceof B)
  lte? : (B B -> Any)
  key : (A -> B)
  xs : (Sequenceof A)
  ws : (U #f (Sequenceof Real))

Similar to (sort xs lte? #:key key #:cache-keys? #t), but additionally groups samples into bins.

Example:

> (bin-samples/key '(2 4) <= (inst car Real String)
'((1 . "1") (2 . "2") (3 . "3") (4 . "4") (5 . "5")))
- : (Listof (sample-bin (Pairof Positive-Byte String) Real))
(list
(sample-bin 1 2 '((1 . "1") (2 . "2")) #f)
(sample-bin 2 4 '((3 . "3") (4 . "4")) #f)
(sample-bin 4 5 '((5 . "5")) #f))

See bin-samples for the simpler, one-type variant.

procedure
(sample-bin-compact bin) → (sample-bin A B)
bin : (sample-bin A B)

Compacts bin by applying count-samples to its values and weights.

Example:

> (sample-bin-compact (sample-bin 1 4 '(1 2 3 4 4) #f))
- : (sample-bin Positive-Byte Positive-Byte)
(sample-bin 1 4 '(1 2 3 4) '(1 1 1 2))

procedure
(sample-bin-total bin) → Nonnegative-Real
bin : (sample-bin A B)

If (sample-bin-weights bin) is #f, returns the number of samples in bin; otherwise, returns the sum of their weights.

Examples:

> (sample-bin-total (sample-bin 1 4 '(1 2 3 4 4) #f))
- : Real [more precisely: Nonnegative-Real]
5
> (sample-bin-total (sample-bin-compact (sample-bin 1 4 '(1 2 3 4 4) #f)))
- : Real [more precisely: Nonnegative-Real]
5

8.5 Order Statistics

procedure
(sort-samples lt? xs) → (Listof A)
  lt? : (A A -> Any)
  xs : (Sequenceof A)
(sort-samples lt? xs ws)
→ (Values (Listof A) (Listof Nonnegative-Real))
  lt? : (A A -> Any)
  xs : (Sequenceof A)
  ws : (U #f (Sequenceof Real))

Sorts possibly weighted samples according to lt?, which is assumed to define a total order over the elements in xs.

Examples:

> (sort-samples < '(5 2 3 1))
- : (Listof Positive-Byte)
'(1 2 3 5)
> (sort-samples < '(5 2 3 1) '(50 20 30 10))
- : (values (Listof Positive-Byte) (Listof Nonnegative-Real))
'(1 2 3 5)
'(10 20 30 50)
> (sort-samples < '(5 2 3 1) #f)
- : (values (Listof Positive-Byte) (Listof Nonnegative-Real))
'(1 2 3 5)
'(1 1 1 1)

Because sort-samples is defined in terms of sort, the sort is only guaranteed to be stable if lt? is strictly a less-than relation.

procedure
(median lt? xs [ws]) → A
  lt? : (A A -> Any)
  xs : (Sequenceof A)
  ws : (U #f (Sequenceof Real)) = #f

Equivalent to (quantile 1/2 lt? xs ws).

procedure
(quantile p lt? xs [ws]) → A
  p : Real
  lt? : (A A -> Any)
  xs : (Sequenceof A)
  ws : (U #f (Sequenceof Real)) = #f

Computes the inverse of the empirical cdf represented by the samples xs, which are optionally weighted by ws.

Examples:

> (quantile 0 < '(1 3 5))
- : Integer [more precisely: Positive-Byte]
1
> (quantile 0.5 < '(1 2 3 4))
- : Integer [more precisely: Positive-Byte]
2
> (quantile 0.5 < '(1 2 3 4) '(0.25 0.2 0.2 0.35))
- : Integer [more precisely: Positive-Byte]
3

If p = 0, quantile returns the smallest element of xs under the ordering relation lt?. If p = 1, it returns the largest element.

For weighted samples, quantile sorts xs and ws together (using sort-samples), then finds the least x for which the proportion of its cumulative weight is greater than or equal to p.

For unweighted samples, quantile uses the quickselect algorithm to find the element that would be at index (ceiling (- (* p n) 1)) if xs were sorted, where n is the length of xs.

procedure
(absdev xs [ws]) → Nonnegative-Real
xs : (Sequenceof Real)
ws : (U #f (Sequenceof Real)) = #f

Computes the average absolute difference between the elements in xs and (median < xs ws). If ws is not #f, it is a weighted average.

procedure
(absdev/median median xs [ws]) → Nonnegative-Real
  median : Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f

Like (absdev xs ws), but computed using known median median.

procedure
(hpd-interval lt? δ p xs [ws]) → (Values A A)
  lt? : (A A -> Any)
  δ : (A A -> Real)
  p : Real
  xs : (Sequenceof A)
  ws : (U #f (Sequenceof Real)) = #f
procedure
(hpd-interval/sorted δ p xs [ws]) → (Values A A)
  δ : (A A -> Real)
  p : Real
  xs : (Sequenceof A)
  ws : (U #f (Sequenceof Real)) = #f

Estimates the smallest interval for which the distribution that produced xs (optionally weighted by ws) assigns probability p, which must be positive. The type A represents an ordered metric space with order lt? and metric δ.

To compute an HPD interval from sorted samples, use hpd-interval/sorted.

You almost certainly want to use real-hpd-interval or real-hpd-interval/sorted instead, which are defined in terms of these.

procedure
(real-hpd-interval p xs [ws]) → (Values Real Real)
  p : Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
procedure
(real-hpd-interval/sorted p xs [ws]) → (Values Real Real)
  p : Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f

Equivalent to (hpd-interval < - p xs ws) and (hpd-interval/sorted - p xs ws).

Examples:

> (define beta32 (beta-dist 3 2))
> (real-dist-hpd-interval beta32 0.8)
- : (values Flonum Flonum)
0.36542991742846176
0.8939657937826784
> (real-hpd-interval 0.8 (sample beta32 10000))
- : (values Real Real)
0.37311549017513157
0.8996687451858313

8.6 Simulations

The functions in this section support Monte Carlo simulation; for example, quantifying uncertainty about statistics estimated from samples.

procedure
(mc-variance xs [ws #:bias bias]) → Nonnegative-Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f
procedure
(mc-stddev xs [ws #:bias bias]) → Nonnegative-Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f

Estimate the variance and standard deviation of expected values computed from random samples.

If xs are random variable samples, then

(define θ (mean xs ws))

is also a random variable sample. These two values:

(mc-variance xs ws #:bias bias)
(mc-stddev xs ws #:bias bias)

estimate the variance and standard deviation of θ. The latter is simply the square root of the variance, and bias correction is applied as described in Expected Values.

Two different ways to estimate the standard deviation of a mean computed from 1000 samples are

> (mc-stddev (sample (normal-dist 0 1) 1000))
- : Real [more precisely: Nonnegative-Real]
0.030788138085986416
> (stddev (for/list : (Listof Real) ([_ (in-range 100)])
(mean (sample (normal-dist 0 1) 1000))))
- : Real [more precisely: Nonnegative-Real]
0.031583191347587164

Using mc-stddev is 100 times faster in this case, and in most statistical applications, replicating a sampling process 100 times is infeasible.

procedure
(mc-stddev/mean m xs [ws #:bias bias]) → Nonnegative-Real
  m : Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f
procedure
(mc-variance/mean m xs [ws #:bias bias]) → Nonnegative-Real
  m : Real
  xs : (Sequenceof Real)
  ws : (U #f (Sequenceof Real)) = #f
  bias : (U #t #f Real) = #f

Use these in the exceedingly rare cases in which you know the mean m but are estimating uncertainty in an estimate of the mean anyway.

procedure
(indicator pred?) → (A -> (U 0 1))
pred? : (A -> Any)

Converts a predicate into an indicator function.

Examples:

> (fl (mean (map (indicator (λ ([x : Real]) (< -inf.0 x -1)))
(sample (normal-dist 0 1) 5000))))
- : Flonum
0.158
> (real-dist-prob (normal-dist 0 1) -inf.0 -1)
- : Flonum
0.15865525393145705

procedure
(mc-probability pred? xs [ws]) → Nonnegative-Real
  pred? : (A -> Any)
  xs : (Sequenceof A)
  ws : (U #f (Sequenceof Real)) = #f

Estimates the probability of pred? from possibly weighted samples. Equivalent to (mean (sequence-map (indicator pred?) xs) ws).

Example:

> (fl (mc-probability (λ ([x : Real]) (< -inf.0 x -1))
(sample (normal-dist 0 1) 5000)))
- : Flonum [more precisely: Nonnegative-Flonum]
0.16

procedure
(mc-prob-dist pred? xs [ws]) → Beta-Dist
  pred? : (A -> Any)
  xs : (Sequenceof A)
  ws : (U #f (Sequenceof Real)) = #f

Returns a beta distribution estimated from possibly weighted samples whose mean is (mc-probability pred? xs ws).

Computing a confidence interval for a probability whose endpoints are guaranteed to be between 0 and 1:

> (real-dist-hpd-interval
   (mc-prob-dist (λ ([x : Real]) (< -inf.0 x -1))
                 (sample (normal-dist 0 1) 5000))
   0.95)
- : (values Flonum Flonum)
0.15865322547577254
0.17942105904047267

top ← prev up next →

1	Constants and Elementary Functions
2	Flonums
3	Special Functions
4	Number Theory
5	Arbitrary-Precision Floating-Point Numbers (Bigfloats)
6	Arrays
7	Matrices and Linear Algebra
8	Statistics Functions
9	Probability Distributions
10	Stuff That Doesn’t Belong Anywhere Else

8.1	Expected Values
8.2	Running Expected Values
8.3	Correlation
8.4	Counting and Binning
8.5	Order Statistics
8.6	Simulations