8 Statistics Functions
(require math/statistics) package: math-lib
This module exports functions that compute statistics, meaning summary values for collections of samples, and functions for managing sequences of weighted or unweighted samples.
Most of the functions that compute statistics accept a sequence of nonnegative reals that correspond one-to-one with the sample values. These are used as weights, which can equivalently be read as counts, pseudocounts or unnormalized probabilities. While this makes it easy to work with weighted samples, it introduces some subtleties in bias correction. In particular, central moments must be computed without bias correction by default. See Expected Values for a discussion.
8.1 Expected Values
Functions documented in this section that compute higher central moments, such as variance, stddev and skewness, can optionally apply bias correction to their estimates. For example, when variance is given the argument #:bias #t, it multiplies the result by (/ n (- n 1)), where n is the number of samples.
With weighted samples, #:bias #t treats the weights as counts and takes n to be the sum of the weights. When bias is a real number, that number is used as n directly, which is useful when the weights are not counts (for example, when they have been rescaled).
> (variance '(1 2 3 4 4) #:bias #t)
- : Real [more precisely: Nonnegative-Real]
17/10
> (variance '(1 2 3 4) '(1 1 1 2) #:bias #t)
- : Real [more precisely: Nonnegative-Real]
17/10
> (variance '(1 2 3 4) '(1/2 1/2 1/2 1) #:bias 5)
- : Real [more precisely: Nonnegative-Real]
17/10
Because the magnitude of the bias correction for weighted samples cannot be known without user guidance, in all cases, the bias argument defaults to #f.
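The correction factor can be checked by hand. The following is a minimal sketch in Typed Racket, assuming only variance from this module; the names xs, n, uncorrected, by-hand, corrected and weighted are illustrative and not part of the library.

```racket
#lang typed/racket
(require math/statistics)

;; Five unweighted samples, so n = 5.
(define xs : (Listof Real) '(1 2 3 4 4))
(define n (length xs))

;; Default behaviour: no bias correction.
(define uncorrected (variance xs))

;; Apply the factor n/(n-1) by hand ...
(define by-hand (* uncorrected (/ n (- n 1))))

;; ... and let variance apply the same factor.
(define corrected (variance xs #:bias #t))

;; Weighted samples whose weights are not counts: pass n explicitly.
(define weighted (variance '(1 2 3 4) '(1/2 1/2 1/2 1) #:bias 5))
```

Here by-hand and corrected should agree, and weighted reproduces the third example above because the weights are the counts (1 1 1 2) scaled by 1/2 while n is still given as 5.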
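procedure
(mean xs [ws]) → Real
xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f
When ws is #f (the default), returns the sample mean of the values in xs. Otherwise, returns the weighted sample mean of the values in xs with corresponding weights ws.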
> (mean '(1 2 3 4 5))
- : Real
3
> (mean '(1 2 3 4 5) '(1 1 1 1 10.0))
- : Real
4.285714285714286
> (define d (normal-dist))
> (mean (sample d 10000))
- : Real
0.003660765502416554
> (define arr (array-strict (build-array #(5 1000) (λ (_) (sample d)))))
> (array-map mean (array->list-array arr 1))
- : (Array Real)
(array
 #[0.015065941536448331
   0.0947494841513088
   0.02799517497862471
   -0.005390713068443041
   -0.008329513486680888])
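As a cross-check on how weights enter the mean, the sketch below computes a weighted mean by hand as the sum of weight times value divided by the sum of the weights, and compares it with mean. The names xs, ws, by-hand and from-library are illustrative; exact weights are used so that the comparison does not depend on floating-point rounding.

```racket
#lang typed/racket
(require math/statistics)

(define xs : (Listof Real) '(1 2 3 4 5))
(define ws : (Listof Real) '(1 1 1 1 10))

;; Weighted mean by hand: (sum of w*x) / (sum of w).
(define by-hand
  (/ (for/sum : Real ([x (in-list xs)] [w (in-list ws)])
       (* x w))
     (apply + ws)))

;; The same quantity computed by the library.
(define from-library (mean xs ws))
```

With these inputs the hand computation is 60/14, which matches the value 4.285714285714286 printed in the second example above.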
procedure
(variance xs [ws #:bias bias]) → Nonnegative-Real
xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
procedure
(stddev xs [ws #:bias bias]) → Nonnegative-Real
xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
procedure
(skewness xs [ws #:bias bias]) → Real
xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
procedure
(kurtosis xs [ws #:bias bias]) → Nonnegative-Real
xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
procedure
(variance/mean m xs [ws #:bias bias]) → Nonnegative-Real
m : Real xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
procedure
(stddev/mean m xs [ws #:bias bias]) → Nonnegative-Real
m : Real xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
procedure
(skewness/mean m xs [ws #:bias bias]) → Real
m : Real xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
procedure
(kurtosis/mean m xs [ws #:bias bias]) → Nonnegative-Real
m : Real xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
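The /mean variants take the mean m as their first argument. The sketch below assumes that m stands in for the mean that variance would otherwise compute from xs, so that passing (mean xs) gives the same result as the plain function; the names xs, m, direct and with-mean are illustrative.

```racket
#lang typed/racket
(require math/statistics)

(define xs : (Listof Real) '(1 2 3 4 4))

;; Compute the mean once ...
(define m (mean xs))

;; ... and reuse it, instead of letting variance recompute it.
(define direct (variance xs))
(define with-mean (variance/mean m xs))
```

This form is mainly useful when the mean is already known or is shared between several moment computations.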
8.2 Running Expected Values
The statistics object allows computing the sample minimum, maximum, count, mean, variance, skewness, and excess kurtosis of a sequence of samples in O(1) space.
To use it, start with empty-statistics, call update-statistics once per sample to obtain a new statistics object, and read the current estimates with statistics-mean and similar functions.
> (let* ([s empty-statistics]
         [s (update-statistics s 1)]
         [s (update-statistics s 2)]
         [s (update-statistics s 3)]
         [s (update-statistics s 4 2)])
    (values (statistics-mean s) (statistics-stddev s #:bias #t)))
- : (values Flonum Flonum) [more precisely: (Values Flonum Nonnegative-Flonum)]
2.8
1.3038404810405297
struct
(struct statistics (min max count))
min : Flonum max : Flonum count : Nonnegative-Flonum
The min and max fields are the minimum and maximum value observed so far, and the count field is the total weight of the samples (which is the number of samples if all samples are unweighted). The remaining, hidden fields are used to compute moments, and their number and meaning may change in future releases.
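A statistics object can be threaded through a loop so that the samples never need to be held in memory together. The following is a minimal sketch using for/fold; the names xs and s are illustrative, and the list stands in for any stream of samples.

```racket
#lang typed/racket
(require math/statistics)

(define xs : (Listof Real) '(1 2 3 4 4))

;; Fold the samples into a running statistics object, one at a time.
(define s
  (for/fold ([acc : statistics empty-statistics])
            ([x (in-list xs)])
    (update-statistics acc x)))

;; Read the running estimates back out.
(define running-count (statistics-count s))
(define running-min (statistics-min s))
(define running-max (statistics-max s))
(define running-mean (statistics-mean s))
(define running-variance (statistics-variance s))
```

The running estimates are flonums, so they should agree with (mean xs) and (variance xs) only up to floating-point rounding.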
value
empty-statistics : statistics
> (statistics-min empty-statistics)
- : Flonum
+inf.0
> (statistics-max empty-statistics)
- : Flonum
-inf.0
> (statistics-range empty-statistics)
- : Flonum [more precisely: Nonnegative-Flonum]
+nan.0
> (statistics-count empty-statistics)
- : Flonum [more precisely: Nonnegative-Flonum]
0.0
> (statistics-mean empty-statistics)
- : Flonum
+nan.0
> (statistics-variance empty-statistics)
- : Flonum [more precisely: Nonnegative-Flonum]
+nan.0
> (statistics-skewness empty-statistics)
- : Flonum
+nan.0
> (statistics-kurtosis empty-statistics)
- : Flonum [more precisely: Nonnegative-Flonum]
+nan.0
procedure
(update-statistics s x [w]) → statistics
s : statistics x : Real w : Real = 1.0
procedure
(update-statistics* s xs [ws]) → statistics
s : statistics xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f
> (define s (update-statistics* empty-statistics '(1 2 3 4) '(1 1 1 2)))
> (statistics-mean s)
- : Flonum
2.8
> (statistics-stddev s #:bias #t)
- : Flonum [more precisely: Nonnegative-Flonum]
1.3038404810405297
procedure
(statistics-range s) → Nonnegative-Flonum
s : statistics
procedure
(statistics-mean s) → Flonum
s : statistics
procedure
(statistics-variance s [#:bias bias]) → Nonnegative-Flonum
s : statistics bias : (U #t #f Real) = #f
procedure
(statistics-stddev s [#:bias bias]) → Nonnegative-Flonum
s : statistics bias : (U #t #f Real) = #f
procedure
(statistics-skewness s [#:bias bias]) → Flonum
s : statistics bias : (U #t #f Real) = #f
procedure
(statistics-kurtosis s [#:bias bias]) → Nonnegative-Flonum
s : statistics bias : (U #t #f Real) = #f
See Expected Values for the meaning of the bias keyword argument.
8.3 Correlation
procedure
(covariance xs ys [ws #:bias bias]) → Real
xs : (Sequenceof Real) ys : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
procedure
(correlation xs ys [ws #:bias bias]) → Real
xs : (Sequenceof Real) ys : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
> (define xs (sample (normal-dist) 10000))
> (define ys (map (λ: ([x : Real]) (sample (normal-dist x))) xs))
> (correlation xs ys)
- : Real
0.7079561916936102
> (define ws (map (λ: ([x : Real] [y : Real]) (/ (pdf (normal-dist) y) (pdf (normal-dist x) y))) xs ys))
> (correlation xs ys (ann ws (Sequenceof Real)))
- : Real
0.08030523396049839
See Expected Values for the meaning of the bias keyword argument.
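The relationship between the two functions can be illustrated with the usual Pearson definition, in which correlation is covariance divided by the product of the two standard deviations. The sketch below assumes that definition and the default #:bias #f throughout; the data and the names xs, ys, by-hand and from-library are illustrative.

```racket
#lang typed/racket
(require math/statistics)

(define xs : (Listof Real) '(1.0 2.0 3.0 4.0 5.0))
(define ys : (Listof Real) '(2.1 3.9 6.2 8.1 9.8))

;; Pearson correlation by hand: cov(x, y) / (sd(x) * sd(y)).
(define by-hand
  (/ (covariance xs ys)
     (* (stddev xs) (stddev ys))))

;; The same quantity computed by the library.
(define from-library (correlation xs ys))
```

The two values should agree up to floating-point rounding, since any common bias factor cancels between numerator and denominator.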
procedure
(covariance/means mx my xs ys [ws #:bias bias]) → Real
mx : Real my : Real xs : (Sequenceof Real) ys : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
procedure
(correlation/means mx my xs ys [ws #:bias bias]) → Real
mx : Real my : Real xs : (Sequenceof Real) ys : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
8.4 Counting and Binning
procedure
(samples->hash xs) → (HashTable A Positive-Integer)
xs : (Sequenceof A)
(samples->hash xs ws) → (HashTable A Nonnegative-Real)
xs : (Sequenceof A) ws : (U #f (Sequenceof Real))
> (samples->hash '(1 2 3 4 4))
- : (HashTable Integer Positive-Integer)
'#hash((4 . 2) (3 . 1) (2 . 1) (1 . 1))
> (samples->hash '(1 1 2 3 4) '(1/2 1/2 1 1 2))
- : (HashTable Integer Nonnegative-Real)
'#hash((4 . 2) (3 . 1) (2 . 1) (1 . 1))
procedure
(count-samples xs) → (Values (Listof A) (Listof Positive-Integer))
xs : (Sequenceof A)
(count-samples xs ws) → (Values (Listof A) (Listof Nonnegative-Real))
xs : (Sequenceof A) ws : (U #f (Sequenceof Real))
> (count-samples '(1 2 3 4 4))
- : (values (Listof Positive-Byte) (Listof Positive-Integer))
'(1 2 3 4)
'(1 1 1 2)
> (count-samples '(1 1 2 3 4) '(1/2 1/2 1 1 2))
- : (values (Listof Positive-Byte) (Listof Nonnegative-Real))
'(1 2 3 4)
'(1 1 1 2)
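The two lists returned by count-samples can be passed directly as values and weights to the functions in Expected Values. The following is a minimal sketch, with illustrative names xs, vals, counts, from-raw and from-counts, showing that the mean of the raw samples matches the weighted mean of the distinct values.

```racket
#lang typed/racket
(require math/statistics)

(define xs : (Listof Real) '(1 2 3 4 4))

;; Collapse duplicates into distinct values and their counts.
(define-values (vals counts) (count-samples xs))

;; Mean of the raw samples ...
(define from-raw (mean xs))

;; ... and mean of the distinct values, weighted by their counts.
(define from-counts (mean vals counts))
```

The result does not depend on the order in which count-samples returns the distinct values, because each value stays paired with its own count.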
struct
(struct sample-bin (min max values weights))
min : B max : B values : (Listof A) weights : (U #f (Listof Nonnegative-Real))
procedure
(bin-samples bounds lte? xs ws) → (Listof (sample-bin A A))
bounds : (Sequenceof A) lte? : (A A -> Any) xs : (Sequenceof A) ws : (U #f (Sequenceof Real))
If n = (length bounds), then bin-samples returns at least (- n 1) bins, one for each pair of adjacent (sorted) bounds. If some values in xs are less than the smallest bound, they are grouped into a single bin in front. If some are greater than the largest bound, they are grouped into a single bin at the end.
> (bin-samples '() <= '(0 1 2 3 4 5 6))
- : (Listof (sample-bin Byte Byte))
(list (sample-bin 0 6 '(0 1 2 3 4 5 6) #f))
> (bin-samples '(3) <= '(0 1 2 3 4 5 6))
- : (Listof (sample-bin Byte Byte))
(list (sample-bin 0 3 '(0 1 2 3) #f) (sample-bin 3 6 '(4 5 6) #f))
> (bin-samples '(2 4) <= '(0 1 2 3 4 5 6))
- : (Listof (sample-bin Byte Byte))
(list
 (sample-bin 0 2 '(0 1 2) #f)
 (sample-bin 2 4 '(3 4) #f)
 (sample-bin 4 6 '(5 6) #f))
> (bin-samples '(2 4) <= '(0 1 2 3 4 5 6) '(10 20 30 40 50 60 70))
- : (Listof (sample-bin Byte Byte))
(list
 (sample-bin 0 2 '(0 1 2) '(10 20 30))
 (sample-bin 2 4 '(3 4) '(40 50))
 (sample-bin 4 6 '(5 6) '(60 70)))
If lte? is a less-than-or-equal relation, the bins represent half-open intervals (min, max] (except possibly the first, which may be closed). If lte? is a less-than relation, the bins represent half-open intervals [min, max) (except possibly the last, which may be closed). In either case, the sorts applied to bounds and xs are stable.
Because intervals used in probability measurements are normally open on the left, prefer to use less-than-or-equal relations for lte?.
If ws is #f, bin-samples returns bins with #f weights.
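A common use of bin-samples is to build histogram counts. The sketch below reuses the bounds and samples from the examples above and sums each bin with sample-bin-total; the names bins and totals are illustrative.

```racket
#lang typed/racket
(require math/statistics)

;; Three bins: up to 2, (2, 4], and above 4.
(define bins (bin-samples '(2 4) <= '(0 1 2 3 4 5 6)))

;; Total weight per bin; with #f weights this is the number of samples.
(define totals
  (for/list : (Listof Real) ([b (in-list bins)])
    (sample-bin-total b)))
```

For these inputs the bins hold (0 1 2), (3 4) and (5 6), so totals should be '(3 2 2).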
procedure
(bin-samples/key bounds lte? key xs ws) → (Listof (sample-bin A B))
bounds : (Sequenceof B) lte? : (B B -> Any) key : (A -> B) xs : (Sequenceof A) ws : (U #f (Sequenceof Real))
procedure
(sample-bin-compact bin) → (sample-bin A B)
bin : (sample-bin A B)
> (sample-bin-compact (sample-bin 1 4 '(1 2 3 4 4) #f))
- : (sample-bin Positive-Byte Positive-Byte)
(sample-bin 1 4 '(1 2 3 4) '(1 1 1 2))
procedure
(sample-bin-total bin) → Nonnegative-Real
bin : (sample-bin A B)
> (sample-bin-total (sample-bin 1 4 '(1 2 3 4 4) #f))
- : Real [more precisely: Nonnegative-Real]
5
> (sample-bin-total (sample-bin-compact (sample-bin 1 4 '(1 2 3 4 4) #f)))
- : Real [more precisely: Nonnegative-Real]
5
8.5 Order Statistics
procedure
(sort-samples lt? xs) → (Listof A)
lt? : (A A -> Any) xs : (Sequenceof A)
(sort-samples lt? xs ws) → (Values (Listof A) (Listof Nonnegative-Real))
lt? : (A A -> Any) xs : (Sequenceof A) ws : (U #f (Sequenceof Real))
> (sort-samples < '(5 2 3 1))
- : (Listof Positive-Byte)
'(1 2 3 5)
> (sort-samples < '(5 2 3 1) '(50 20 30 10))
- : (values (Listof Positive-Byte) (Listof Nonnegative-Real))
'(1 2 3 5)
'(10 20 30 50)
> (sort-samples < '(5 2 3 1) #f)
- : (values (Listof Positive-Byte) (Listof Nonnegative-Real))
'(1 2 3 5)
'(1 1 1 1)
procedure
(median lt? xs [ws]) → A
lt? : (A A -> Any) xs : (Sequenceof A) ws : (U #f (Sequenceof Real)) = #f
procedure
(quantile p lt? xs [ws]) → A
p : Real lt? : (A A -> Any) xs : (Sequenceof A) ws : (U #f (Sequenceof Real)) = #f
> (quantile 0 < '(1 3 5))
- : Integer [more precisely: Positive-Byte]
1
> (quantile 0.5 < '(1 2 3 4))
- : Integer [more precisely: Positive-Byte]
2
> (quantile 0.5 < '(1 2 3 4) '(0.25 0.2 0.2 0.35))
- : Integer [more precisely: Positive-Byte]
3
If p = 0, quantile returns the smallest element of xs under the ordering relation lt?. If p = 1, it returns the largest element.
For weighted samples, quantile sorts xs and ws together (using sort-samples), then finds the least x for which the proportion of its cumulative weight is greater than or equal to p.
For unweighted samples, quantile uses the quickselect algorithm to find the element that would be at index (ceiling (- (* p n) 1)) if xs were sorted, where n is the length of xs.
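The two selection rules above can be written out directly. The following is a sketch for real-valued samples only, not the library's implementation: it sorts the whole list instead of using quickselect, and clamping the index at 0 when p = 0 is an assumption of the sketch. The function names quantile-by-index and quantile-by-weight are hypothetical.

```racket
#lang typed/racket
(require math/statistics)

;; Unweighted rule: the element at index (ceiling (- (* p n) 1))
;; of the sorted samples. The index is clamped at 0 for p = 0.
(: quantile-by-index (Exact-Rational (Listof Real) -> Real))
(define (quantile-by-index p xs)
  (define sorted (sort xs <))
  (define n (length sorted))
  (list-ref sorted (max 0 (ceiling (- (* p n) 1)))))

;; Weighted rule: sort values and weights together, then return the
;; least x whose cumulative share of the total weight is at least p.
(: quantile-by-weight (Real (Listof Real) (Listof Real) -> Real))
(define (quantile-by-weight p xs ws)
  (define-values (sxs sws) (sort-samples < xs ws))
  (define total (apply + sws))
  (let loop : Real ([xs : (Listof Real) sxs]
                    [ws : (Listof Real) sws]
                    [acc : Real 0])
    (cond
      [(or (null? xs) (null? ws))
       (error 'quantile-by-weight "no samples")]
      [else
       (let ([acc* (+ acc (car ws))])
         (if (or (null? (cdr xs)) (>= (/ acc* total) p))
             (car xs)
             (loop (cdr xs) (cdr ws) acc*)))])))

(define unweighted (quantile-by-index 1/2 '(4 1 3 2)))
(define weighted (quantile-by-weight 0.5 '(1 2 3 4) '(0.25 0.2 0.2 0.35)))
```

For the inputs shown, unweighted is 2 (index 1 of the sorted list) and weighted is 3 (cumulative shares 0.25, 0.45, 0.65), matching the quantile examples above.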
procedure
(absdev xs [ws]) → Nonnegative-Real
xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f
procedure
(absdev/median median xs [ws]) → Nonnegative-Real
median : Real xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f
procedure
(hpd-interval lt? δ p xs [ws]) → (Values A A)
lt? : (A A -> Any) δ : (A A -> Real) p : Real xs : (Sequenceof A) ws : (U #f (Sequenceof Real)) = #f
procedure
(hpd-interval/sorted δ p xs [ws]) → (Values A A)
δ : (A A -> Real) p : Real xs : (Sequenceof A) ws : (U #f (Sequenceof Real)) = #f
To compute an HPD interval from sorted samples, use hpd-interval/sorted.
You almost certainly want to use real-hpd-interval or real-hpd-interval/sorted instead, which are defined in terms of these.
procedure
(real-hpd-interval p xs [ws]) → (Values Real Real)
p : Real xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f
procedure
(real-hpd-interval/sorted p xs [ws]) → (Values Real Real)
p : Real xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f
> (define beta32 (beta-dist 3 2))
> (real-dist-hpd-interval beta32 0.8)
- : (values Flonum Flonum)
0.36542991742846176
0.8939657937826784
> (real-hpd-interval 0.8 (sample beta32 10000))
- : (values Real Real)
0.37311549017513157
0.8996687451858313
8.6 Simulations
The functions in this section support Monte Carlo simulation; for example, quantifying uncertainty about statistics estimated from samples.
procedure
(mc-variance xs [ws #:bias bias]) → Nonnegative-Real
xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
procedure
(mc-stddev xs [ws #:bias bias]) → Nonnegative-Real
xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
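If xs are samples of a random variable, then (mean xs ws) is itself a sample of a random variable. The values (mc-variance xs ws #:bias bias) and (mc-stddev xs ws #:bias bias) estimate the variance and standard deviation of that mean.
> (mc-stddev (sample (normal-dist 0 1) 1000))
- : Real [more precisely: Nonnegative-Real]
0.030788138085986416
> (stddev (for/list : (Listof Real) ([_ (in-range 100)]) (mean (sample (normal-dist 0 1) 1000))))
- : Real [more precisely: Nonnegative-Real]
0.031583191347587164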
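For unweighted samples, the textbook standard error of the mean is the sample standard deviation divided by the square root of the number of samples. The sketch below assumes that mc-stddev follows this relation with the default #:bias #f; that is an assumption about the library, not something stated above. The data and the names xs, n, by-hand and from-library are illustrative.

```racket
#lang typed/racket
(require math/statistics)

;; Eight samples with mean 5 and (uncorrected) standard deviation 2.
(define xs : (Listof Real) '(2.0 4.0 4.0 4.0 5.0 5.0 7.0 9.0))
(define n (length xs))

;; Standard error of the mean by hand: sd / sqrt(n).
(define by-hand (/ (stddev xs) (sqrt n)))

;; The library's estimate of the same quantity.
(define from-library (mc-stddev xs))
```

This is consistent with the example above, where 1000 standard normal samples give a value close to (/ 1 (sqrt 1000)).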
procedure
(mc-stddev/mean m xs [ws #:bias bias]) → Nonnegative-Real
m : Real xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
procedure
(mc-variance/mean m xs [ws #:bias bias]) → Nonnegative-Real
m : Real xs : (Sequenceof Real) ws : (U #f (Sequenceof Real)) = #f bias : (U #t #f Real) = #f
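In the following example, indicator converts a predicate into a function that returns 1 for samples that satisfy it and 0 otherwise, so the mean of the results estimates a probability.
> (fl (mean (map (indicator (λ ([x : Real]) (< -inf.0 x -1))) (sample (normal-dist 0 1) 5000))))
- : Flonum
0.158
> (real-dist-prob (normal-dist 0 1) -inf.0 -1)
- : Flonum
0.15865525393145705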
procedure
(mc-probability pred? xs [ws]) → Nonnegative-Real
pred? : (A -> Any) xs : (Sequenceof A) ws : (U #f (Sequenceof Real)) = #f
> (fl (mc-probability (λ ([x : Real]) (< -inf.0 x -1)) (sample (normal-dist 0 1) 5000)))
- : Flonum [more precisely: Nonnegative-Flonum]
0.16
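For unweighted samples, the estimate can be compared with a direct count. The sketch below assumes that mc-probability returns the proportion of samples for which pred? holds; the fixed sample list and the names xs, below-minus-one?, by-hand and from-library are illustrative.

```racket
#lang typed/racket
(require math/statistics)

;; A small fixed sample, so the result is reproducible.
(define xs : (Listof Real) '(-2.0 -1.5 0.0 0.5 1.0))

(: below-minus-one? (Real -> Boolean))
(define (below-minus-one? x) (< x -1))

;; Proportion by hand: matching samples / all samples.
(define by-hand (/ (count below-minus-one? xs) (length xs)))

;; The library's estimate.
(define from-library (mc-probability below-minus-one? xs))
```

Two of the five samples are below -1, so the hand-computed proportion is 2/5.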
procedure
(mc-prob-dist pred? xs [ws]) → Beta-Dist
pred? : (A -> Any) xs : (Sequenceof A) ws : (U #f (Sequenceof Real)) = #f
> (real-dist-hpd-interval (mc-prob-dist (λ ([x : Real]) (< -inf.0 x -1)) (sample (normal-dist 0 1) 5000)) 0.95)
- : (values Flonum Flonum)
0.15865322547577254
0.17942105904047267