Statistics is the branch of mathematics dealing with the collection and analysis of data. As we are increasingly overwhelmed with data, it is important to be able to analyze it, and so statistics should have a prominent role in any programming language's standard libraries. Broadly speaking, statistics can be divided into descriptive statistics and inferential statistics. This library provides basic utilities for both, covering many common use cases while serving as a basis for more advanced uses.
This library takes a holistic view of a distribution, with procedures working on both theoretical and empirical distributions.

Returns #t iff x is a distribution.
Creates a new theoretical distribution. name should be a symbol and is used for debugging. The probability density function pdf, cumulative distribution function cdf, and summary statistics are required. random is optional and can be inferred from the cdf, but faster implementations are usually available.

Creates a new theoretical discrete distribution, similar to pure-distribution.

Creates a new distribution for the sampled data in seq.

Equivalent to distribution, but records that seq represents the full population rather than a subsample.

Creates a new running sample distribution to which new elements can be added.

Returns a new distribution from summary statistics, without a sample. Takes keyword arguments in the style of (chibi optional), with the following values:
- name: used for debugging
- size: the population size
- mean: the mean (average) value
- median: the median (middle) value
- mode: the mode (most common) value
- variance: the variance
- skew: the skew
- kurtosis: the kurtosis
- minimum: the minimum value
- maximum: the maximum value
- discrete?: true iff this is a discrete distribution
- population?: true iff this is a population (as opposed to sample)
(mean (summary-distribution 'mean: 0.5 'variance: 2.5))
=> 0.5
Returns #t iff dist1 is a theoretical distribution (i.e. derived from a pdf).

Returns #t iff dist1 is a sample distribution to which new elements can be added.

Adds a new element elt to the sample distribution dist.
Returns the pdf if dist is a pure distribution, #f otherwise.

Returns the cdf if dist is a pure distribution, #f otherwise.
Returns the underlying values, or sampled representatives, of dist, if available. If you want a sample from any distribution, use random-sample.

Returns a new normal (Gaussian) distribution with the given mean and variance. Also known as a bell curve, many things are normally distributed, and the Central Limit Theorem says that the sum of independent random variables tends towards normal.

Returns a new continuous uniform distribution over [a, b]. Every value in the range is equally likely.

Returns a new discrete uniform distribution over [a, b], in step intervals. Every value in the range is equally likely.

Returns a new Poisson distribution, a discrete distribution measuring the probability that a certain number of events occur within a certain period of time.

Returns a new Kolmogorov distribution, the distribution of the supremum of the Brownian bridge over [0, 1]. This is used in the Kolmogorov-Smirnov test.

Returns a new binomial distribution, a discrete distribution of the number of successes over n Bernoulli trials with probability p.

Returns a new F-distribution, aka the Fisher–Snedecor distribution, used in the F-test for ANOVA.
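Since pure distributions carry their summary statistics, the constructors can be sanity-checked against theory. For example, assuming the argument orders shown above, we would expect the following (illustrative expectations, not verified output):

(mean (binomial-distribution 100 0.5))  ; expected: 50, the theoretical n*p
(variance (uniform-distribution 0 1))   ; expected: 1/12, i.e. (b-a)^2/12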
Descriptive statistics summarize some aspect of a distribution. The following procedures all take either a distribution or, for convenience, a list or vector as an implicit sample distribution (as constructed by the distribution procedure).

If a list or vector is passed, an optional getter argument may be supplied to extract the value of each element in the sequence. This allows idioms similar to dataframes in R and numpy, where the sequence elements could be vectors or records containing multiple fields.

If a distribution is passed, we may not have complete information for that distribution, so any of these procedures may return #f.
Returns the size of the distribution population, which may be
+inf.0 for pure distributions, or #f if unknown.
(size '(1 2 3))
=> 3
(size (normal-distribution 0 1))
=> +inf.0
mean, geometric-mean and harmonic-mean are the three classical Pythagorean means. The mean, median and mode can be unified as:
(mode xi) => (argmin s (sum (expt (abs (- xi s)) 0)))
(median xi) => (argmin s (sum (expt (abs (- xi s)) 1)))
(mean xi) => (argmin s (sum (expt (abs (- xi s)) 2)))
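This unification can be checked numerically. Below is a hypothetical sketch (not part of this library) which brute-forces the minimizing s over a grid; with exponent 2 it lands on the mean, and with exponent 1 on a median:

(define (argmin-power xs p lo hi step)
  ;; Return the s in {lo, lo+step, ..., hi} minimizing
  ;; (sum (expt (abs (- xi s)) p)).
  (let loop ((s lo) (best lo) (best-cost +inf.0))
    (if (> s hi)
        best
        (let ((cost (apply + (map (lambda (x) (expt (abs (- x s)) p)) xs))))
          (if (< cost best-cost)
              (loop (+ s step) s cost)
              (loop (+ s step) best best-cost))))))

(argmin-power '(1 2 3 7) 2 0.0 10.0 0.01)  ; => ~3.25, the mean
(argmin-power '(1 2 3 7) 1 0.0 10.0 0.01)  ; => ~2.0; any s in [2, 3] minimizes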
Returns the arithmetic-mean, aka the first moment, what people
typically think of as the average. As with other procedures in
this library, the computation should be numerically stable.
(mean '(1 2 3))
=> 2
(mean '((a . 1) (b . 2) (c . 3)) cdr)
=> 2
(mean (make-list 1000000 #i1/3))
=> 0.3333333333333333
(mean (normal-distribution 2.4 7))
=> 2.4
geometric-mean
is often used for numbers whose values
are exponential in nature, such as population growth or financial
investments.
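For example, the geometric mean of the growth factors 1, 2 and 4 is the cube root of their product, 8 (the result may be exact or inexact depending on the implementation):

(geometric-mean '(1 2 4))  ; => 2, the cube root of 8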
Returns the harmonic mean of the distribution, the reciprocal of the arithmetic mean of the reciprocals of the values. The harmonic mean is often used in machine learning as an aggregated performance score over other metrics: the F1 score is the harmonic mean of precision and recall. It penalizes a single low score harder than the arithmetic or geometric means would.
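For example, a classifier with precision 0.9 but recall 0.5 has an arithmetic mean of 0.7, but an F1 score weighted down towards the weaker metric:

(harmonic-mean '(0.9 0.5))  ; => ~0.643, i.e. (/ (* 2 0.9 0.5) (+ 0.9 0.5))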
Returns the quadratic mean of the distribution, aka the root mean square (RMS), which is the square root of the mean of the squares of the values (the generalized mean with exponent 2). The RMS of a deviation or error (RMSE) is frequently used in machine learning to evaluate the fitness of a model.
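For example, assuming the procedure is named quadratic-mean after the description above, the RMS of 1 and 7 is noticeably larger than their arithmetic mean of 4, because the larger value dominates:

(quadratic-mean '(1 7))  ; => 5, the square root of (/ (+ 1 49) 2)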
Returns the median of a distribution, the 50th percentile or
"middle" value if the elements were sorted.
Returns the mode, or most frequently occurring element, of the distribution.
Returns the variance, aka the second central moment, of a
distribution, defined as the expectation of the squared deviation
from the mean. Variance is used extensively in descriptive and
inferential statistics.
Returns the standard deviation, the square root of the variance. Unlike the variance this is expressed in the same units as the data, and can thus be easier to reason about. In particular, the 68-95-99.7 rule states that 68% of the population is within 1 stdev of the mean, 95% within 2 stdevs, and 99.7% within 3 stdevs.
A simple form of outlier removal is to remove data points more
than 3 (or 4) standard deviations from the mean.
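A minimal sketch of such a filter (a hypothetical helper, not part of this library), using the mean and standard-deviation procedures above along with filter from SRFI 1:

(define (remove-outliers ls)
  ;; Keep only the points within 3 standard deviations of the mean.
  (let ((mu (mean ls))
        (sigma (standard-deviation ls)))
    (filter (lambda (x) (<= (abs (- x mu)) (* 3 sigma))) ls)))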
An alias for standard-deviation
.
Returns the coefficient of variation, aka the relative standard deviation (RSD), of dist: the ratio of the standard-deviation to the mean.
Returns the covariance, a measure of the joint variability, of the two (non-pure) distributions dist1 and dist2. The sign of the covariance shows the tendency of the linear relationship between the variables, though the magnitude is hard to interpret. For a better understanding of the strength of the relation, use the pearson-correlation.
Returns the Pearson correlation coefficient (PCC), aka Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), or the bivariate correlation. This is a value in [-1, 1] showing the correlation between the two distributions dist1 and dist2, with positive values indicating a positive linear correlation, negative values indicating a negative linear correlation, and values close to zero indicating no correlation.
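For example, values related by an exact linear function give r = 1 (up to rounding), and reversing the direction of the relationship flips the sign:

(pearson-correlation '(1 2 3 4) '(2 4 6 8))  ; => ~1.0
(pearson-correlation '(1 2 3 4) '(8 6 4 2))  ; => ~-1.0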
Returns Spearman's rank correlation coefficient, aka Spearman's ρ. This is the pearson-correlation between the rank variables of dist1 and dist2, indicating how well their relationship can be described by a monotonic function. For example, y = x³ over positive x is nonlinear but strictly monotonic, so the ranks agree exactly and ρ is 1 even though Pearson's r is less than 1.
Returns the skewness, aka the third standardized moment, of dist, which indicates the lopsidedness of the distribution. For a unimodal distribution, negative skew indicates the tail is to the left of the mode, and positive skew indicates it is to the right. Zero skew indicates the tails balance out, though the distribution may not be symmetric.
Returns the kurtosis, aka the fourth standardized moment, of dist, indicating the heaviness of the tails. A kurtosis less than 3 is called "platykurtic", suggesting the distribution is flat or has fewer extremes than a normal distribution, whereas a kurtosis greater than 3 is called "leptokurtic", and has more outliers than a normal distribution.

Returns the excess kurtosis, aka Pearson's kurtosis, of dist. The kurtosis of any normal distribution is 3, so this is defined as the kurtosis minus 3.
Returns the maximum value of dist.

Returns the minimum value of dist.

Returns the nth percentile of dist. If dist is non-pure and the rank doesn't fall on an exact position, linear interpolation is used between adjacent ranks, which produces better results for small samples.
Ranks a sorted list (x1 x2 ... xn), returning ((rank1 . x1) (rank2 . x2) ... (rankn . xn)), where the ranks are 1..n if all xi are unique, and otherwise the average rank among equal elements. For example, the sorted list (10 20 20 30) would rank as ((1 . 10) (5/2 . 20) (5/2 . 20) (4 . 30)), the tied 20s sharing the average of ranks 2 and 3.

Returns a new list with the elements replaced by their rank.
Returns a random flip of a weighted coin, i.e. runs a Bernoulli trial, returning #t with probability p (default 0.5) and #f otherwise.

Returns a random value from the distribution dist.

Returns a sequence of n random values from the distribution dist. If dist is a list, or a distribution based on one, the result is a list; otherwise it's a vector.

Like random-sample, but returns a sample with replacement.
Computes the minimum sample size for a population num-total, with the given confidence, error bound requirements, and population proportion:
- num-total: total population size (+inf.0 uses the normal approximation to binomial dist, otherwise we use the normal approximation to the hypergeometric dist)
- conf: desired level of confidence, default 1.96 (95%)
- err: desired error bounds, default 3%
- p: population proportion, default 0.5
(min-sample-size +inf.0 2.75 0.01)
=> 18906.25
(min-sample-size 73015 2.75 0.01)
=> 15017.799056791075
Performs a z-test under the assumption that sample can be approximated by a normal distribution. Returns the probability that sample is consistent with dist, by default one-tailed. This is more convenient than the t-test if the population size is large or the variance is known.

tails must be 1 (the default) for a one-tailed test, or 2 for a two-tailed test. If you are testing whether a mean is greater (or less) than the population, you want a one-tailed test, but if you are just testing whether it is different (in either direction) you want a two-tailed test.
Example: We claim it takes Schemers longer to fix reported bugs
because they get distracted finding the perfect solution to the
problem. Querying the
Github API
indicates that the overall mean time to close an issue is 7.8 days
with a variance of 23.1 (made up numbers). Restricting this to
just the 5,197 Scheme repositories yields a mean time to close of
8.3 days. Is our claim valid? Assuming the distribution for the
time to fix bugs is normal, we can run a z-test:
(z-test (summary-distribution 'mean: 8.3 'size: 5197)
(normal-distribution 7.8 23.1))
=> 0.05933335376636473
Returns #t iff the z-test for sample and dist meets the given alpha level, default 5%.
(z-test? (summary-distribution 'mean: 8.3 'size: 5197)
(normal-distribution 7.8 23.1))
=> #f
Returns the underlying statistic for the z-test, indicating how well dist approximates sample.
(z-statistic (summary-distribution 'mean: 8.3 'size: 5197)
(normal-distribution 7.8 23.1))
=> 1.5603943993695748
Performs a t-test of whether dist1 and dist2 have the same mean, under the assumption that the underlying distributions are normal. Returns the p-value. tails defaults to 1 and has the same semantics as in the z-test.
Example: You made a micro-optimization to your code but the
benchmark results are inconsistent, sometimes even being slower
than before. The average of the times is faster, but you're not
sure if this is a real improvement or just noise. Assuming the
noise is normally distributed, we can run a t-test
on the
times:
(let ((before '(30.02 29.99 30.11 29.97 30.01 29.99))
(after '(29.89 29.93 29.72 29.98 30.02 29.98)))
(t-test after before))
=> 0.04538666186968663
(gain (mean '(30.02 29.99 30.11 29.97 30.01 29.99))
(mean '(29.89 29.93 29.72 29.98 30.02 29.98)))
=> -0.0031650841246043267
Returns #t iff the t-test for dist1 and dist2 meets the given alpha level, default 5%.
Example: Given a random sample of Schemers' IQs, can we assume they are smarter than the overall population?
(t-test? '(120 115 95 135) (normal-distribution 100 225))
=> #f
Returns the underlying statistic for the t-test, indicating how well dist1 matches dist2.
Returns the degrees of freedom for the t-test comparing dist1 and dist2.
(t-df '(120 115 95 135) (normal-distribution 100 225))
=> 3
Performs a chi^2 test comparing the observed and expected distributions. Returns the p-value.
Example: We want to know if our new PRNG is generating evenly
distributed die rolls. We can test this by comparing a histogram
of the values generated for 1 through 6, against the expected
uniform distribution:
(let ((rolls-histogram '#(5 8 9 8 10 20)))
(chi^2-test rolls-histogram (discrete-uniform-distribution 1 6)))
=> 0.019905220334775264
Returns #t iff the p-value from the chi^2-test is less than alpha (default 0.05), i.e. iff the two distributions differ.
(let ((rolls-histogram '#(5 8 9 8 10 20)))
(chi^2-test? rolls-histogram (discrete-uniform-distribution 1 6)))
=> #t
Returns the underlying statistic for the chi^2-test.
(let ((rolls-histogram '#(5 8 9 8 10 20)))
(chi^2-statistic rolls-histogram (discrete-uniform-distribution 1 6)))
=> 67/5
Returns the degrees of freedom for the chi^2-test, i.e. one less than the number of categories.

(chi^2-df '#(5 8 9 8 10 20))
=> 5
Performs a G-test comparing the observed and expected distributions, which performs better than the chi^2 test in many cases. Returns the p-value.
Example: Your university claimed equal enrollment for men and
women in Computer Science, but when you show up the first day to
COMP 101 there are only 42 girls out of a class of 100. Is the
42:58 ratio significantly different from the expected 50:50?
(g-test '(42 58) '(50 50))
=> 0.10883642860717235
Returns #t iff the p-value from the g-test is less than alpha (default 0.05), i.e. iff the two distributions differ.
(g-test? '(42 58) '(50 50))
=> #f
Returns the underlying statistic for the g-test.
(g-statistic '(42 58) '(50 50))
=> 2.571036073558357
Performs a binomial test of whether the observed distribution matches the expected one. This is more accurate than the chi^2-test and g-test when there are few values. Returns the p-value.
Example: A mobile game you play has an option to spin a virtual
capsule machine as an in-game purchase, with each spin giving you
a new character. They advertise a 3% chance of getting a rare
character. You pay for 100 spins and get only 1 rare character.
We can run a binomial-test
to determine if we believe
their advertised probability:
(binomial-test (bernoulli-trials 1 100) (mean-distribution 0.03))
=> 0.04755250792540563
Returns #t iff the p-value from the binomial-test is less than alpha, default 0.05.
(binomial-test? (bernoulli-trials 1 100) (mean-distribution 0.03))
=> #t
Performs a Kolmogorov-Smirnov test comparing dist1 and dist2. This is frequently used to determine if two samples come from the same distribution, without needing to make assumptions about that distribution (e.g. it need not be normal).

Returns #t iff the kolmogorov-smirnoff-test value between dist1 and dist2 is greater than the kolmogorov-smirnoff-critical value.

Returns the underlying statistic for the kolmogorov-smirnoff-test.

Performs a Wilcoxon signed-rank test to determine whether the mean ranks of dist1 and dist2 (which must be paired, i.e. have their elements in aligned order) differ. It can be used as an alternative to the t-test when the distributions can't be assumed to be normal. Returns #t iff we fail to reject the null hypothesis, i.e. we assume the distributions are the same.

Returns the underlying statistic for the wilcoxon-test?.

Returns the critical value for comparison against the statistic in the wilcoxon-test?.

Performs an f-test of equality of variance of the distributions. Returns the p-value.

Returns #t iff the p-value of the f-test is less than alpha.

Returns the underlying statistic for the f-test.
Returns the lower bound of the Wilson score interval, a binomial proportion confidence interval. This interval gives a better understanding of the probability of success from a series of Bernoulli trials than the raw ratio, which can be misleading for a small number of trials. For example, a single success in a single trial has a raw success ratio of 1.0, but a 95% Wilson lower bound of only about 0.21.

Returns the upper bound of the Wilson score interval.

Returns the offset used in wilson-lower-bound and wilson-upper-bound.

A similarity measure, also known as the Jaccard similarity coefficient; returns the size of the intersection of seq1 and seq2 divided by the size of their union. The range is [0, 1], with 0 indicating no intersection and 1 indicating the sets are equal. Equality is determined by the comparator cmp, defaulting to an equal? comparator.

Returns the dissimilarity between seq1 and seq2; this is 1 minus the Jaccard index.

The Sørensen-Dice coefficient, aka Czekanowski's binary index or the Zijdenbos similarity index. Another similarity index, this is essentially the set equivalent of the harmonic-mean. The corresponding distance function (the coefficient subtracted from 1), unlike the Jaccard distance, is not a proper distance metric because it does not satisfy the triangle inequality.
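For example, with the sets A = {a}, B = {b} and C = {a, b}, the Dice distances are d(A, B) = 1 and d(A, C) = d(C, B) = 1/3 (the coefficient for A and C is 2*1/(1+2) = 2/3), so d(A, B) > d(A, C) + d(C, B), violating the triangle inequality.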
Computes a simple linear regression using ordinary least squares on the paired sequences x and y, returning two values: the slope and the y-intercept of the line best fitting the points (the example below receives them in that order). This is the univariate case, and so is simply called fit-line to distinguish it from the multivariate linear regression used in machine learning.
Example: The number of SRFIs has been increasing at a steady rate
in the past 20 years. Assuming the growth is linear, how many
final SRFIs do we expect in 2030?
(let ((srfis '((1999 9) (2000 16) (2001 19) (2002 30) (2003 34)
(2004 42) (2005 58) (2006 62) (2007 67) (2008 71)
(2009 72) (2010 73) (2012 74) (2013 82) (2014 85)
(2015 92) (2016 108) (2017 119) (2018 124)
(2019 138) (2020 151))))
(call-with-values (lambda () (fit-line (map car srfis) (map cadr srfis)))
(lambda (m b)
(let ((f (lambda (x) (inexact (+ (* m x) b)))))
(f 2030)))))
=> 193.28006039038067
Of course, if the growth is not actually linear, fit-line is a bad tool.
Returns x raised to the 3rd power.

Returns x raised to the 4th power.

Returns the sign of x, either 0, 1 or -1.

Returns n! = 1*2*...*n.

Returns the log of the gamma function applied to x, i.e. just the first value of flloggamma from SRFI 144.

Returns the regularized lower incomplete gamma function, P(s, x) = γ(s, x)/Γ(s).

The probability density function for the beta distribution.

Returns the incomplete beta function, Β(a, b).

Returns the quantile function, or inverse distribution function, for the given cdf, with statistics used to determine the domain.

Utility to return a random value matching a given cdf, with statistics used to determine the domain.

Returns a procedure of one argument which approximates the derivative of f.

Also known as the binomial coefficient, returns the number of combinations of k elements from a set of size n.

Returns the number of permutations of k ordered elements from a set of size n.

Returns the sum of the elements of seq, which must be real. This and all operations involving sums are numerically stable. The current implementation uses Neumaier's variant of Kahan summation.
Examples:
(sum (make-list 10000 (acos -1)))
=> 31415.926535897932
(sum '#(1 1e100 1 -1e100))
=> 2.0
Like sum but sums the squares of the elements.

Like sum but sums the inverses (/) of the elements.

Like sum but sums the logs of the elements.

A generalized stable sum. Applies proc to the sequences and returns the sum of all the results. It is an error if the sequences don't all have the same length.

Creates a new running-sum object initialized to init, defaulting to 0. A running sum lets you compute a numerically stable sum incrementally, so that it can be updated over time or fed from a generator or other data structure.
Returns #t iff x is a running-sum object.

Returns the current value of the running-sum rs.

Adds x to the running-sum rs.

Subtracts x from the running-sum rs.

Returns the product of the elements in seq. Multiplication is stable, so there is no need for error compensation.

Returns the relative change from x to y, useful for answering what percentage the gain (or loss) was.
Examples:
(gain 50 60)
=> 1/5
(gain 2/3 1)
=> 1/2
(gain .75 .50)
=> -0.3333333333333333
Creates a new reservoir sampler holding up to max-size elements. As new elements are added beyond max-size, existing elements are probabilistically replaced such that the sample has the same distribution as if random-sample was run on the full population.

Probabilistically adds elt to the reservoir rv.

Returns a vector containing the current elements of the reservoir rv, which may share space with the underlying data structure.

Returns a new empty histogram for describing a distribution using up to max-bins bins.

Returns #t iff x is a histogram.

Adds an element to a histogram.

Returns a new histogram summarizing the elements of the list ls.

Returns a new histogram summarizing the elements of the vector vec.

Returns the smallest value seen by the histogram hist.

Returns the largest value seen by the histogram hist.

Returns the xth quantile ([0, 1]) of the histogram hist.

Returns the xth decile ([0, 10]) of the histogram hist.

Simple utility to plot a histogram with characters in a fixed-width font.

Computes the cumulative distribution function of hist at the point x.
- The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Donald E. Knuth (1997)
- An Introduction to R, R Core Team (1999-2018)
- XLISP-STAT, Luke Tierney (1989)
- Apache Commons Math - Statistics, The Apache Software Foundation (2003-2016)
- Evaluating Kolmogorov’s Distribution, George Marsaglia et al. (2003)
- SRFI 144 - Flonums, John Cowan and Will Clinger (2017)
- Rounding Error Analysis of Some Methods for Summing Finite Sums, Arnold Neumaier (1974)
- A Streaming Parallel Decision Tree Algorithm, Yael Ben-Haim et al. (2010)
- Modes, Medians and Means: A Unifying Perspective, John Myles White (2013)
- On Average, You’re Using the Wrong Average, Daniel McNichol (2018)