Statistics is the branch of mathematics dealing with the collection and analysis of data. As we are increasingly overwhelmed with data it is important to be able to analyze that data, and so statistics should have a prominant role in any programming language's standard libraries. Broadly speaking statistics can be divided into descriptive statistics and inferential statistics. This library provides basic utilities for both of these covering many common use cases, while serving as a basis for more advanced uses.

This library takes a holistic view of a distribution, with procedures working on both theoretical and empirical distributions. Returns`#t`

iff x is a distribution.
Creates a new theoretical distribution. `name`

should be a
symbol and is used for debugging. The probability density
function `pdf`

, cumulative distribution function `cdf`

,
and summary `statistics`

are required. `random`

is
optionally and can be inferred from the `cdf`

but faster
implementations are usually available.Creates a new theoretical discrete distribution, similar to
`pure-distribution`

.Creates a new distribution for the sampled data in `seq`

.Equivalent to `distribution`

, but records that `seq`

represents the full population rather than a subsample.Creates a new running sample distribution to which new elements
can be added.Returns a new distribution from summmary statistics, without a
sample. Takes keyword arguments in the style of
`(chibi optional)`

, with the following values:
- name: used for debugging
- size: the population size
- mean: the mean (average) value
- median: the median (middle) value
- mode: the mode (most common) value
- variance: the variance
- skew: the skew
- kurtosis: the kurtosis
- minimum: the minimum value
- maximum: the maximum value
- discrete?: true iff this is a discrete distribution
- population?: true iff this is a population (as opposed to sample)

`(mean (summary-distribution 'mean: 0.5 'variance: 2.5))`

`=> 0.5`

`#t`

iff `dist1`

is a theoretical distribution
(is derived from a pdf).Returns `#t`

iff `dist1`

is a sample distribution to
which new elements can be added.Adds a new element `elt`

to the sample distribution `dist`

.
Returns the pdf if `dist`

is pure a distribution, `#f`

otherwise.
Returns the cdf if `dist`

is pure a distribution, `#f`

otherwise.
Returns the underlying values, or sampled representatives, of
`dist`

, if available. If you want a sample from any
distribution, use `random-sample`

.Returns a new normal (gaussian) distribution, with the given
`mean`

and `variance`

. Also know as a bell curve, many
things are normally distributed, and the Central Limit THeorem
says that the sum of independent random variables tends towards
normal.Returns a new continuous uniform distribution over [`a`

,
`b`

]. Every value in the range is equally likely.Returns a new discrete uniform distribution over [`a`

,
`b`

], in `step`

intervals. Every value in the range is
equally likely.Returns a new poisson distribution, a discrete distribution
measuring the probability that a certain number of events occur
within a certain period of time.Returns a new kolmogorov distribution, the distribution of the
suprememum of the Brownian bridge over [0, 1]. This is used in
the Kolmogorov-Smirnoff test.Returns a new binomial distribution, a discrete distribution of
the number of successes over `n`

Bernoulli trials with
probability `p`

.Returns a new F-distribution, aka the Fisher–Snedecor
distribution, used in the F-test for ANOVA.
Descriptive statistics summarize some aspect of a distribution.
The following procedures all take either a distribution, or for
convenience a list or vector as an implicit sample distribution (as
constructed by the `distribution`

procedure).
If a list of vector is passed, an optional getter argument is
allowed to return the value of each element in the sequence. This
is to allow idioms similar to dataframes in R and numpy, where the
sequence elements could be vectors or records containing multiple
fields.
If a distribution is passed, we may not have complete information
for that distribution, so any of these may return `#f`

.
Returns the size of the distribution population, which may be
+inf.0 for pure distributions, or #f if unknown.
`(size '(1 2 3))`

`=> 3`

`(size (normal-distribution 0 1))`

`=> +inf.0`

`mean`

,
`geometric-mean`

and `harmonic-mean`

are the three
classical Pythagorean means. The `mean`

, `median`

and `mode`

can be unified as:
```
(mode xi) => (argmin s (sum (expt (abs (- xi s)) 0)))
(median xi) => (argmin s (sum (expt (abs (- xi s)) 1)))
(mean xi) => (argmin s (sum (expt (abs (- xi s)) 2)))
```

Returns the arithmetic-mean, aka the first moment, what people
typically think of as the average. As with other procedures in
this library, the computation should be numerically stable.
`(mean '(1 2 3))`

`=> 2`

`(mean '((a . 1) (b . 2) (c . 3)) cdr)`

`=> 2`

`(mean (make-list 1000000 #i1/3))`

`=> 0.3333333333333333`

`(mean (normal-distribution 2.4 7))`

`=> 2.4`

`geometric-mean`

is often used for numbers whose values
are exponential in nature, such as population growth or financial
investments.
Returns the harmonic mean, aka the F-1 score, of the distribution.
This is the reciprocal of the arithmetic `mean`

of the
reciprocals of the values. The `harmonic-mean`

is often
used in machine learning as an aggregated performance score over
other metrics (such as precision and recall), which penalizes a
single low score harder than the arithmetic or geometric means
would.
Returns the quadratic mean of the distribution, aka the
root-mean-squared (RMS), which is the square root of the mean
square (the generalized mean with exponent 2). The RMS of a
deviation or error (RMSE) is frequently used in machine learning
to evaluate the fitness of a model.
Returns the median of a distribution, the 50th percentile or
"middle" value if the elements were sorted.
Returns the mode, or most frequently occuring element of the
distribution.
Returns the variance, aka the second central moment, of a
distribution, defined as the expectation of the squared deviation
from the mean. Variance is used extensively in descriptive and
inferential statistics.
Returns the standard deviation, the square root of the
`variance`

. Unlike `variance`

this is expressed in
the same unit as the data, and can thus be easier to reason about.
In particular, the 68-95-99.7 rule states that 68% of the
population is within 1 stdev of the mean, 95% within 2 stdevs, and
99.7% within 3 stdevs.
A simple form of outlier removal is to remove data points more
than 3 (or 4) standard deviations from the mean.
An alias for `standard-deviation`

.
Returns the coefficient of variation, aka the relative standard
deviation (RSD) of `dist`

, the ratio of the
`standard-deviation`

to the `mean`

.
Returns the covariance, a measure of the joint variability, of the
two (non-pure) distributions `dist1`

and `dist2`

. The
sign of the covariance shows the tendency in the linear
relationship between the variables, though the magnitude is hard
to interpret. For a better understanding of the strength of the
relation, use the `pearson-correlation`

.
Returns the Pearson correlation coefficient (PCC), aka Pearson's
r, the Pearson product-moment correlation coefficient (PPMCC), or
the bivariate correlation. This is a value between [-1, 1]
showing the correlation between the two distributions `dist1`

and `dist2`

, with positive values indicating a positive linear
correlation, negative values indicating a negative linear
correlation, and values close to zero indicating no correlation.
Returns the Spearman's rank correlation coefficient, aka
Spearman's ρ. This is the `pearson-correlation`

between
the rank variables of `dist1`

and `dist2`

, indicating how
well their relationship can be described by a monotonic function.
Returns the skewness, aka the third standardized moment, of
`dist`

, which indicates the lopsidedness of the distribution.
For a unimodal distribution, negative skew indicates the tail is
to the left of the mode, and positive skew indicates it is to the
right. Zero skew indicates the tails balance out, though may not
be symmetric.
Returns the kurtosis, aka the fourth standardizes moment, of
`dist`

, indicating the heaviness of the tail. A
`kurtosis`

less than 3 is called "platykurtic", suggesting
the distribution is flat or has fewer extremes than a normal
distribution, whereas a `kurtosis`

greater than 3 is called
"leptokurtic", and has more outliers than a normal distribution.
Returns the excess kurtosis, aka Pearson's kurtosis, of
`dist`

. The `kurtosis`

of any normal distribution is
3, so this is defined as the kurtosis minus 3.
Returns the maximum value of `dist`

.
Returns the minimum value of `dist`

.
Returns the `n`

th percentile of `dist`

. If `dist`

is
non-pure and the rank doesn't fall on an exact position, uses
linear interpolation between adjacent ranks, which produces better
results for small samples.
Ranks a sorted list (x1 x2 ... xn) returning
`((rank1 . x1) (rank2 . x2) ... xn)`

where the ranks are 1..n if there all xi are unique,
otherwise the average rank among equal elements.Returns a new list with the elements replaced by their rank.
Returns a random flip of a weighted coin, i.e. runs a Bernoulli
trial, returning `#t`

with probability `p`

(default
0.5) and `#f`

otherwise.Returns a random value from distribution `dist`

.Returns a sequence of `n`

random value from distribution
`dist`

. If `dist`

is a list or distribution based on one
the result is a list, otherwise it's a vector.Like `random-sample`

, but returns a sample with replacement.
Compute the minimum sample size for a population num-total, with
the given confidence, error bound requirements, and population
proportion.
- num-total: total population size (+inf.0 uses the normal approximation to binomial dist, otherwise we use the normal approximation to the hypergeometric dist)
- conf: desired level of confidence, default 1.96 (95%)
- err: desired error bounds, default 3%
- p: population proportion, default 0.5

`(min-sample-size +inf.0 2.75 0.01)`

`=> 18906.25`

`(min-sample-size 73015 2.75 0.01)`

`=> 15017.799056791075`

`sample`

can be approximated
by a normal distribution. Returns the probability that
`sample`

is consistent with `dist`

, default one-tailed.
This is more convenient than the T-test if the population size is
large or the variance is known.
`tails`

must be 1 (the default) for a one-tailed test, or 2
for a two-tailed test. If you are testing whether a mean is
greater (or less) than the population, you want a one-tailed test,
but if you are just testing whether it is different (in either
direction) you want a two-tailed test.
Example: We claim it takes Schemers longer to fix reported bugs
because they get distracted finding the perfect solution to the
problem. Querying the
Github API
indicates that the overall mean time to close an issue is 7.8 days
with a variance of 23.1 (made up numbers). Restricting this to
just the 5,197 Scheme repositories yields a mean time to close of
8.3 days. Is our claim valid? Assuming the distribution for the
time to fix bugs is normal, we can run a `z-test`

:
```
(z-test (summary-distribution 'mean: 8.3 'size: 5197)
(normal-distribution 7.8 23.1))
```

`=> 0.05933335376636473`

`z-test`

for `sample`

and
`dist`

meets the given `alpha`

level, default 5%.
```
(z-test? (summary-distribution 'mean: 8.3 'size: 5197)
(normal-distribution 7.8 23.1))
```

`=> #f`

`dist`

approximates `sample`

.
```
(z-statistic (summary-distribution 'mean: 8.3 'size: 5197)
(normal-distribution 7.8 23.1))
```

`=> 1.5603943993695748`

`dist1`

and `dist2`

have the same mean,
under the assumption the underlying distributions are normal.
`tails`

defaults to 1 and has the same semantics as in the
`z-test`

.
Example: You made a micro-optimization to your code but the
benchmark results are inconsistent, sometimes even being slower
than before. The average of the times is faster, but you're not
sure if this is a real improvement or just noise. Assuming the
noise is normally distributed, we can run a `t-test`

on the
times:
```
(let ((before '(30.02 29.99 30.11 29.97 30.01 29.99))
(after '(29.89 29.93 29.72 29.98 30.02 29.98)))
(t-test after before))
```

`=> 0.04538666186968663`

```
(gain (mean '(30.02 29.99 30.11 29.97 30.01 29.99))
(mean '(29.89 29.93 29.72 29.98 30.02 29.98)))
```

`=> -0.0031650841246043267`

`t-test`

for `dist1`

and
`dist2`

meets the given `alpha`

level, default 5%.
Example: Given a random sample of Schemer's IQs, can we assume
they are smarter than the overall population?
`(t-test? '(120 115 95 135) (normal-distribution 100 225))`

`=> #f`

`dist1`

matches `dist2`

.
`(t-test? '(120 115 95 135) (normal-distribution 100 225))`

`=> #f`

`t-test`

comparing
`dist1`

and `dist2`

.
`(t-df '(120 115 95 135) (normal-distribution 100 225))`

`=> 3`

`observed`

and `expected`

distributions. Returns the p-value.
Example: We want to know if our new PRNG is generating evenly
distributed die rolls. We can test this by comparing a histogram
of the values generated for 1 through 6, against the expected
uniform distribution:
```
(let ((rolls-histogram '#(5 8 9 8 10 20)))
(chi^2-test rolls-histogram (discrete-uniform-distribution 1 6)))
```

`=> 0.019905220334775264`

`#t`

iff the p-value from `chi^2-test`

is
less than `alpha`

(default 0.05), i.e. iff the two
distributions differ.
```
(let ((rolls-histogram '#(5 8 9 8 10 20)))
(chi^2-test? rolls-histogram (discrete-uniform-distribution 1 6)))
```

`=> #t`

`chi^2-test`

.
```
(let ((rolls-histogram '#(5 8 9 8 10 20)))
(chi^2-statistic rolls-histogram (discrete-uniform-distribution 1 6)))
```

`=> 67/5`

`(chi^2-df '#(5 8 9 8 10 20))`

`=> 5`

`observed`

and `expected`

distributions, which performs better than the
chi^2 test in many cases. Returns the p-value.
Example: Your university claimed equal enrollment for men and
women in Computer Science, but when you show up the first day to
COMP 101 there are only 42 girls out of a class of 100. Is the
42:58 ratio significantly different from the expected 50:50?
`(g-test '(42 58) '(50 50))`

`=> 0.10883642860717235`

`#t`

iff the p-value from `g-test`

is less
than `alpha`

(default 0.05) i.e. iff the two distributions
differ.
`(g-test? '(42 58) '(50 50))`

`=> #f`

`g-test`

.
`(g-statistic '(42 58) '(50 50))`

`=> 2.571036073558357`

`observed`

distribution
matches the `expected`

. This is more accurate than the
`chi^2-test`

and `g-test`

when there are few values.
Returns the p-value.
Example: A mobile game you play has an option to spin a virtual
capsule machine as an in-game purchase, with each spin giving you
a new character. They advertise a 3% chance of getting a rare
character. You pay for 100 spins and get only 1 rare character.
We can run a `binomial-test`

to determine if we believe
their advertised probability:
`(binomial-test (bernoulli-trials 1 100) (mean-distribution 0.03))`

`=> 0.04755250792540563`

`#t`

iff the p-value from `binomial-test`

is
less than `alpha`

, default 0.05.
`(binomial-test? (bernoulli-trials 1 100) (mean-distribution 0.03))`

`=> #t`

`dist1`

and `dist2`

. This is frequently used to
determine if two samples come from the same distribution, without
needing to make assumptions about that distribution (e.g. it need
not be normal).Returns `#t`

iff the `kolmogorov-smirnoff-test`

value between `dist1`

and `dist2`

is greater than the
`kolmogorov-smirnoff-critical`

value.Returns the underlying statistic for the
`kolmogorov-smirnoff-test`

.Performs a Wilcoxon signed ranked test to determine whether the
mean ranks of `dist1`

and `dist2`

(which must be paired,
i.e. the elements in aligned order) differ. It can be used as an
alternative to the `t-test`

when the distributions can't be
assumed to be normal. Returns `#t`

iff we fail to reject
the null hypothesis, i.e. we assume the distributions are the
same.Returns the underlying statistic for the `wilcoxon-test?`

.Returns the critical value for comparison against the statistic in
the `wilcoxon-test?`

.Performs an f-test of equality of variance of the distributions.
Returns the p-value.Returns `#t`

iff the `f-test`

is less than `alpha`

.Returns the underlying statistic for `f-test`

.
Returns the lower bound of the Wilson score interval, a binomial
proportion confidence interval. This interval gives a better
understanding of the probability of success from a series of
Bernoulli trials than the raw ratio, which can be misleading for a
small number of trials.Returns the upper bound of the Wilson score interval.Returns the offset used in `wilson-lower-bound`

and
`wilson-upper-bound`

.A similarity measure, also known as the Jaccard similarity
coefficient, returns the size of the intersection of `seq1`

and `seq2`

divided by the size of their union. The range is in
[0, 1], with 0 indicating no intersection and 1 indicating they
are the same sets. Equality is determined by the comparator
`cmp`

, defaulting to an `equal?`

comparator.Returns the dissimilarity between `seq1`

and `seq2`

, this
is 1 - the index.The Sørensen-Dice coefficient, aka Czekanowski's binary index,
Zijdenbos similarity index. Another similarity index, this is
essentially the set equivalent of the `harmonic-mean`

.
The corresponding distance function (subtracted from 1), unlike
Jaccard, is not a proper distance metric because it doesn't
satisfy the triangle inequality.
Computes a simple linear regression using ordinary least squares
on the paired sequences `x`

and `y`

and returns two
values: the y-intercept and the slope of the line closest fitting
the points. This is the univariate case and so simply called
`fit-line`

to distinguish from the multivariate linear
regression used in machine learning.
Example: The number of SRFIs has been increasing at a steady rate
in the past 20 years. Assuming the growth is linear, how many
final SRFIs do we expect in 2030?
```
(let ((srfis '((1999 9) (2000 16) (2001 19) (2002 30) (2003 34)
(2004 42) (2005 58) (2006 62) (2007 67) (2008 71)
(2009 72) (2010 73) (2012 74) (2013 82) (2014 85)
(2015 92) (2016 108) (2017 119) (2018 124)
(2019 138) (2020 151))))
(call-with-values (lambda () (fit-line (map car srfis) (map cadr srfis)))
(lambda (m b)
(let ((f (lambda (x) (inexact (+ (* m x) b)))))
(f 2030)))))
```

`=> 193.28006039038067`

`fit-line`

is a bad tool.
Returns `x`

raised to the 3rd power.Returns `x`

raised to the 4th power.Returns the sign of `x`

, either 0, 1 or -1.Returns `n`

! = 1*2*...*`n`

.Returns the log of the gamma function applied to `x`

,
i.e. just the first value of `flloggamma`

from SRFI 144.Returns the regularized lower incomplete gamma, γ(varsThe probability distribution function for the beta distribution.Returns the incomplete beta function, Β(`a`

, `b`

).Returns the quantile function, or inverse distribution function,
for the given `cdf`

, with `statistics`

used to determine
the domain.Utility to return a random value matching a given `cdf`

, with
`statistics`

used to determine the domain.Returns a procedure of one value which approximates the derivative
of `f`

.Also known as the binomial coefficient, returns the number of
combinations of `k`

elements from a set of size `n`

.Returns the number of permutations of `k`

ordered elements
from a set of size `n`

.Returns the sum of the elements of `seq`

, which must be real.
This and all operations involving sums are numerically stable.
The current implementation uses Neumaier's variant of Kahan
summation.
Examples:
`(sum (make-list 10000 (acos -1)))`

`=> 31415.926535897932`

`(sum '#(1 1e100 1 -1e100))`

`=> 2.0`

`sum`

but sums the `square`

s of the elements.Like `sum`

but sums the inverse (`/`

) of the elements.Like `sum`

but sums the `log`

of the elements.Generalized stable `sum`

. Applies `proc`

to the
sequences and returns the sum of all the results. It is an error
if the sequences don't all have the same length.Creates a new running-sum object initialized to `init`

,
defaulting to 0. A running-sum allows you take compute a
numerically stable sum incrementally, so that it can be updated
over time or fed from a generator or other data structure, etc.
Returns true iff `x`

is a running-sum object.
Returns the current value of the running-sum `rs`

.Adds `x`

to the running-sum `rs`

.Subtracts `x`

from the running-sum `rs`

.Returns the product of the elements in `seq`

. Multiplication
is stable so there is no need for error compensation.The relative change from `x`

to `y`

, useful to answer the
question what percentage the gain (or loss) was.
Examples:
`(gain 50 60)`

`=> 1/5`

`(gain 2/3 1)`

`=> 1/2`

`(gain .75 .50)`

`=> -0.3333333333333333`

`max-size`

elements. As new elements are added beyond
`max-size`

, the elements are probabilistic replaced such that
the sample has the same distribution as if `random-sample`

was run on the full population.Probabilistically adds `elt`

to the reservoir `rv`

.Returns a vector containing the current elements of the reservoir
`rv`

, which may share space with the underlying data structure.Returns a new empty histogram for describing a distribution using
up to `max-bins`

bins.
Returns `#t`

iff `x`

is a histogram.
Adds an element to a histogram.Returns a new histogram summarizing the elements of list `ls`

.Returns a new histogram summarizing the elements of vector `vec`

.Returns the smallest value seen by the histogram `hist`

.Returns the largest value seen by the histogram `hist`

.Returns the `x`

th quantile ([0, 1]) of the histogram `hist`

.Returns the `x`

th decile ([0, 10]) of the histogram `hist`

.Simple utility to plot a histogram with characters in a
fixed-width font.Computes the cumulative distribution function of `hist`

at
point `x`

.
- The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Donald E. Knuth (1997)
- An Introduction to R, R Core Team (1999-2018)
- XLISP-STAT, Luke Tierney (1989)
- Apache Commons Math - Statistics, The Apache Software Foundation (2003-2016)
- Evaluating Kolmogorov’s Distribution, George Marsaglia et al (2003)
- SRFI 144 - Flonums, John Cowan and Will Clinger (2017)
- Rounding Error Analysis of Some Methods for Summing Finite Sums, Arnold Neumaier (1974)
- A Streaming Parallel Decision Tree Algorithm, Yael Ben-Haim et al (2010)
- Modes, Medians and Means: A Unifying Perspective, John Myles White (2013)
- On Average, You’re Using the Wrong Average, Daniel McNichol (2018)