(chibi math stats)

Statistics is the branch of mathematics dealing with the collection and analysis of data. As we are increasingly overwhelmed with data, it is important to be able to analyze it, and so statistics should have a prominent role in any programming language's standard libraries. Broadly speaking, statistics can be divided into descriptive statistics and inferential statistics. This library provides basic utilities for both, covering many common use cases while serving as a basis for more advanced uses.

Distributions

This library takes a holistic view of a distribution, with procedures working on both theoretical and empirical distributions.

Distribution Type

(distribution? x)

Returns #t iff x is a distribution.

(pure-distribution name pdf cdf statistics [random])

Creates a new theoretical distribution. name should be a symbol and is used for debugging. The probability density function pdf, cumulative distribution function cdf, and summary statistics are required. random is optional and can be inferred from the cdf, though faster implementations are usually available.

(pure-discrete-distribution name pmf cdf statistics [random])

Creates a new theoretical discrete distribution, similar to pure-distribution.

(distribution seq)

Creates a new distribution for the sampled data in seq.

(population seq)

Equivalent to distribution, but records that seq represents the full population rather than a subsample.

(open-distribution)

Creates a new running sample distribution to which new elements can be added.

(summary-distribution . o)

Returns a new distribution from summary statistics, without a sample. Takes keyword arguments in the style of (chibi optional), e.g. 'mean:, 'variance: and 'size: as used in the examples in this document. Example: create a summary given just a mean and variance:
(mean (summary-distribution 'mean: 0.5 'variance: 2.5))
=> 0.5

(bernoulli-trials successes trials)

A utility for a summary distribution of Bernoulli trials, based on the number of successes out of the total number of trials.
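For example, 3 successes out of 4 trials is summarized by its sample proportion, assuming the summary records the proportion of successes as its mean (as suggested by the binomial-test example below), possibly as an inexact number:
(mean (bernoulli-trials 3 4))
=> 3/4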

(mean-distribution mean)

A utility for a summary distribution where we only know the mean.

(summarize dist)

Returns a new distribution containing only the summary statistics of the given distribution.

(distribution-pure? dist)

Returns #t iff dist is a theoretical distribution (i.e. is derived from a pdf).

(distribution-open? dist)

Returns #t iff dist is a sample distribution to which new elements can be added.

(distribution-add! dist elt)

Adds a new element elt to the sample distribution dist.
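For example, a sketch of incremental use (the mean may be returned inexactly depending on the internal summary):
(let ((d (open-distribution)))
  (distribution-add! d 1)
  (distribution-add! d 2)
  (distribution-add! d 3)
  (mean d))
=> 2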

(distribution-pdf dist)

Returns the pdf if dist is a pure distribution, #f otherwise.

(distribution-cdf dist)

Returns the cdf if dist is a pure distribution, #f otherwise.

(distribution-values dist)

Returns the underlying values, or sampled representatives, of dist, if available. If you want a sample from any distribution, use random-sample.

Standard Distributions

(normal-distribution [mean variance])

Returns a new normal (gaussian) distribution with the given mean and variance. Also known as a bell curve, many things are normally distributed, and the Central Limit Theorem says that the sum of independent random variables tends towards a normal distribution.

(uniform-distribution a b . o)

Returns a new continuous uniform distribution over [a, b]. Every value in the range is equally likely.

(discrete-uniform-distribution a b . o)

Returns a new discrete uniform distribution over [a, b], in intervals of the optional step (default 1). Every value in the range is equally likely.

(poisson-distribution lam k . o)

Returns a new Poisson distribution, a discrete distribution measuring the probability that a certain number of events occurs within a certain period of time.

(kolmogorov-distribution n)

Returns a new Kolmogorov distribution, the distribution of the supremum of the Brownian bridge over [0, 1]. This is used in the Kolmogorov-Smirnoff test.

(binomial-distribution n p)

Returns a new binomial distribution, a discrete distribution of the number of successes over n Bernoulli trials with probability p.

(f-distribution d1 d2)

Returns a new F-distribution, aka the Fisher–Snedecor distribution, used in the F-test for ANOVA.

Descriptive Statistics

Descriptive statistics summarize some aspect of a distribution. The following procedures all take either a distribution, or for convenience a list or vector as an implicit sample distribution (as constructed by the distribution procedure). If a list or vector is passed, an optional getter argument is allowed to return the value of each element in the sequence. This allows idioms similar to dataframes in R and numpy, where the sequence elements could be vectors or records containing multiple fields. If a distribution is passed, we may not have complete information for that distribution, so any of these may return #f.

(size dist [getter])

Returns the size of the distribution population, which may be +inf.0 for pure distributions, or #f if unknown.
(size '(1 2 3))
=> 3
(size (normal-distribution 0 1))
=> +inf.0

Central Tendencies

Central tendencies measure a central or typical value of a distribution. The arithmetic mean, geometric-mean and harmonic-mean are the three classical Pythagorean means. The mean, median and mode can be unified as:
 (mode xi)   => (argmin s (sum (expt (abs (- xi s)) 0)))
 (median xi) => (argmin s (sum (expt (abs (- xi s)) 1)))
 (mean xi)   => (argmin s (sum (expt (abs (- xi s)) 2)))

(mean dist [getter])

Returns the arithmetic-mean, aka the first moment, what people typically think of as the average. As with other procedures in this library, the computation should be numerically stable.
(mean '(1 2 3))
=> 2
(mean '((a . 1) (b . 2) (c . 3)) cdr)
=> 2
(mean (make-list 1000000 #i1/3))
=> 0.3333333333333333
(mean (normal-distribution 2.4 7))
=> 2.4

(geometric-mean dist [getter])

Returns the geometric mean, which is the central tendency based on the product of the values, rather than the sum as in the arithmetic mean. In geometric terms, the geometric mean of two numbers, a and b, is the length of one side of a square whose area is equal to that of a rectangle with sides a and b. The geometric-mean is often used for numbers whose values are exponential in nature, such as population growth or financial investments.
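For example, the geometric mean of 1, 3 and 9 is the cube root of their product, i.e. approximately 3 (the exact result depends on floating-point rounding):
(geometric-mean '(1 3 9))
=> 3.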

(harmonic-mean dist [getter])

Returns the harmonic mean of the distribution, the reciprocal of the arithmetic mean of the reciprocals of the values. The harmonic mean is often used in machine learning as an aggregated performance score over other metrics (the F1 score is the harmonic mean of precision and recall), since it penalizes a single low score harder than the arithmetic or geometric means would.
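For example, the harmonic mean of 1, 4 and 4 is 3/(1/1 + 1/4 + 1/4) = 2 (possibly returned as an inexact number):
(harmonic-mean '(1 4 4))
=> 2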

(quadratic-mean dist [getter])

Returns the quadratic mean of the distribution, aka the root-mean-squared (RMS), which is the square root of the mean square (the generalized mean with exponent 2). The RMS of a deviation or error (RMSE) is frequently used in machine learning to evaluate the fitness of a model.
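For example, the quadratic mean of 1 and 7 is the square root of (1 + 49)/2 = 25, i.e. 5 (possibly inexact):
(quadratic-mean '(1 7))
=> 5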

(median dist [getter])

Returns the median of a distribution, the 50th percentile or "middle" value if the elements were sorted.
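For example, with an odd number of elements the median is simply the middle value:
(median '(1 2 3 4 5))
=> 3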

(mode dist [getter])

Returns the mode, or most frequently occurring element of the distribution.
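For example:
(mode '(1 2 2 3))
=> 2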

Central Moments

(variance dist [getter])

Returns the variance, aka the second central moment, of a distribution, defined as the expectation of the squared deviation from the mean. Variance is used extensively in descriptive and inferential statistics.
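For example, assuming lists are treated as samples so that Bessel's correction (an n-1 denominator) applies:
(variance '(1 2 3))
=> 1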

(standard-deviation dist [getter])

Returns the standard deviation, the square root of the variance. Unlike variance this is expressed in the same unit as the data, and can thus be easier to reason about. In particular, the 68-95-99.7 rule states that 68% of the population is within 1 stdev of the mean, 95% within 2 stdevs, and 99.7% within 3 stdevs. A simple form of outlier removal is to remove data points more than 3 (or 4) standard deviations from the mean.

(stdev dist [getter])

An alias for standard-deviation.

(coefficient-of-variation dist [getter])

Returns the coefficient of variation, aka the relative standard deviation (RSD) of dist, the ratio of the standard-deviation to the mean.

(covariance dist1 dist2 [getter1 [getter2]])

Returns the covariance, a measure of the joint variability, of the two (non-pure) distributions dist1 and dist2. The sign of the covariance shows the tendency in the linear relationship between the variables, though the magnitude is hard to interpret. For a better understanding of the strength of the relation, use the pearson-correlation.

(pearson-correlation dist1 dist2 [getter1 [getter2]])

Returns the Pearson correlation coefficient (PCC), aka Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), or the bivariate correlation. This is a value between [-1, 1] showing the correlation between the two distributions dist1 and dist2, with positive values indicating a positive linear correlation, negative values indicating a negative linear correlation, and values close to zero indicating no correlation.
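For example, a perfect positive linear relationship yields a coefficient of 1 (possibly inexact, subject to rounding):
(pearson-correlation '(1 2 3) '(2 4 6))
=> 1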

(spearman-rank-correlation dist1 dist2 [getter1 [getter2]])

Returns Spearman's rank correlation coefficient, aka Spearman's ρ. This is the pearson-correlation between the rank variables of dist1 and dist2, indicating how well their relationship can be described by a monotonic function.
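For example, a relationship which is monotonic but not linear still yields a perfect rank correlation (possibly inexact):
(spearman-rank-correlation '(1 2 3) '(1 10 100))
=> 1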

(skewness dist [getter])

Returns the skewness, aka the third standardized moment, of dist, which indicates the lopsidedness of the distribution. For a unimodal distribution, negative skew indicates the tail is to the left of the mode, and positive skew indicates it is to the right. Zero skew indicates the tails balance out, though the distribution may not be symmetric.

(kurtosis dist [getter])

Returns the kurtosis, aka the fourth standardized moment, of dist, indicating the heaviness of the tail. A kurtosis less than 3 is called "platykurtic", suggesting the distribution is flat or has fewer extremes than a normal distribution, whereas a kurtosis greater than 3 is called "leptokurtic", and has more outliers than a normal distribution.

(excess-kurtosis dist [getter])

Returns the excess kurtosis, aka Pearson's kurtosis, of dist. The kurtosis of any normal distribution is 3, so this is defined as the kurtosis minus 3.

Rank Statistics

(maximum dist [getter])

Returns the maximum value of dist.

(minimum dist [getter])

Returns the minimum value of dist.

(percentile dist n [getter])

Returns the nth percentile of dist. If dist is non-pure and the rank doesn't fall on an exact position, uses linear interpolation between adjacent ranks, which produces better results for small samples.
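For example, assuming n is on the usual 0-100 scale, the 50th percentile coincides with the median:
(percentile '(1 2 3 4 5) 50)
=> 3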

(rank ls [getter])

Ranks a sorted list (x1 x2 ... xn) returning
((rank1 . x1) (rank2 . x2) ... (rankn . xn))
where the ranks are 1..n if all the xi are unique, otherwise the average rank among equal elements.
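For example, the two 20s below tie for ranks 2 and 3 and so share the average rank 5/2 (the ranks may be returned as inexact numbers):
(rank '(10 20 20 30))
=> ((1 . 10) (5/2 . 20) (5/2 . 20) (4 . 30))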

(map-rank dist [less get-key])

Returns a new list with the elements replaced by their rank.

Sampling

(flip? [p])

Returns a random flip of a weighted coin, i.e. runs a Bernoulli trial, returning #t with probability p (default 0.5) and #f otherwise.
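For example, a fair coin toss (the result, of course, varies from run to run):
(if (flip?) 'heads 'tails)
=> heads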

(random-pick dist)

Returns a random value from distribution dist.

(random-sample n dist . o)

Returns a sequence of n random values from distribution dist. If dist is a list or a distribution based on one, the result is a list; otherwise it's a vector.

(make-multi-sampler seq get-norm-weight)

(random-multi-sample n x . o)

Like random-sample, but returns a sample with replacement.

(min-sample-size num-total [conf err p])

Computes the minimum sample size for a population num-total, with the given confidence, error bound requirements, and population proportion. For example, assume you want a 99% confidence (conf: 2.75) that the result is within 1%; then:
(min-sample-size +inf.0 2.75 0.01)
=> 18906.25
however if you know the total population the required size may be smaller (but never larger):
(min-sample-size 73015 2.75 0.01)
=> 15017.799056791075

Inferential Statistics

Hypothesis Testing

(z-test sample dist [tails])

Performs a Z-test, a test that a sample can be approximated by a normal distribution. Returns the probability that sample is consistent with dist, default one-tailed. This is more convenient than the t-test if the population size is large or the variance is known. tails must be 1 (the default) for a one-tailed test, or 2 for a two-tailed test. If you are testing whether a mean is greater (or less) than the population, you want a one-tailed test, but if you are just testing whether it is different (in either direction) you want a two-tailed test. Example: We claim it takes Schemers longer to fix reported bugs because they get distracted finding the perfect solution to the problem. Querying the GitHub API indicates that the overall mean time to close an issue is 7.8 days with a variance of 23.1 (made up numbers). Restricting this to just the 5,197 Scheme repositories yields a mean time to close of 8.3 days. Is our claim valid? Assuming the distribution for the time to fix bugs is normal, we can run a z-test:
(z-test (summary-distribution 'mean: 8.3 'size: 5197)
        (normal-distribution 7.8 23.1))
=> 0.05933335376636473
This is greater than 0.05, so at a 95% confidence level we must reject our hypothesis - Schemers are just as slow as everybody else.

(z-test? sample dist [alpha tails])

Returns true iff the z-test for sample and dist meets the given alpha level, default 5%.
(z-test? (summary-distribution 'mean: 8.3 'size: 5197)
         (normal-distribution 7.8 23.1))
=> #f

(z-statistic sample dist)

Returns the underlying Z statistic showing how well dist approximates sample.
(z-statistic (summary-distribution 'mean: 8.3 'size: 5197)
             (normal-distribution 7.8 23.1))
=> 1.5603943993695748

(t-test dist1 dist2 [tails])

Also known as Student's t-test, after the pen name "Student" used by William Sealy Gosset. Returns the probability that the two distributions dist1 and dist2 have the same mean, under the assumption that the underlying distributions are normal. tails defaults to 1 and has the same semantics as in the z-test. Example: You made a micro-optimization to your code, but the benchmark results are inconsistent, sometimes even slower than before. The average of the times is faster, but you're not sure if this is a real improvement or just noise. Assuming the noise is normally distributed, we can run a t-test on the times:
(let ((before '(30.02 29.99 30.11 29.97 30.01 29.99))
      (after  '(29.89 29.93 29.72 29.98 30.02 29.98)))
  (t-test after before))
=> 0.04538666186968663
This is less than a 0.05 alpha, so we can assume the improvement is significant, if small (0.3%):
(gain (mean '(30.02 29.99 30.11 29.97 30.01 29.99))
      (mean '(29.89 29.93 29.72 29.98 30.02 29.98)))
=> -0.0031650841246043267

(t-test? dist1 dist2 [alpha tails])

Returns true iff the t-test for dist1 and dist2 meets the given alpha level, default 5%. Example: Given a random sample of Schemers' IQs, can we assume they are smarter than the overall population?
(t-test? '(120 115 95 135) (normal-distribution 100 225))
=> #f

(t-statistic dist1 dist2)

Returns the underlying t statistic showing how well dist1 matches dist2.

(t-df dist1 dist2)

Returns the degrees of freedom for the t-test comparing dist1 and dist2.
(t-df '(120 115 95 135) (normal-distribution 100 225))
=> 3

(chi^2-test observed expected)

Performs Pearson's chi-squared test (χ²) to determine the goodness of fit between the observed and expected distributions. Returns the p-value. Example: We want to know if our new PRNG is generating evenly distributed die rolls. We can test this by comparing a histogram of the values generated for 1 through 6, against the expected uniform distribution:
(let ((rolls-histogram '#(5 8 9 8 10 20)))
  (chi^2-test rolls-histogram (discrete-uniform-distribution 1 6)))
=> 0.019905220334775264
Since this is less than the 0.05 alpha we reject the null hypothesis that the PRNG generates a uniform distribution, i.e. it is likely biased.

(chi^2-test? observed expected [alpha])

Returns #t iff the p-value from chi^2-test is less than alpha (default 0.05), i.e. iff the two distributions differ.
(let ((rolls-histogram '#(5 8 9 8 10 20)))
  (chi^2-test? rolls-histogram (discrete-uniform-distribution 1 6)))
=> #t

(chi^2-statistic observed expected)

Returns the underlying chi^2 statistic for chi^2-test.
(let ((rolls-histogram '#(5 8 9 8 10 20)))
  (chi^2-statistic rolls-histogram (discrete-uniform-distribution 1 6)))
=> 67/5

(chi^2-df observed [params])

Returns the degrees of freedom for a chi^2 goodness of fit test (as well as the g-test).
(chi^2-df '#(5 8 9 8 10 20))
=> 5

(g-test observed expected)

Performs a G-test for goodness of fit between the observed and expected distributions, which performs better than the chi^2 test in many cases. Returns the p-value. Example: Your university claimed equal enrollment for men and women in Computer Science, but when you show up the first day to COMP 101 there are only 42 girls out of a class of 100. Is the 42:58 ratio significantly different from the expected 50:50?
(g-test '(42 58) '(50 50))
=> 0.10883642860717235
Since this is greater than a 0.05 alpha we can't assume it is significant, i.e. this split is within the range of the university's claim.

(g-test? observed expected [alpha])

Returns #t iff the p-value from g-test is less than alpha (default 0.05) i.e. iff the two distributions differ.
(g-test? '(42 58) '(50 50))
=> #f

(g-statistic observed expected)

Returns the underlying statistic for the g-test.
(g-statistic '(42 58) '(50 50))
=> 2.571036073558357

(binomial-test observed expected [tails])

Performs a binomial test that the observed distribution matches the expected. This is more accurate than the chi^2-test and g-test when there are few values. Returns the p-value. Example: A mobile game you play has an option to spin a virtual capsule machine as an in-game purchase, with each spin giving you a new character. They advertise a 3% chance of getting a rare character. You pay for 100 spins and get only 1 rare character. We can run a binomial-test to determine if we believe their advertised probability:
(binomial-test (bernoulli-trials 1 100) (mean-distribution 0.03))
=> 0.04755250792540563
The p-value is less than 0.05 so you should probably demand your money back.

(binomial-test? observed expected [tails alpha])

Returns #t iff the p-value from binomial-test is less than alpha, default 0.05.
(binomial-test? (bernoulli-trials 1 100) (mean-distribution 0.03))
=> #t

(kolmogorov-smirnoff-test dist1 dist2 [strict?])

Performs a Kolmogorov-Smirnoff test, returning the distance between dist1 and dist2. This is frequently used to determine if two samples come from the same distribution, without needing to make assumptions about that distribution (e.g. it need not be normal).

(kolmogorov-smirnoff-test? dist1 dist2 [alpha strict?])

Returns #t iff the kolmogorov-smirnoff-test value between dist1 and dist2 is greater than the kolmogorov-smirnoff-critical value.

(kolmogorov-smirnoff-statistic dist1 dist2)

Returns the underlying statistic for the kolmogorov-smirnoff-test.

(kolmogorov-smirnoff-critical n m [alpha])

Returns the critical value for comparison against the statistic in kolmogorov-smirnoff-test?.

(wilcoxon-test? dist1 dist2 [tails alpha])

Performs a Wilcoxon signed-rank test to determine whether the mean ranks of dist1 and dist2 (which must be paired, i.e. the elements in aligned order) differ. It can be used as an alternative to the t-test when the distributions can't be assumed to be normal. Returns #t iff we fail to reject the null hypothesis, i.e. we assume the distributions are the same.

(wilcoxon-statistic dist1 dist2)

Returns the underlying statistic for the wilcoxon-test?.

(wilcoxon-critical nr tails alpha)

Returns the critical value for comparison against the statistic in the wilcoxon-test?.

(f-test seq-of-seqs)

Performs an F-test of the equality of the variances of the given distributions. Returns the p-value.

(f-test? seq-of-seqs [alpha])

Returns #t iff the p-value from f-test is less than alpha.

(f-statistic seq-of-seqs)

Returns the underlying statistic for f-test.

Confidence Intervals

(wilson-lower-bound successes trials [confidence])

Returns the lower bound of the Wilson score interval, a binomial proportion confidence interval. This interval gives a better understanding of the probability of success from a series of Bernoulli trials than the raw ratio, which can be misleading for a small number of trials.

(wilson-upper-bound successes trials [confidence])

Returns the upper bound of the Wilson score interval.

(wilson-score-offset successes trials [z])

Returns the offset used in wilson-lower-bound and wilson-upper-bound.

Similarity

(jaccard-index dist1 dist2 [cmp])

A similarity measure, also known as the Jaccard similarity coefficient; returns the size of the intersection of dist1 and dist2 divided by the size of their union. The range is [0, 1], with 0 indicating no intersection and 1 indicating equal sets. Equality is determined by the comparator cmp, defaulting to an equal? comparator.
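For example, '(a b c) and '(b c d) share 2 of 4 distinct elements (the result may be inexact):
(jaccard-index '(a b c) '(b c d))
=> 1/2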

(jaccard-distance dist1 dist2 . o)

Returns the dissimilarity between dist1 and dist2, defined as 1 minus the jaccard-index.

(sorenson-dice-index seq1 seq2 . o)

The Sørensen-Dice coefficient, aka Czekanowski's binary index or the Zijdenbos similarity index. Another similarity index, this is essentially the set equivalent of the harmonic-mean. The corresponding distance function (subtracted from 1), unlike the Jaccard distance, is not a proper distance metric because it doesn't satisfy the triangle inequality.
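For the same sets as in the jaccard-index example above, the index is (* 2 2)/(+ 3 3) (the result may be inexact):
(sorenson-dice-index '(a b c) '(b c d))
=> 2/3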

Simple Linear Regression

(fit-line x y)

Computes a simple linear regression using ordinary least squares on the paired sequences x and y, returning two values: the slope and the y-intercept of the line best fitting the points. This is the univariate case, and so is simply called fit-line to distinguish it from the multivariate linear regression used in machine learning. Example: The number of SRFIs has been increasing at a steady rate in the past 20 years. Assuming the growth is linear, how many final SRFIs do we expect in 2030?
(let ((srfis '((1999 9) (2000 16) (2001 19) (2002 30) (2003 34)
               (2004 42) (2005 58) (2006 62) (2007 67) (2008 71)
               (2009 72) (2010 73) (2012 74) (2013 82) (2014 85)
               (2015 92) (2016 108) (2017 119) (2018 124)
               (2019 138) (2020 151))))
  (call-with-values (lambda () (fit-line (map car srfis) (map cadr srfis)))
    (lambda (m b)
      (let ((f (lambda (x) (inexact (+ (* m x) b)))))
        (f 2030)))))
=> 193.28006039038067
Given only 73 SRFIs over the first decade this may seem reasonable, but this doesn't account for the recent explosion in activity, with over 200 SRFIs counting those in draft. If the relation is not linear, fit-line is a bad tool.

Utilities

(cube x)

Returns x raised to the 3rd power.

(tesseract x)

Returns x raised to the 4th power.

(sign x)

Returns the sign of x, either 0, 1 or -1.

(factorial n)

Returns n! = 1*2*...*n.

(loggamma x)

Returns the log of the gamma function applied to x, i.e. just the first value of flloggamma from SRFI 144.

(lower-incomplete-gamma s z)

Returns the regularized lower incomplete gamma function, P(s, z) = γ(s, z)/Γ(s).

(beta-pdf a b x)

The probability density function for the beta distribution.

(beta a b [x])

Returns the beta function Β(a, b); if x is given, returns the incomplete beta function Β(x; a, b).

(inverse-transform-cdf cdf statistics)

Returns the quantile function, or inverse distribution function, for the given cdf, with statistics used to determine the domain.

(inverse-transform-random cdf statistics)

Utility to return a random value matching a given cdf, with statistics used to determine the domain.

(numeric-diff f)

Returns a procedure of one argument which approximates the derivative of f.
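For example, the derivative of x² at 3 is approximately 6 (the exact digits depend on the step of the approximation):
((numeric-diff (lambda (x) (* x x))) 3)
=> 6.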

(combinations n k)

Also known as the binomial coefficient, returns the number of combinations of k elements from a set of size n.

(permutations n k)

Returns the number of permutations of k ordered elements from a set of size n.
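For example, choosing 2 of 5 elements:
(combinations 5 2)
=> 10
(permutations 5 2)
=> 20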

(sum seq . o)

Returns the sum of the elements of seq, which must be real. This and all operations involving sums are numerically stable. The current implementation uses Neumaier's variant of Kahan summation. Examples:
(sum (make-list 10000 (acos -1)))
=> 31415.926535897932
(sum '#(1 1e100 1 -1e100))
=> 2.0

(square-sum seq . o)

Like sum but sums the squares of the elements.

(reciprocal-sum seq . o)

Like sum but sums the inverse (/) of the elements.

(log-sum seq . o)

Like sum but sums the log of the elements.

(map-sum proc seq . o)

Generalized stable sum. Applies proc elementwise to the sequences and returns the sum of all the results. It is an error if the sequences don't all have the same length.
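For example, summing the squares of a sequence (equivalent to square-sum):
(map-sum (lambda (x) (* x x)) '(1 2 3))
=> 14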

(make-running-sum [init])

Creates a new running-sum object initialized to init, defaulting to 0. A running-sum allows you to compute a numerically stable sum incrementally, so that it can be updated over time or fed from a generator or other data structure, etc.

(running-sum? x)

Returns true iff x is a running-sum object.

(running-sum-get rs)

Returns the current value of the running-sum rs.

(running-sum-inc! rs x)

Adds x to the running-sum rs.

(running-sum-dec! rs x)

Subtracts x from the running-sum rs.
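For example, a sketch of incremental use, relying on the same compensation shown in the sum examples above:
(let ((rs (make-running-sum)))
  (running-sum-inc! rs 1e100)
  (running-sum-inc! rs 1)
  (running-sum-dec! rs 1e100)
  (running-sum-get rs))
=> 1.0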

(product seq . o)

Returns the product of the elements in seq. Multiplication is stable so there is no need for error compensation.

(gain x y)

Returns the relative change from x to y, useful for answering what percentage was gained (or lost). Examples:
(gain 50 60)
=> 1/5
(gain 2/3 1)
=> 1/2
(gain .75 .50)
=> -0.3333333333333333

Additional Data Structures

We make use of reservoirs and histograms for summarizing distributions.

Reservoirs

(make-reservoir max-size)

Returns a new reservoir, a random running sample of up to max-size elements. As new elements are added beyond max-size, existing elements are probabilistically replaced such that the sample has the same distribution as if random-sample had been run on the full population.

(reservoir-maybe-add! rv elt)

Probabilistically adds elt to the reservoir rv.

(reservoir->vector! rv)

Returns a vector containing the current elements of the reservoir rv, which may share space with the underlying data structure.
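For example, sampling 3 of the first 100 non-negative integers (iota here is from SRFI 1; the particular elements vary from run to run):
(let ((rv (make-reservoir 3)))
  (for-each (lambda (x) (reservoir-maybe-add! rv x)) (iota 100))
  (reservoir->vector! rv))
=> #(12 57 83)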

Histograms

(make-histogram [max-bins])

Returns a new empty histogram for describing a distribution using up to max-bins bins.

(histogram? x)

Returns #t iff x is a histogram.

(histogram-add! hist elt)

Adds the element elt to the histogram hist.

(list->histogram ls [max-bins])

Returns a new histogram summarizing the elements of list ls.

(vector->histogram vec [max-bins])

Returns a new histogram summarizing the elements of vector vec.

(histogram-min hist)

Returns the smallest value seen by the histogram hist.

(histogram-max hist)

Returns the largest value seen by the histogram hist.
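For example:
(let ((hist (list->histogram '(3 1 4 1 5 9 2 6))))
  (list (histogram-min hist) (histogram-max hist)))
=> (1 9)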

(histogram-scale hist . o)

(histogram-quantile hist x)

Returns the xth quantile ([0, 1]) of the histogram hist.

(histogram-deciles hist)

Returns a list of the deciles of the histogram hist.

(histogram-plot-ascii hist [width height])

Simple utility to plot a histogram with characters in a fixed-width font.

(histogram-cdf hist x)

Computes the cumulative distribution function of hist at point x.
