Test statistic functions.
from collections import namedtuple

import numpy as np
import scipy.interpolate
import scipy.special
import scipy.stats
import statsmodels.api as sm
import warnings

import distributions as distr
from sympy import binomial, log
Ks_2sampResult = namedtuple('Ks_2sampResult', ('statistic', 'pvalue', 'ndf'))
Computes the Kolmogorov-Smirnov statistic on 2 samples/frequencies.

This is a two-sided test for the null hypothesis that 2 independent samples
are drawn from the same continuous distribution.

data1, data2 : sequence of 1-D ndarrays
    Input data. Can be either observed frequencies in each category or
    arrays of sample observations.
binned : bool, optional, default: False
    If True, treat the input data as observed frequencies; if False, as
    arrays of sample observations.

"Degrees of freedom": the degrees of freedom for the p-value. Always NaN
in this case.
This code is modified from scipy.stats.ks_2samp and extended to support
frequency data. See [1]_ for further information.

This tests whether 2 samples are drawn from the same distribution. Note
that, as in the one-sample K-S test, the distribution is assumed to be
continuous.

This is the two-sided test; one-sided tests are not implemented.
The test uses the two-sided asymptotic Kolmogorov-Smirnov distribution.

If the K-S statistic is small or the p-value is high, then we cannot
reject the hypothesis that the distributions of the two samples are the
same.
.. [1] scipy.stats.mstats.ks_2samp, SciPy Reference Guide,
   http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.ks_2samp.html
.. [2] Porter, F. C. (2008). Testing Consistency of Two Histograms,
   arXiv:0804.0380 [physics.data-an].
>>> from scipy import stats
>>> np.random.seed(12345678)  # fix random seed to get the same result
>>> n1 = 200  # size of first sample
>>> n2 = 300  # size of second sample
For a different distribution, we can reject the null hypothesis since the
p-value is below 1%:
>>> rvs1 = stats.norm.rvs(size=n1, loc=0., scale=1)
>>> rvs2 = stats.norm.rvs(size=n2, loc=0.5, scale=1.5)
>>> stats.ks_2samp(rvs1, rvs2)
(0.20833333333333337, 4.6674975515806989e-05)
For a slightly different distribution, we cannot reject the null hypothesis
at a 10% or lower alpha since the p-value at 0.144 is higher than 10%:
>>> rvs3 = stats.norm.rvs(size=n2, loc=0.01, scale=1.0)
>>> stats.ks_2samp(rvs1, rvs3)
(0.10333333333333333, 0.14498781825751686)

For an identical distribution, we cannot reject the null hypothesis since
the p-value is high, 41%:

>>> rvs4 = stats.norm.rvs(size=n2, loc=0.0, scale=1.0)
>>> stats.ks_2samp(rvs1, rvs4)
(0.07999999999999996, 0.41126949729859719)
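The binned mode is this module's extension over scipy; a minimal sketch
(assuming the function is exposed as ``ks_2samp`` with the ``binned`` flag
documented above; the bin contents are purely illustrative):

>>> h1 = np.array([5., 20., 35., 25., 15.])
>>> h2 = np.array([10., 22., 30., 28., 10.])
>>> ks_2samp(h1, h2, binned=True)  # doctest: +SKIP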
cdf1 = np.cumsum(data1)
cdf2 = np.cumsum(data2)
data1 = np.sort(data1)
data2 = np.sort(data2)
data_all = np.concatenate([data1, data2])
cdf1 = np.searchsorted(data1, data_all, side='right') / (1.0 * n1)
cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0 * n2)
data1_kde = sm.nonparametric.KDEUnivariate(data1)
data2_kde = sm.nonparametric.KDEUnivariate(data2)
data1_kde.fit()
data2_kde.fit()

# Evaluate both KDE CDFs on the smaller of the two supports.
i, min_sup = min(enumerate([data1_kde.support, data2_kde.support]),
                 key=lambda data: data[1].shape[0])

cdf2 = scipy.interpolate.interp1d(data2_kde.support, data2_kde.cdf)(min_sup)

cdf1 = scipy.interpolate.interp1d(data1_kde.support, data1_kde.cdf)(min_sup)
d = np.max(np.absolute(cdf1 - cdf2))
# Asymptotic two-sided p-value from the Kolmogorov distribution, with the
# usual finite-sample correction factor (en + 0.12 + 0.11 / en).
en = np.sqrt(n1 * n2 / float(n1 + n2))
prob = scipy.stats.distributions.kstwobign.sf((en + 0.12 + 0.11 / en) * d)
statistic_seq = np.absolute(cdf1 - cdf2)
Chi2_2sampResult = namedtuple('Chi2_2sampResult', ('statistic', 'pvalue', 'ndf'))
The chi-square test tests the null hypothesis that 2 given frequencies are
drawn from the same distribution.

data1, data2 : sequence of 1-D ndarrays
    Input data. Observed frequencies in each category.
normed : bool, optional, default: True
    If True, perform a shape comparison test; if False, a value comparison
    test.
binned : bool, optional, default: True
    If True, treat the input data as observed frequencies; if False, as
    arrays of sample observations.
The chi-squared test statistic.

Corresponding p-value.

"Degrees of freedom": the degrees of freedom for the p-value. The p-value
is computed using a chi-squared distribution with k - 1 degrees of freedom,
where k is the number of bins remaining after removing those with zero
counts in both histograms.
This code is modified from scipy.stats.chisquare and extended to support
2-sample cases and shape comparison tests.
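A minimal usage sketch (the function name ``chi2_2samp`` is assumed from
its result type, and the bin contents are illustrative):

>>> import numpy as np
>>> h1 = np.array([10., 20., 30., 25., 15.])
>>> h2 = np.array([12., 18., 33., 22., 15.])
>>> res = chi2_2samp(h1, h2, normed=True)  # doctest: +SKIP
>>> res.statistic, res.pvalue, res.ndf     # doctest: +SKIP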
# Drop bins that are empty in both histograms.
keep = ~((data1 == 0.) & (data2 == 0.))
data1 = data1[keep]
data2 = data2[keep]
# Two-histogram chi-square: sum over bins of (d1 - d2)^2 / (d1 + d2).
chi2 = np.power(data1 - data2, 2).dot(np.power(data1 + data2, -1))
N_data1 = data1.sum()
N_data2 = data2.sum()

chi2 = np.power(data1 - data2, 2).dot(np.power(data1 / N_data1 + data2 / N_data2, -1))
statistic_seq = np.power(data1 - data2, 2) * np.power(data1 + data2, -1)

statistic_seq = np.power(data1 - data2, 2) * np.power(data1 / N_data1 + data2 / N_data2, -1)
BDM_2sampResult = namedtuple('BDM_2sampResult', ('statistic', 'pvalue', 'ndf'))
The Bhattacharyya test tests the null hypothesis that 2 given frequencies
are drawn from the same distribution.

data1, data2 : sequence of 1-D ndarrays
    Input data. Observed frequencies in each category.
normed : bool, optional
    If True, perform a shape comparison test; if False, a value comparison
    test.
binned : bool, optional
    If True, treat the input data as observed frequencies; if False, as
    arrays of sample observations.
The Bhattacharyya distance measure.

Corresponding p-value.

"Degrees of freedom": the degrees of freedom for the p-value. The p-value
is computed using a chi-squared distribution with k - 1 degrees of freedom,
where k is the number of bins remaining after removing those with zero
counts in both histograms.
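A minimal usage sketch (the function name ``bdm_2samp`` is assumed from its
result type; bin contents are illustrative):

>>> import numpy as np
>>> res = bdm_2samp(np.array([10., 20., 30.]), np.array([12., 18., 33.]))  # doctest: +SKIP
>>> res.pvalue  # doctest: +SKIP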
# Drop bins that are empty in both histograms.
keep = ~((data1 == 0.) & (data2 == 0.))
data1 = data1[keep]
data2 = data2[keep]

N_data1 = data1.sum()
N_data2 = data2.sum()
bdm = np.sqrt(data1 * data2).sum()
# Chi-squared-distributed approximation used for the p-value.
approx_chi2 = 2 * N_data1 + 2 * N_data2 - 4 * bdm

bdm /= np.sqrt(N_data1 * N_data2)

statistic_seq = 2 * data1 + 2 * data2 - 4 * np.sqrt(data1 * data2)
return BDM_2sampResult(bdm, scipy.stats.chi2.sf(approx_chi2, ndf), ndf)
CVM_2sampResult = namedtuple('CVM_2sampResult', ('statistic', 'pvalue', 'ndf'))
Computes the Cramér-von Mises statistic on 2 samples/frequencies.

This is a two-sided test for the null hypothesis that 2 independent samples
are drawn from the same continuous distribution.

data1, data2 : sequence of 1-D ndarrays
    Input data. Can be either observed frequencies in each category or
    arrays of sample observations.
binned : bool, optional, default: False
    If True, treat the input data as observed frequencies; if False, as
    arrays of sample observations.

"Degrees of freedom": the degrees of freedom for the p-value. Always NaN
in this case.
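A minimal usage sketch on unbinned data (the function name ``cvm_2samp`` is
assumed from its result type):

>>> import numpy as np
>>> rng = np.random.RandomState(0)
>>> res = cvm_2samp(rng.normal(size=200), rng.normal(size=300))  # doctest: +SKIP
>>> res.statistic, res.pvalue  # doctest: +SKIP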
cdf1 = np.cumsum(data1) / E1
cdf2 = np.cumsum(data2) / E2
weights = data1 + data2
data_all = np.concatenate([data1, data2])
cdf1 = np.searchsorted(data1, data_all, side='right') / (1.0 * N1)
cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0 * N2)
cvm = (weights * (cdf1 - cdf2)**2).sum() * E1 * E2 / (E1 + E2)**2
# Variance of the two-sample statistic under H0, cf. Anderson (1962).
var = (N + 1.) * (4. * N1 * N2 * N - 3 * (N1**2 + N2**2) - 2 * N1 * N2) / (4 * N1 * N2) / N**2 / 45

statistic_seq = (weights * (cdf1 - cdf2)**2) * E1 * E2 / (E1 + E2)**2
Bj = Z.cumsum() - lj / 2
for i in np.arange(0, k):

    Mij = d.cumsum() - d / 2
    inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj) - N * lj / 4.)
    A2akN += np.nan_to_num(inner).sum() / n[i]

statistic_seq.append(A2akN * (N - 1.) / N)

A2akN *= (N - 1.) / N
Compute A2akN, equation 7 of Scholz and Stephens.

samples : sequence of 1-D array_like
    Array of sample arrays.
Z : array_like
    Sorted array of all observations.
Zstar : array_like
    Sorted array of unique observations.
n : array_like
    Number of observations in each sample.
N : int
    Total number of observations.

A2aKN : float
    The A2aKN statistic of Scholz and Stephens 1987.
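For reference, equation 7 of Scholz and Stephens (1987) in this notation
(a transcription; L denotes the number of distinct observations):

.. math::

    A^2_{akN} = \frac{N - 1}{N} \sum_{i=1}^{k} \frac{1}{n_i}
        \sum_{j=1}^{L} \frac{l_j}{N}
        \frac{(N M_{ij} - n_i B_j)^2}{B_j (N - B_j) - N l_j / 4}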
Z_ssorted_left = Z.searchsorted(Zstar, side='left')

lj = Z.searchsorted(Zstar, side='right') - Z_ssorted_left
Bj = Z_ssorted_left + lj / 2.
for i in np.arange(0, k):
    s = np.sort(samples[i])
    s_ssorted_right = s.searchsorted(Zstar, side='right')
    Mij = s_ssorted_right.astype(float)
    fij = s_ssorted_right - s.searchsorted(Zstar, side='left')
    Mij -= fij / 2.

    inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj) - N * lj / 4.)
    A2akN += inner.sum() / n[i]

statistic_seq.append(A2akN * (N - 1.) / N)

A2akN *= (N - 1.) / N
Compute A2kN, equation 6 of Scholz and Stephens.

samples : sequence of 1-D array_like
    Array of sample arrays.
Z : array_like
    Sorted array of all observations.
Zstar : array_like
    Sorted array of unique observations.
n : array_like
    Number of observations in each sample.
N : int
    Total number of observations.

A2KN : float
    The A2KN statistic of Scholz and Stephens 1987.
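For reference, equation 6 of Scholz and Stephens (1987) in this notation
(a transcription; the outer sum excludes the largest distinct value):

.. math::

    A^2_{kN} = \sum_{i=1}^{k} \frac{1}{n_i}
        \sum_{j=1}^{L-1} \frac{l_j}{N}
        \frac{(N M_{ij} - n_i B_j)^2}{B_j (N - B_j)}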
lj = Z.searchsorted(Zstar[:-1], side='right') - Z.searchsorted(Zstar[:-1], side='left')
Bj = lj.cumsum()

for i in range(0, k):
    s = np.sort(samples[i])
    Mij = s.searchsorted(Zstar[:-1], side='right')
    inner = lj / float(N) * (N * Mij - Bj * n[i])**2 / (Bj * (N - Bj))
    A2kN += inner.sum() / n[i]
Anderson_ksampResult = namedtuple('Anderson_ksampResult', ('statistic', 'pvalue', 'ndf'))
399 """The Anderson-Darling test for 2 samples.
401 It tests the null hypothesis
402 that the 2 samples are drawn from the same population without
403 having to specify the distribution function of that population.
404 The critical values depend on the number of samples.
408 samples : sequence of 1-D array_like
409 Array of sample data in arrays.
418 「Degrees of freedom」: the degrees of freedom for the p-value.
419 Always NaN in this case.
424 If less than 2 samples are provided, a sample is empty, or no
425 distinct observations are in the samples.
ks_2samp : 2-sample Kolmogorov-Smirnov test
This code is modified from scipy.stats.anderson and extended to support
frequency data. See [1]_ for further information.

[2]_ defines three versions of the k-sample Anderson-Darling test: one
for continuous distributions and two for discrete distributions, in which
ties between samples may occur. The default of this routine is to compute
the version based on the midrank empirical distribution function. This
test is applicable to continuous and discrete data. If midrank is set to
False, the right-side empirical distribution is used for a test for
discrete data. According to [2]_, the two discrete test statistics differ
only slightly if a few collisions due to round-off errors occur in the
test not adjusted for ties between samples.
.. [1] scipy.stats.anderson, SciPy Reference Guide,
   http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson.html
.. [2] Scholz, F. W. and Stephens, M. A. (1987). K-Sample
   Anderson-Darling Tests, Journal of the American Statistical
   Association, Vol. 82, pp. 918-924.
>>> from scipy import stats
>>> np.random.seed(314159)

The null hypothesis that the two random samples come from the same
distribution can be rejected at the 5% level because the returned
test value is greater than the critical value for 5% (1.961) but
not at the 2.5% level. The interpolation gives an approximate
significance level of 3.1%:

>>> stats.anderson_ksamp([np.random.normal(size=50),
...                       np.random.normal(loc=0.5, size=30)])
array([ 0.325, 1.226, 1.961, 2.718, 3.752]),

The null hypothesis cannot be rejected for three samples from an
identical distribution. The approximate p-value (87%) has to be
computed by extrapolation and may not be very accurate:

>>> stats.anderson_ksamp([np.random.normal(size=50),
...                       np.random.normal(size=30), np.random.normal(size=20)])
(-0.73091722665244196,
 array([ 0.44925884, 1.3052767 , 1.9434184 , 2.57696569, 3.41634856]),
samples = [data1, data2]

    raise ValueError("anderson_ksamp needs at least two samples")

samples = list(map(np.asarray, samples))

Z = Zstar = sum(samples)
n = np.array([sample.sum() for sample in samples])

Z = np.sort(np.hstack(samples))

    raise ValueError("anderson_ksamp needs more than one distinct "
                     "observation")

n = np.array([sample.size for sample in samples])

    raise ValueError("anderson_ksamp encountered sample without "
                     "observations")
hs_cs = (1. / np.arange(N - 1, 1, -1)).cumsum()

g = (hs_cs / np.arange(2, N)).sum()

# Coefficients of the variance of A2kN under H0, Scholz & Stephens (1987).
a = (4 * g - 6) * (k - 1) + (10 - 6 * g) * H
b = (2 * g - 4) * k**2 + 8 * h * k + (2 * g - 14 * h - 4) * H - 8 * h + 4 * g - 6
c = (6 * h + 2 * g - 2) * k**2 + (4 * h - 4 * g + 6) * k + (2 * h - 6) * H + 4 * h
d = (2 * h + 6) * k**2 - 4 * h * k
sigmasq = (a * N**3 + b * N**2 + c * N + d) / ((N - 1.) * (N - 2.) * (N - 3.))

A2 = (A2kN - m) / np.sqrt(sigmasq)
# Interpolation coefficients from Table 2 of Scholz & Stephens (1987) for
# significance levels 25%, 10%, 5%, 2.5% and 1%; log(significance) is then
# fitted quadratically against the critical values.
b0 = np.array([0.675, 1.281, 1.645, 1.96, 2.326])
b1 = np.array([-0.245, 0.25, 0.678, 1.149, 1.822])
b2 = np.array([-0.105, -0.305, -0.362, -0.391, -0.396])
critical = b0 + b1 / np.sqrt(m) + b2 / m
pf = np.polyfit(critical, np.log(np.array([0.25, 0.1, 0.05, 0.025, 0.01])), 2)
if A2 < critical.min() or A2 > critical.max():
    warnings.warn("approximate p-value will be computed by extrapolation")

p = np.exp(np.polyval(pf, A2))
# Sum of log j for j = m..n, i.e. log(n! / (m - 1)!).
return np.log(np.arange(m, n + 1)).sum()
a = N_data2 / float(N_data1)

statistic_seq = -2 * (logC - scipy.special.xlogy(t, 1 + a) + scipy.special.xlogy(data2, a))

return -2 * (logC - scipy.special.xlogy(t, 1 + a).sum() + scipy.special.xlogy(data2, a).sum())
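# Note on the nonzero-bin branch above: following Porter (2008), with
# a = N_data2 / N_data1 and t = data1 + data2, each populated bin
# contributes
#     -2 * (ln C(t, data2) + data2 * ln a - t * ln(1 + a))
# to -2 ln(lambda); logC is assumed to hold the summed log binomial
# coefficients ln C(t, data2).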
z_not_in_both = (data1 == 0) ^ (data2 == 0)
data1 = data1[z_not_in_both]
data2 = data2[z_not_in_both]

p = N_data1 / float(N_data1 + N_data2)

statistic_seq = np.concatenate([statistic_seq,
                                -2 * scipy.special.xlogy(t[data2 == 0], p),
                                -2 * scipy.special.xlogy(t[data1 == 0], 1 - p)])

return -2 * (scipy.special.xlogy(t[data2 == 0], p).sum()
             + scipy.special.xlogy(t[data1 == 0], 1 - p).sum())
LikelihoodRatio_ksampResult = namedtuple('LikelihoodRatio_ksampResult', ('statistic', 'pvalue', 'ndf'))
The likelihood-ratio test tests the null hypothesis that 2 given frequencies
are drawn from the same distribution.

data1, data2 : sequence of 1-D ndarrays
    Input data. Observed frequencies in each category.
normed : bool, optional, default: True
    If True, perform a shape comparison test; if False, a value comparison
    test.
binned : bool, optional, default: True
    If True, treat the input data as observed frequencies; if False, as
    arrays of sample observations.
The likelihood-ratio test statistic.

Corresponding p-value.

"Degrees of freedom": the degrees of freedom for the p-value. The p-value
is computed using a chi-squared distribution with k - 1 degrees of freedom,
where k is the number of bins remaining after removing those with zero
counts in both histograms.
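A minimal usage sketch (the function name ``likelihoodratio_ksamp`` is
assumed from its result type; bin contents are illustrative):

>>> import numpy as np
>>> h1 = np.array([10., 20., 30., 25., 15.])
>>> h2 = np.array([12., 18., 33., 22., 15.])
>>> res = likelihoodratio_ksamp(h1, h2)  # doctest: +SKIP
>>> res.statistic, res.pvalue, res.ndf   # doctest: +SKIP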
not_all_zeros = (data1 != 0) & (data2 != 0)
ndf = sum((data1 != 0) | (data2 != 0))
N_data1 = data1.sum()
N_data2 = data2.sum()

llike_H0 = (_loglikelihood(data1[not_all_zeros], data2[not_all_zeros],
                           N_data1, N_data2, ratio=True, H="H0")
            + _zloglikelihood(data1[~not_all_zeros], data2[~not_all_zeros],
                              N_data1, N_data2, H="H0"))

# STATISTIC_SEQ keeps the per-bin H0 terms so the H1 terms can be
# subtracted from them below.
STATISTIC_SEQ = statistic_seq
llike_H1 = _loglikelihood(data1[not_all_zeros], data2[not_all_zeros],
                          N_data1, N_data2, ratio=True, H="H1")

STATISTIC_SEQ[:statistic_seq.shape[0]] -= statistic_seq
statistic_seq = STATISTIC_SEQ
llike_ratio = llike_H0 - llike_H1
LikelihoodValue_ksampResult = namedtuple('LikelihoodValue_ksampResult', ('statistic', 'pvalue', 'ndf'))
The likelihood-value test tests the null hypothesis that 2 given frequencies
are drawn from the same distribution.

data1, data2 : sequence of 1-D ndarrays
    Input data. Observed frequencies in each category.
normed : bool, optional, default: True
    If True, perform a shape comparison test; if False, a value comparison
    test.
binned : bool, optional, default: True
    If True, treat the input data as observed frequencies; if False, as
    arrays of sample observations.
The likelihood-value test statistic.

Corresponding p-value.

"Degrees of freedom": the degrees of freedom for the p-value. The p-value
is computed using a chi-squared distribution with k - 1 degrees of freedom,
where k is the number of bins remaining after removing those with zero
counts in both histograms.
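A minimal usage sketch (the function name ``likelihoodvalue_ksamp`` is
assumed from its result type; compare its ``pvalue`` with the
likelihood-ratio variant on the same histograms):

>>> import numpy as np
>>> h1 = np.array([10., 20., 30., 25., 15.])
>>> h2 = np.array([12., 18., 33., 22., 15.])
>>> likelihoodvalue_ksamp(h1, h2)  # doctest: +SKIP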
no_zeros = (data1 != 0) | (data2 != 0)

llike = _loglikelihood(data1[no_zeros], data2[no_zeros], data1.sum(),
                       data2.sum(), ratio=False, H="H0")