Using the chi square test. MS EXCEL functions using the CH2 distribution

The chi-square test is a universal method for checking the agreement between the results of an experiment and the statistical model used.

Pearson distance X²

Pyatnitsky A.M.

Russian State Medical University

In 1900, Karl Pearson proposed a simple, universal and effective way to test the agreement between model predictions and experimental data. The “chi-square test” he proposed is the most important and most commonly used statistical test. Most problems associated with estimating unknown model parameters and checking the agreement between the model and experimental data can be solved with its help.

Let there be an a priori ("pre-experimental") model of the object or process being studied (in statistics one speaks of the "null hypothesis" H₀), and the results of an experiment with this object. It is necessary to decide whether the model is adequate (does it correspond to reality)? Do the experimental results contradict our ideas of how reality works, or, in other words, should H₀ be rejected? Often this task can be reduced to comparing the observed (Oᵢ = Observed) and the expected under the model (Eᵢ = Expected) mean frequencies of occurrence of certain events. It is assumed that the observed frequencies were obtained in a series of N independent (!) observations made under constant (!) conditions. As a result of each observation, one of M events is recorded. These events cannot occur simultaneously (they are pairwise incompatible), and one of them necessarily occurs (their union is a certain event). The totality of all observations reduces to a table (vector) of frequencies (Oᵢ) = (O₁, …, O_M), which completely describes the result of the experiment. The value O₂ = 4 means that event number 2 occurred 4 times. The sum of the frequencies is O₁ + … + O_M = N. It is important to distinguish two cases: N fixed (non-random), and N a random variable. For a fixed total number of experiments N, the frequencies have a multinomial distribution. Let us illustrate this general scheme with a simple example.

Using the chi-square test to test simple hypotheses.

Let the model (null hypothesis H₀) be that the die is fair: all faces appear equally often, with probability pᵢ = 1/6, i = 1, …, M, M = 6. An experiment was conducted in which the die was thrown 60 times (N = 60 independent trials were made). According to the model, we expect all the observed frequencies Oᵢ of occurrence of 1, 2, …, 6 points to be close to their mean values Eᵢ = N·pᵢ = 60·(1/6) = 10. According to H₀, the vector of mean frequencies is (Eᵢ) = (N·pᵢ) = (10, 10, 10, 10, 10, 10). (Hypotheses in which the mean frequencies are completely known before the experiment begins are called simple.) If the observed vector (Oᵢ) were equal to (34, 0, 0, 0, 0, 26), it would be immediately clear that the model is wrong: the die cannot be fair, since only 1 and 6 appeared in 60 throws. The probability of such an event for a fair die is negligible: P = (2/6)⁶⁰ = 2.4·10⁻²⁹. However, the appearance of such obvious discrepancies between model and experiment is an exception. Let the vector of observed frequencies (Oᵢ) be (5, 15, 6, 14, 4, 16). Is this consistent with H₀? So we need to compare two frequency vectors, (Eᵢ) and (Oᵢ). Here the vector of expected frequencies (Eᵢ) is not random, while the vector of observed frequencies (Oᵢ) is random: in the next experiment (a new series of 60 throws) it will turn out different. It is useful to introduce a geometric interpretation of the problem and assume that in frequency space (here 6-dimensional) two points are given, with coordinates (5, 15, 6, 14, 4, 16) and (10, 10, 10, 10, 10, 10). Are they far enough apart to be considered incompatible with H₀? In other words, we need to:

  1. learn to measure distances between frequencies (points in frequency space);
  2. have a criterion for deciding what distance should be considered too ("implausibly") large, that is, inconsistent with H₀.

The square of the ordinary Euclidean distance would be equal to:

X²_Euclid = Σ(Oᵢ − Eᵢ)² = (5−10)² + (15−10)² + (6−10)² + (14−10)² + (4−10)² + (16−10)² = 154

In this case, the surfaces X²_Euclid = const are always spheres if we fix the values of Eᵢ and vary Oᵢ. Karl Pearson noted that Euclidean distance should not be used in frequency space. Thus, it is wrong to consider the points (O = 1030, E = 1000) and (O = 40, E = 10) to be equally far apart, although in both cases the difference is O − E = 30. After all, the higher the expected frequency, the larger the deviations from it that should be considered possible. Therefore, the points (O = 1030, E = 1000) should be considered "close", and the points (O = 40, E = 10) "far" from each other. It can be shown that if the hypothesis H₀ is true, then the fluctuations of the frequency Oᵢ about Eᵢ are of the order of the square root(!) of Eᵢ. Therefore Pearson proposed, when calculating the distance, to square not the differences (Oᵢ − Eᵢ) but the normalized differences (Oᵢ − Eᵢ)/√Eᵢ. Here is the formula for calculating the Pearson distance (it is actually the square of a distance):

X²_Pearson = Σ((Oᵢ − Eᵢ)/√Eᵢ)² = Σ(Oᵢ − Eᵢ)²/Eᵢ

In our example:

X²_Pearson = (5−10)²/10 + (15−10)²/10 + (6−10)²/10 + (14−10)²/10 + (4−10)²/10 + (16−10)²/10 = 15.4
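This arithmetic is easy to check by computing the Pearson distance directly; a minimal Python sketch, using the observed and expected frequencies of the dice example above:

```python
# Pearson distance for the dice example: observed (5, 15, 6, 14, 4, 16)
# against an expected frequency of 10 in each of the 6 categories.
observed = [5, 15, 6, 14, 4, 16]
expected = [10] * 6

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(x2)  # 15.4
```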

For a fair die all the expected frequencies Eᵢ are the same, but in general they are different, so the surfaces on which the Pearson distance is constant (X²_Pearson = const) turn out to be ellipsoids rather than spheres.

Now that the formula for the distance has been chosen, it is necessary to find out which distances should be considered "not too large" (consistent with H₀). For instance, what can we say about the distance of 15.4 we just calculated? In what percentage of cases (or with what probability) would we obtain a distance greater than 15.4 in experiments with a fair die? If this percentage is small (<0.05), then H₀ must be rejected. In other words, we need to find the distribution of the Pearson distance. If all expected frequencies Eᵢ are not too small (≥5) and H₀ is true, then the normalized differences (Oᵢ − Eᵢ)/√Eᵢ are approximately equivalent to standard Gaussian random variables: (Oᵢ − Eᵢ)/√Eᵢ ≈ N(0, 1). This means, for example, that in 95% of cases |(Oᵢ − Eᵢ)/√Eᵢ| < 1.96 ≈ 2 (the "two sigma" rule).

Explanation. The number of observations Oᵢ falling into the table cell with number i has a binomial distribution with parameters m = N·pᵢ = Eᵢ, σ = (N·pᵢ·(1 − pᵢ))^(1/2), where N is the number of observations (N ≫ 1) and pᵢ is the probability for a single observation to fall into the given cell (recall that the observations are independent and are carried out under constant conditions). If pᵢ is small, then σ ≈ (N·pᵢ)^(1/2) = √Eᵢ and the binomial distribution is close to the Poisson distribution, in which the mean number of observations is Eᵢ = λ and the standard deviation is σ = λ^(1/2) = √Eᵢ. For λ ≥ 5 the Poisson distribution is close to the normal N(m = Eᵢ = λ, σ = √Eᵢ = λ^(1/2)), and the normalized value (Oᵢ − Eᵢ)/√Eᵢ ≈ N(0, 1).

Pearson defined the random variable χ²_n, "chi-square with n degrees of freedom", as the sum of the squares of n independent standard normal random variables:

χ²_n = T₁² + T₂² + … + T_n², where all Tᵢ ~ N(0, 1) are independent, identically distributed standard normal random variables.

Let us try to understand clearly the meaning of this most important random variable in statistics. To do this, on the plane (for n = 2) or in space (for n = 3) we picture a cloud of points whose coordinates are independent and have the standard normal distribution f_T(x) ~ exp(−x²/2). On the plane, by the "two sigma" rule applied independently to both coordinates, 90% (0.95·0.95 ≈ 0.90) of the points are contained within the square −2 < x < 2, −2 < y < 2. For n = 2 the density of χ²₂ can be written explicitly:

f_χ²₂(a) = C·exp(−a/2) = 0.5·exp(−a/2).
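The definition can also be checked empirically; the sketch below, assuming NumPy and SciPy are available, draws pairs of independent standard normal values and compares the distribution of the sum of their squares with the χ²₂ law:

```python
import numpy as np
from scipy import stats

# The sum of squares of two independent N(0,1) variables should follow
# the chi-square law with 2 degrees of freedom, density 0.5*exp(-a/2).
rng = np.random.default_rng(0)
t = rng.standard_normal(size=(100_000, 2))
samples = (t ** 2).sum(axis=1)

print(samples.mean())           # close to n = 2
print((samples < 2).mean())     # empirical P(chi2_2 < 2)
print(stats.chi2.cdf(2, df=2))  # exact value: 1 - exp(-1) = 0.632...
```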

With a sufficiently large number of degrees of freedom n (n > 30), the chi-square distribution approaches the normal distribution N(m = n, σ = (2n)^½). This is a consequence of the "central limit theorem": a sum of identically distributed quantities with finite variance approaches the normal law as the number of terms increases.

In practice, one should remember that the mean square of the distance is m(χ²_n) = n and its variance is σ²(χ²_n) = 2n. From this it is easy to conclude which chi-square values should be considered too small or too large: most of the distribution lies in the range from n − 2·(2n)^½ to n + 2·(2n)^½.

So, Pearson distances significantly exceeding n + 2·(2n)^½ should be considered implausibly large (inconsistent with H₀). If the result is close to n + 2·(2n)^½, one should use tables, from which one can find out exactly in what proportion of cases such large chi-square values can appear.

It is important to know how to choose the right value for the number of degrees of freedom (abbreviated d.f.). It seemed natural to assume that n is simply equal to the number of categories: n = M. Pearson assumed as much in his paper. In the dice example this would mean n = 6. However, several years later it was shown that Pearson was mistaken: the number of degrees of freedom is always less than the number of categories when there are constraints linking the random variables Oᵢ. In the dice example the sum of the Oᵢ equals 60, so only 5 frequencies can be varied independently, and the correct value is n = 6 − 1 = 5. For this value of n we get n + 2·(2n)^½ = 5 + 2·(10)^½ = 11.3. Since 15.4 > 11.3, the hypothesis H₀ (the die is fair) must be rejected.
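The same decision can be reached with exact critical values instead of the rule of thumb; a sketch using scipy.stats:

```python
from scipy import stats

x2, df = 15.4, 5
print(stats.chi2.ppf(0.95, df))  # exact 5% critical value, ~11.07
print(stats.chi2.sf(x2, df))     # p-value, ~0.009 < 0.05 -> reject H0
```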

After this error was clarified, the existing χ² tables had to be supplemented, since initially they did not include the case n = 1 (the smallest number of categories being 2). Now it turns out that there can be cases in which the Pearson distance has the distribution χ²₁.

Example. In 100 coin tosses the number of heads was O₁ = 65 and of tails O₂ = 35. The number of categories is M = 2. If the coin is symmetric, the expected frequencies are E₁ = 50, E₂ = 50.

X²_Pearson = Σ(Oᵢ − Eᵢ)²/Eᵢ = (65−50)²/50 + (35−50)²/50 = 2·225/50 = 9.

The resulting value should be compared with the values that the random variable χ²₁ can take, defined as the square of a standard normal value: χ²₁ = T₁² ≥ 9 ⇔ T₁ ≥ 3 or T₁ ≤ −3. The probability of such an event is very low: P(χ²₁ ≥ 9) ≈ 0.003. Therefore the coin cannot be considered symmetric: H₀ must be rejected. That the number of degrees of freedom cannot equal the number of categories is seen from the fact that the sum of the observed frequencies always equals the sum of the expected ones, e.g. O₁ + O₂ = 65 + 35 = E₁ + E₂ = 50 + 50 = 100. Therefore the random points with coordinates O₁ and O₂ lie on the straight line O₁ + O₂ = E₁ + E₂ = 100, and the distance to the center turns out to be smaller than it would be if this restriction did not exist and the points could lie anywhere in the plane. Indeed, for two independent random variables with expectations E₁ = 50, E₂ = 50, the sum of their realizations would not always equal 100: for instance, the values O₁ = 60, O₂ = 55 would be admissible.
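In practice this whole calculation is one library call; a sketch with scipy.stats.chisquare, which for two categories uses df = M − 1 = 1 by default:

```python
from scipy import stats

# Coin example: 65 heads and 35 tails against expected 50/50.
result = stats.chisquare([65, 35], f_exp=[50, 50])
print(result.statistic)  # 9.0
print(result.pvalue)     # ~0.0027 -> reject H0: the coin is not symmetric
```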

Explanation. Let us compare the result of the Pearson criterion for M = 2 with what the de Moivre–Laplace formula gives when estimating the random fluctuations of the frequency of occurrence ν = K/N of an event with probability p in a series of N independent Bernoulli trials (K is the number of successes):

χ²₁ = Σ(Oᵢ − Eᵢ)²/Eᵢ = (O₁ − E₁)²/E₁ + (O₂ − E₂)²/E₂ = (Nν − Np)²/(Np) + (N(1 − ν) − N(1 − p))²/(N(1 − p)) =

= (Nν − Np)²·(1/p + 1/(1 − p))/N = (Nν − Np)²/(N·p·(1 − p)) = ((K − Np)/(Npq)^½)² = T²

The value T = (K − Np)/(Npq)^½ = (K − m(K))/σ(K) ≈ N(0, 1) when σ(K) = (Npq)^½ ≥ 3. We see that in this case Pearson's result coincides exactly with what the normal approximation to the binomial distribution gives.

So far we have considered simple hypotheses for which the expected average frequencies E i are completely known in advance. For information on how to choose the correct number of degrees of freedom for complex hypotheses, see below.

Using the chi-square test to test complex hypotheses

In the examples with the fair die and the coin, the expected frequencies could be determined before(!) the experiment. Such hypotheses are called "simple". In practice, "complex hypotheses" are more common, in which, to find the expected frequencies Eᵢ, one must first estimate one or several quantities (model parameters), and this can only be done from the experimental data. As a result, for "complex hypotheses" the expected frequencies Eᵢ turn out to depend on the observed frequencies Oᵢ and therefore themselves become random variables, varying with the results of the experiment. In the process of fitting the parameters the Pearson distance decreases: the parameters are chosen so as to improve the agreement between model and experiment. Therefore the number of degrees of freedom must decrease.

How does one estimate the model parameters? There are many different estimation methods: the "maximum likelihood method", the "method of moments", the "substitution method". One can, however, use no additional tools at all and find the parameter estimates by minimizing the Pearson distance. In the pre-computer era this approach was rarely used: it is inconvenient for manual calculation and, as a rule, cannot be carried out analytically. With computer calculation, numerical minimization is usually easy to perform, and the advantage of this method is its universality. So, according to the "chi-square minimization method", we choose the values of the unknown parameters so that the Pearson distance becomes smallest. (Incidentally, by studying how this distance changes under small displacements from the found minimum, one can estimate the accuracy of the estimate: construct confidence intervals.) After the parameters and the minimal distance itself have been found, one must again answer the question of whether this distance is small enough.

The general sequence of actions is as follows:

  1. Model selection (hypothesis H₀).
  2. Selection of categories and determination of the vector of observed frequencies Oᵢ.
  3. Estimation of the unknown model parameters and construction of confidence intervals for them (for example, by searching for the minimum of the Pearson distance).
  4. Calculation of the expected frequencies Eᵢ.
  5. Comparison of the found value of the Pearson distance X² with the critical chi-square value χ²_crit: the largest value that is still considered plausible, i.e. compatible with H₀. We find χ²_crit from tables by solving the equation

P(χ²_n > χ²_crit) = α,

where α is the "significance level" (also called the "size of the test" or the "probability of a type I error"); a typical value is α = 0.05.

Usually the number of degrees of freedom n is calculated by the formula

n = (number of categories) − 1 − (number of parameters estimated)

If X² > χ²_crit, the hypothesis H₀ is rejected; otherwise it is accepted. In α·100% of cases (that is, fairly rarely), this way of checking H₀ leads to a "type I error": the hypothesis H₀ is rejected erroneously.

Example. In 10 series of 100 seeds each, the number of seeds infected by the green-eyed fly was counted. The data obtained: Oᵢ = (16, 18, 11, 18, 21, 10, 20, 18, 17, 21).

Here the vector of expected frequencies is not known in advance. If the data are homogeneous and come from a binomial distribution, then one parameter is unknown: the proportion p of infected seeds. Note that the original table in fact contains not 10 but 20 frequencies, satisfying 10 constraints: 16 + 84 = 100, …, 21 + 79 = 100.

X² = (16 − 100p)²/(100p) + (84 − 100(1 − p))²/(100(1 − p)) + … + (21 − 100p)²/(100p) + (79 − 100(1 − p))²/(100(1 − p))

Combining the terms in pairs (as in the coin example), we obtain the form of the Pearson criterion that is usually written down immediately:

X² = (16 − 100p)²/(100p(1 − p)) + … + (21 − 100p)²/(100p(1 − p)).

Now, if the minimum of the Pearson distance is used as the method for estimating p, we must find the p for which X² = min. (The model tries, as far as possible, to "adjust" itself to the experimental data.)
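A minimal numerical sketch of this minimization for the seed data, assuming SciPy's scalar optimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

counts = np.array([16, 18, 11, 18, 21, 10, 20, 18, 17, 21])
n = 100  # seeds per series

def pearson_x2(p):
    # Combined-terms form of the criterion: sum (O_i - n*p)^2 / (n*p*(1-p))
    return np.sum((counts - n * p) ** 2 / (n * p * (1 - p)))

res = minimize_scalar(pearson_x2, bounds=(0.01, 0.99), method="bounded")
print(res.x)    # chi-square-minimizing estimate of p (near the mean 0.17)
print(res.fun)  # the minimal Pearson distance itself
```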

The Pearson criterion is the most universal of all used in statistics. It can be applied to univariate and multivariate data, quantitative and qualitative features. However, precisely because of its versatility, one should be careful not to make mistakes.

Important points

1. Selection of categories.

  • If the distribution is discrete, there is usually no arbitrariness in the choice of categories.
  • If the distribution is continuous, arbitrariness is inevitable. Statistically equivalent blocks can be used (all the Oᵢ equal, for example = 10); the interval lengths are then different. In manual calculation the intervals were usually made equal. Must the intervals be equal when studying the distribution of a one-dimensional trait? No.
  • The categories should be combined so that the expected (not the observed!) frequencies are not too small (≥5). Recall that it is they (the Eᵢ) that stand in the denominators when calculating X²! When analyzing one-dimensional characteristics it is permissible to violate this rule in the two extreme categories, allowing even E₁ = E_M = 1. If the number of categories is large and the expected frequencies are close to each other, X² is a good approximation to χ² even for Eᵢ = 2.

Parameter estimation. The use of "home-made", inefficient estimation methods can lead to inflated values of the Pearson distance.

Choosing the right number of degrees of freedom. If the parameter estimates are made not from the frequencies but directly from the data (for example, if the arithmetic mean is taken as the estimate of the mean), then the exact number of degrees of freedom n is unknown. We only know that it satisfies the inequality

(number of categories − 1 − number of parameters estimated) < n < (number of categories − 1)

Therefore it is necessary to compare X² with the critical values χ²_crit calculated over this whole range of n.

How should implausibly small chi-square values be interpreted? Should a coin be considered symmetric if in 10,000 tosses it lands heads 5,000 times? Previously, many statisticians believed that H₀ should be rejected in this case too. Now another approach is proposed: accept H₀, but subject the data and the analysis methodology to an additional check. There are two possibilities: either the too-small Pearson distance means that an increase in the number of model parameters was not accompanied by a proper decrease in the number of degrees of freedom, or the data themselves were falsified (perhaps unintentionally adjusted to the expected result).

Example. Two researchers, A and B, counted the proportion of recessive homozygotes aa in the second generation of an AA × aa monohybrid cross. According to Mendel's laws, this proportion is 0.25. Each researcher performed 5 experiments, and in each experiment 100 organisms were studied.

Results of A: 25, 24, 26, 25, 24. The researcher's conclusion: Mendel's law holds(?).

Results of B: 29, 21, 23, 30, 19. The researcher's conclusion: Mendel's law does not hold(?).

However, Mendel's law is statistical in nature, and a quantitative analysis of the results reverses the conclusions! Combining the five experiments into one, we arrive at a chi-square distribution with 5 degrees of freedom (a simple hypothesis is being tested):

X²_A = ((25−25)² + (24−25)² + (26−25)² + (25−25)² + (24−25)²)/(100·0.25·0.75) = 0.16

X²_B = ((29−25)² + (21−25)² + (23−25)² + (30−25)² + (19−25)²)/(100·0.25·0.75) = 5.17

Mean value m[χ²₅] = 5, standard deviation σ[χ²₅] = (2·5)^½ = 3.2.

Therefore, even without consulting tables, it is clear that the value X²_B is typical, while the value X²_A is implausibly small. According to the tables, P(χ²₅ < 0.16) < 0.0001.
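A sketch reproducing both statistics and the tail probabilities:

```python
from scipy import stats

def x2(counts, n=100, p=0.25):
    # Combined-terms Pearson distance for a series of binomial experiments.
    return sum((k - n * p) ** 2 for k in counts) / (n * p * (1 - p))

a = x2([25, 24, 26, 25, 24])
b = x2([29, 21, 23, 30, 19])
print(a, b)                     # 0.16 and ~5.17
print(stats.chi2.cdf(a, df=5))  # P(chi2_5 < 0.16), implausibly small
print(stats.chi2.sf(b, df=5))   # a typical value: H0 is not rejected
```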

This example is an adaptation of a real case from the 1930s (see Kolmogorov's paper "On Another Proof of Mendel's Laws"). Interestingly, researcher A was a proponent of genetics, while researcher B was an opponent.

Confusion in notation. One must distinguish the Pearson distance, whose calculation requires additional conventions, from the mathematical concept of a chi-square random variable. Under certain conditions the Pearson distance has a distribution close to chi-square with n degrees of freedom. It is therefore advisable NOT to denote the Pearson distance by the symbol χ²_n, but to use the similar yet different notation X².

The Pearson criterion is not omnipotent. There are infinitely many alternatives to H₀ that it is unable to take into account. Suppose we test the hypothesis that a characteristic is uniformly distributed, there are 10 categories, and the vector of observed frequencies is (130, 125, 121, 118, 116, 115, 114, 113, 111, 110). The Pearson criterion cannot "notice" that the frequencies decrease monotonically, and H₀ will not be rejected. If it were supplemented with a runs test, it would be!

The use of this criterion is based on a measure (statistic) of the discrepancy between the theoretical distribution F(x) and the empirical distribution F*_n(x) which approximately obeys the χ² distribution law. The hypothesis H₀ of the agreement of the distributions is tested by analyzing the distribution of this statistic. Application of the criterion requires the construction of a statistical series.

So let the sample be represented by a statistical series with M categories. The observed frequency of hits in the i-th category is nᵢ. According to the theoretical distribution law, the expected frequency of hits in the i-th category is Fᵢ. The difference between the observed and expected frequencies is (nᵢ − Fᵢ). To find the overall degree of discrepancy between F(x) and F*_n(x), we calculate the weighted sum of the squared differences over all categories of the statistical series:

χ² = Σ (nᵢ − Fᵢ)²/Fᵢ, where Fᵢ = N·pᵢ. (3.7)

As n increases without bound, the quantity χ² has a χ² distribution (it is asymptotically distributed as χ²). This distribution depends on the number of degrees of freedom k, i.e. the number of independent terms in expression (3.7). The number of degrees of freedom equals the number M minus the number of linear relations imposed on the sample. One relation always exists, because any one frequency can be calculated from the totality of frequencies in the remaining M − 1 categories. In addition, if the distribution parameters are not known in advance, there is a further restriction for each parameter fitted from the sample. If S distribution parameters are determined from the sample, the number of degrees of freedom is k = M − S − 1.

The acceptance region for hypothesis H₀ is determined by the condition χ² < χ²(k; α), where χ²(k; α) is the critical point of the χ² distribution at significance level α. The probability of a type I error is α; the probability of a type II error cannot be specified, because there are infinitely many different ways in which the distributions may fail to match. The power of the test depends on the number of categories and the sample size. The criterion is recommended for n > 200; its use is permissible for n > 40; under such conditions the criterion is valid (as a rule, it rejects an incorrect null hypothesis).

Algorithm for testing with the χ² criterion

1. Construct a histogram using the equal-probability method.

2. Based on the appearance of the histogram, put forward the hypothesis

H₀: f(x) = f₀(x),

H₁: f(x) ≠ f₀(x),

where f₀(x) is the probability density of the hypothesized distribution law (for example, uniform, exponential or normal).

Comment. The hypothesis about the exponential distribution law can be put forward if all the numbers in the sample are positive.


3. Calculate the value of the criterion using the formula

χ² = Σ (nᵢ − N·pᵢ)²/(N·pᵢ),

where nᵢ is the frequency of hits in the i-th interval; pᵢ is the theoretical probability of the random variable falling into the i-th interval under the condition that hypothesis H₀ is true.

The formulas for calculating pᵢ in the case of the exponential, uniform and normal laws are, respectively:

exponential law:

pᵢ = exp(−λAᵢ) − exp(−λBᵢ), (3.8)

where Aᵢ and Bᵢ are the left and right boundaries of the i-th interval, and A₁ = 0, B_M = +∞;

uniform law:

pᵢ = (Bᵢ − Aᵢ)/(b − a), (3.9)

where a and b are the boundaries of the uniform law;

normal law:

pᵢ = ½·(Ф((Bᵢ − m)/σ) − Ф((Aᵢ − m)/σ)), (3.10)

where A₁ = −∞, B_M = +∞.

Notes. After calculating all the probabilities pᵢ, check that the control relation Σpᵢ = 1 is satisfied. The function Ф(x) is odd, and Ф(+∞) = 1.

4. From the "chi-square" table in the Appendix, select the value χ²(k; α), where α is the prescribed significance level (α = 0.05 or α = 0.01), and k is the number of degrees of freedom, determined by the formula

k = M − 1 − S.

Here S is the number of parameters on which the hypothesized distribution law H₀ depends. The value of S is 2 for the uniform law, 1 for the exponential law, and 2 for the normal law.

5. If χ² > χ²(k; α), then hypothesis H₀ is rejected. Otherwise there is no reason to reject it: with probability 1 − β it is true, and with probability β it is false, but the value of β is unknown.

Example 3.1. Using the χ² criterion, put forward and test a hypothesis about the distribution law of the random variable X whose variation series, interval tables and distribution histograms are given in example 1.2. The significance level is α = 0.05.

Solution. Based on the appearance of the histograms, we put forward the hypothesis that the random variable X is distributed according to the normal law:

H₀: f(x) = N(m, σ);

H₁: f(x) ≠ N(m, σ).

The value of the criterion is calculated using the formula from step 3 of the algorithm.
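Since the sample of example 1.2 is not reproduced here, the following sketch runs the algorithm on a hypothetical sample; with equal-probability intervals under the fitted law, every pᵢ = 1/M:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=200)  # hypothetical sample

m_hat, s_hat = x.mean(), x.std(ddof=1)  # S = 2 estimated parameters
M = 10                                  # number of categories

# Interior boundaries of M equal-probability intervals under the fitted law.
edges = stats.norm.ppf(np.arange(1, M) / M, loc=m_hat, scale=s_hat)
n_i = np.bincount(np.searchsorted(edges, x), minlength=M)

expected = len(x) / M                   # N*p_i with p_i = 1/M
chi2_stat = np.sum((n_i - expected) ** 2 / expected)

k = M - 1 - 2                           # degrees of freedom: M - 1 - S
crit = stats.chi2.ppf(0.95, k)
print(chi2_stat, crit, chi2_stat > crit)  # True in the last slot -> reject H0
```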

Pearson's chi-square test

Lecture materials

Topic 6. Identifying differences in the distribution of a trait

Pearson criterion: purpose of the criterion, its description, scope of application, calculation algorithm.

Kolmogorov–Smirnov criterion for comparing the results of quantitative measurements: purpose of the criterion, its description, scope of application, calculation algorithm.

When studying this topic, it is necessary to take into account that both criteria are nonparametric; they operate with frequencies. Pay special attention to the decision rules for the considered criteria: these rules may be opposite. Please review carefully the limitations in the application of the criteria.

After studying the lecture material, answer the test questions and write down the answers in your notes.

Purpose of the criterion

The Pearson chi-square test can solve several problems, including comparing distributions.

The χ² test is used for three purposes:

1) to compare an empirical distribution of a characteristic with a theoretical one: uniform, normal or other;

2) to compare two, three or more empirical distributions of the same characteristic, that is, to check their homogeneity;

3) to assess stochastic (probabilistic) independence in a system of random events, etc.

The χ² criterion answers the question of whether different values of a characteristic occur with equal frequency in the empirical and theoretical distributions, or in two or more empirical distributions.

The advantage of the method is that it allows one to compare distributions of characteristics presented on any scale, beginning with the nominal scale. In the simplest case of an alternative distribution ("yes – no", "allowed a defect – did not allow a defect", "solved the problem – did not solve the problem", etc.), the χ² criterion can already be applied.

Description of the criterion

1. The sample size should be large enough: N > 30. For N < 30 the χ² criterion gives only very approximate values. The accuracy of the criterion increases for large N.

2. The theoretical frequency for each table cell should not be less than 5: f ≥ 5. This means that if the number of categories is fixed in advance and cannot be changed, then we cannot apply the χ² method without accumulating a certain minimum number of observations. If, for example, we want to test the assumption that the frequency of calls to the Trust telephone service is distributed unevenly over the 7 days of the week, we will need 5·7 = 35 calls. Thus, if the number of categories k is given in advance, as in this case, the minimum number of observations is N_min = 5k.



3. The selected categories must “scoop out” the entire distribution, that is, cover the entire range of variability of characteristics. In this case, the grouping into categories must be the same in all compared distributions.

4. It is necessary to make a “continuity correction” when comparing distributions of features that take only 2 values. When making a correction, the value of χ 2 decreases (see example with continuity correction).

5. The categories must be non-overlapping: if an observation is assigned to one category, then it can no longer be assigned to any other category. The sum of observations by rank must always be equal to the total number of observations.

Algorithm for calculating the χ 2 criterion

1. Draw up a contingency table of the feature values of the following type (essentially a two-dimensional variation series in which the frequencies of joint occurrence of the feature values are indicated): Table 19. The table contains conditional frequencies, denoted in general form as fᵢⱼ. For example, the number of gradations of characteristic x is 3 (k = 3), the number of gradations of characteristic y is 4 (m = 4); then i runs from 1 to k and j from 1 to m.

Table 19

           x₁     x₂     x₃
  y₁      f₁₁    f₂₁    f₃₁    f₋₁
  y₂      f₁₂    f₂₂    f₃₂    f₋₂
  y₃      f₁₃    f₂₃    f₃₃    f₋₃
  y₄      f₁₄    f₂₄    f₃₄    f₋₄
          f₁₋    f₂₋    f₃₋     N

2. Next, for convenience of calculation, transform the original contingency table into a table of the following form (Table 20), placing the columns of conditional frequencies one beneath another. Enter the names of the categories (columns 1 and 2) and the corresponding empirical frequencies (column 3) into the table.

Table 20

  x     y     fᵢⱼ    fᵢⱼ*    fᵢⱼ − fᵢⱼ*    (fᵢⱼ − fᵢⱼ*)²    (fᵢⱼ − fᵢⱼ*)²/fᵢⱼ*
  1     2     3      4       5             6                7
  x₁    y₁    f₁₁    f₁₁*
  x₁    y₂    f₁₂    f₁₂*
  x₁    y₃    f₁₃    f₁₃*
  x₁    y₄    f₁₄    f₁₄*
  x₂    y₁    f₂₁    f₂₁*
  x₂    y₂    f₂₂    f₂₂*
  x₂    y₃    f₂₃    f₂₃*
  x₂    y₄    f₂₄    f₂₄*
  x₃    y₁    f₃₁    f₃₁*
  x₃    y₂    f₃₂    f₃₂*
  x₃    y₃    f₃₃    f₃₃*
  x₃    y₄    f₃₄    f₃₄*
                                                             Σ = …

3. Next to each empirical frequency, write down the theoretical frequency (column 4), which is calculated by multiplying the total frequency of the corresponding row by the total frequency of the corresponding column and dividing by the total number of observations:

fᵢⱼ* = (fᵢ₋ · f₋ⱼ)/N

4. Calculate the differences fᵢⱼ − fᵢⱼ* and enter them in column 5.

5. Determine the number of degrees of freedom by the formula ν = (k − 1)(m − 1), where k is the number of categories of characteristic x and m is the number of categories of characteristic y.

If ν=1, make a correction for “continuity” and write it in column 5a.

The continuity correction consists in subtracting a further 0.5 from the absolute difference between the empirical and theoretical frequencies. The column headings in our table will then look like this (Table 21):

Table 21

  x    y    fᵢⱼ    fᵢⱼ*    fᵢⱼ − fᵢⱼ*    |fᵢⱼ − fᵢⱼ*| − 0.5    (|fᵢⱼ − fᵢⱼ*| − 0.5)²    (|fᵢⱼ − fᵢⱼ*| − 0.5)²/fᵢⱼ*
  1    2    3      4       5             5a                    6                        7

6. Square the resulting differences and enter them in the 6th column.

7. Divide the resulting squared differences by the theoretical frequency and write the results in the 7th column.

8. Sum the values of the 7th column. The resulting sum is designated χ²_emp.

9. Decision rule:

The calculated value of the criterion must be compared with the critical (or tabulated) value. The critical value depends on the number of degrees of freedom according to the table of critical values ​​of the Pearson χ 2 criterion (see Appendix 1.6).

If χ²_calc ≥ χ²_table, then the discrepancies between the distributions are statistically significant, or the characteristics change in a coordinated way, or the relationship between the characteristics is statistically significant.

If χ²_calc < χ²_table, then the discrepancies between the distributions are statistically insignificant, or the characteristics change in an uncoordinated way, or there is no relationship between the characteristics.

If the obtained value of the χ 2 criterion is greater than the critical value, we conclude that there is a statistical relationship between the studied risk factor and the outcome at the appropriate level of significance.

Example of calculating the Pearson chi-square test

Let us determine the statistical significance of the influence of the smoking factor on the incidence of arterial hypertension using the table discussed above:

1. Calculate the expected value for each cell (row total multiplied by column total, divided by the overall total): E₁₁ = 33.6, E₁₂ = 36.4, E₂₁ = 38.4, E₂₂ = 41.6.

2. Find the value of the Pearson chi-square test:

χ² = (40 − 33.6)²/33.6 + (30 − 36.4)²/36.4 + (32 − 38.4)²/38.4 + (48 − 41.6)²/41.6 = 4.396.

3. The number of degrees of freedom is f = (2 − 1)·(2 − 1) = 1. From the table we find the critical value of the Pearson chi-square criterion: at significance level p = 0.05 and 1 degree of freedom it is 3.841.

4. We compare the obtained value of the chi-square criterion with the critical one: 4.396 > 3.841; therefore, the dependence of the incidence of arterial hypertension on smoking is statistically significant. The significance level of this relationship corresponds to p < 0.05.
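The same example in Python; the cell counts (40 and 30 in one group, 32 and 48 in the other) are an assumption reconstructed from the expected values quoted above:

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[40, 30],
                  [32, 48]])

# correction=False disables the Yates correction, reproducing 4.396.
chi2, p, df, expected = chi2_contingency(table, correction=False)
print(chi2, p, df)  # ~4.396, p ~0.036, df = 1
print(expected)     # [[33.6, 36.4], [38.4, 41.6]]
```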

For a 2×2 table the Pearson chi-square criterion can also be calculated by the shortcut formula

χ² = N·(AD − BC)²/((A + B)(C + D)(A + C)(B + D)), where N = A + B + C + D.

For a 2×2 table, however, more accurate results are obtained with the Yates continuity correction:

χ² = N·(|AD − BC| − N/2)²/((A + B)(C + D)(A + C)(B + D)).

If χ² < χ²_crit, then H₀ is accepted; if χ² ≥ χ²_crit, then H₁ is accepted.

When the number of observations is small and the table cells contain frequencies less than 5, the chi-square criterion is not applicable, and Fisher's exact test is used to test hypotheses. The procedure for calculating this criterion is quite labor-intensive, and in this case it is better to use computer statistical analysis programs.

Using the contingency table one can calculate a measure of association between two qualitative characteristics: the Yule association coefficient Q (an analogue of the correlation coefficient), Q = (AD − BC)/(AD + BC).

Q lies in the range from −1 to 1. A coefficient close in absolute value to one indicates a strong association between the characteristics; if it equals zero, there is no association.

The phi-square coefficient φ² = χ²/N is used similarly.

BENCHMARK TASK

The table describes the relationship between the mutation frequency in groups of Drosophila with and without feeding



Contingency table analysis

To analyze the contingency table, the hypothesis H₀ is put forward, i.e., that the characteristic under study has no influence on the result of the study. For this, the expected frequencies are calculated and an expectation table is constructed.

Expectation table

  Groups            Number of cultures                                    Total
                    Gave mutations              Did not give mutations
                    Actual     Expected         Actual     Expected
  With feeding
  Without feeding
  Total

Method No. 1

1. Determine the expected frequency from the proportion: 2756 – X;

2. 3561 – 3124.

When the number of observations in the groups is small, the use of X² to compare actual and expected frequencies under discrete distributions involves some inaccuracy. To reduce the inaccuracy, the Yates correction is used.

Chi-square test.

The chi-square test, unlike the z test, is used to compare any number of groups.

Initial data: contingency table.

An example of a contingency table with a minimum dimension of 2*2 is given below. A, B, C, D – so-called real frequencies.

            Sign 1    Sign 2    Total
  Group 1   A         B         A + B
  Group 2   C         D         C + D
  Total     A + C     B + D     A + B + C + D

The calculation of the criterion is based on comparing the actual frequencies with the expected frequencies, which are calculated under the assumption that the compared characteristics have no mutual influence on each other. Thus, if the actual and expected frequencies are close enough to each other, there is no influence, which means the characteristics are distributed approximately equally across the groups.

The initial data for applying this method must be entered into a contingency table, the columns and rows of which indicate the variant values ​​of the characteristics being studied. The numbers in this table will be called real or experimental frequencies. Next, it is necessary to calculate the expected frequencies based on the assumption that the groups being compared are absolutely equal in the distribution of characteristics. In this case, the proportions for the total row or column “total” must be maintained in any row and column. Based on this, the expected frequencies are determined (see example).

Then the value of the criterion is calculated as the sum, over all cells of the contingency table, of the ratio of the squared difference between the actual frequency and the expected frequency to the expected frequency:

χ² = Σ (f − f*)²/f*,

where f is the actual frequency in a cell and f* is the expected frequency in that cell, calculated as the product of the corresponding row and column totals divided by the overall total:

f* = (row total · column total)/N, where N = A + B + C + D.

When calculating by the basic formula for a 2×2 table (and only for this table), it is also necessary to apply the Yates correction for continuity:

χ² = Σ (|f − f*| − 0.5)²/f*.
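A sketch of this cell-by-cell Yates-corrected calculation for a 2×2 table, using the hypothetical counts of the smoking example above; it agrees with the shortcut formula given earlier:

```python
def yates_chi2_2x2(a, b, c, d):
    # Cells are laid out as in the table above: A, B / C, D.
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n  # f* = row total * col total / N
            chi2 += (abs(observed[i][j] - expected) - 0.5) ** 2 / expected
    return chi2

print(yates_chi2_2x2(40, 30, 32, 48))  # ~3.74, vs ~4.40 without correction
```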

The critical value of the criterion is determined from the table (see appendix), taking into account the number of degrees of freedom and the significance level. The significance level is taken as standard: 0.05, 0.01 or 0.001. The number of degrees of freedom is determined as the product of the numbers of rows and columns of the contingency table, each reduced by one:

df = (r − 1)·(c − 1),

where r is the number of rows (the number of gradations of one characteristic) and c is the number of columns (the number of gradations of the other characteristic). This critical value can also be determined in a Microsoft Excel spreadsheet using the function =CHIINV(α, df) (ХИ2ОБР in Russian versions), where the significance level is entered for α and the number of degrees of freedom for df.

If the value of the chi-square test is greater than the critical value, then the hypothesis about the independence of the characteristics is rejected and they can be considered dependent at the selected level of significance.

This method has a limitation in applicability: the expected frequencies must be 5 or more (for a 2*2 table). For an arbitrary table, this restriction is less strict: all expected frequencies must be 1 or greater, and the proportion of cells with expected frequencies less than 5 must not exceed 20%.

From a contingency table of large dimension one can "isolate" tables of smaller dimension and calculate the value of the χ² criterion for them. These are effectively multiple comparisons, similar to those described for the Student's t test. In this case it is also necessary to apply a correction for multiple comparisons, depending on their number.

To test a hypothesis with the χ² criterion in Microsoft Excel spreadsheets, you can use the function

CHITEST(actual_range; expected_range)

(ХИ2ТЕСТ in Russian versions). Here actual_range is the original contingency table with the actual frequencies (only the cells with the frequencies themselves are indicated, without headings and "total"); expected_range is the array of expected frequencies. The expected frequencies must therefore be calculated separately.
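A Python analogue of this calculation, assuming the frequencies of the smoking example above; scipy.stats.chisquare accepts the flattened actual and expected arrays, with ddof bringing the degrees of freedom down to (r − 1)(c − 1) = 1:

```python
from scipy import stats

actual = [40, 30, 32, 48]
expected = [33.6, 36.4, 38.4, 41.6]

# The default df would be len(actual) - 1 = 3; ddof=2 reduces it to 1,
# as required for a 2x2 contingency table.
res = stats.chisquare(actual, f_exp=expected, ddof=2)
print(res.statistic, res.pvalue)  # ~4.396, p ~0.036
```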

Example:

An outbreak of an infectious disease has occurred in a certain city. There is an assumption that the source of contamination was drinking water. They decided to test this assumption using a sample survey of the urban population, according to which it was necessary to determine whether the amount of water drunk affects the number of cases.

The source data is shown in the following table:

Let us calculate the expected frequencies. The proportions must remain the same within the table. We therefore calculate, for example, what share each row makes up of the grand total, obtaining a coefficient for each row. The same proportion must hold in every cell of the corresponding row, so to calculate the expected frequency in a cell we multiply the row coefficient by the total of the corresponding column.

The number of degrees of freedom is (3 − 1)·(2 − 1) = 2. The critical value of the criterion is χ²(2; 0.001) = 13.816.

The experimental value is greater than the critical value (61.5 > 13.816), i.e. the hypothesis that the amount of water drunk has no effect on morbidity is rejected with an error probability of less than 0.001. Thus it can be argued that the water was the source of the disease.

Both criteria described have limitations, which are usually not met when the number of observations is small or individual gradations of the characteristics are rare. In this case Fisher's exact test is used. It is based on enumerating all possible ways of filling the contingency table given the group totals, so manual calculation is quite laborious, and it is better to use statistical software packages.

The z test is an analogue of Student's test, but it is used to compare qualitative characteristics. The experimental value of the criterion is calculated as the ratio of the difference of the proportions to the mean error of the difference of the proportions.

The critical values of the z criterion are the corresponding points of the standard normal distribution: z₀.₀₅ = 1.96, z₀.₀₁ = 2.58, z₀.₀₀₁ = 3.29.
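A sketch of this z test for two proportions with a pooled standard error; the inputs are the illustrative 2×2 counts used earlier, and for such a table z² equals the uncorrected χ²:

```python
import math
from scipy.stats import norm

def z_test_proportions(k1, n1, k2, n2):
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)  # pooled proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))   # two-sided p-value

z, p = z_test_proportions(40, 70, 32, 80)
print(z, p)    # z ~2.10, p ~0.036
print(z ** 2)  # ~4.396, the chi-square value without correction
```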



The chi-square test is used to compare any number of groups according to the values ​​of qualitative characteristics. The source data must be presented in the form of a contingency table. The experimental value of the criterion is calculated as the sum over all cells of the contingency table of the ratio of the square of the difference between the actual frequency and the expected frequency to the expected frequency. Expected frequencies are calculated under the assumption that the characteristics being compared are equal in all groups. Critical values ​​are determined from chi-square distribution tables.

LITERATURE.

Glanz S. – Chapter 5.

Rebrova O.Yu. – Chapters 10, 11.

Lakin G.F. – pp. 120–123.

Questions for self-testing of students.

1. In what cases can the z criterion be used?

2. What is the basis for calculating the experimental value of the z criterion?

3. How to find the critical value of the z criterion?

4. In what cases can the χ² criterion be applied?

5. What is the basis for calculating the experimental value of the χ² criterion?

6. How is the critical value of the χ² criterion found?

7. What else can be used to compare qualitative characteristics if the z and χ² criteria cannot be applied because of their restrictions?

Tasks.