Correlation and correlation coefficient

7.3.1. Coefficients of correlation and determination. The closeness of the relationship between factors and its direction (direct or inverse) can be quantified by calculating:

1) if it is necessary to determine a linear relationship between two factors, a paired correlation coefficient: 7.3.2 and 7.3.3 describe the operations for calculating the paired linear correlation coefficient according to Bravais–Pearson (r) and the paired Spearman rank correlation coefficient (ρ);

2) if we want to determine the relationship between two factors, but this relationship is clearly nonlinear, then the correlation ratio (η);

3) if we want to determine the relationship between one factor and a certain set of other factors, then the multiple correlation coefficient;

4) if we want to identify in isolation the connection of one factor with only one specific other factor among the group of factors influencing the first, for which we have to consider the influence of all the other factors unchanged, then the partial correlation coefficient.

Any correlation coefficient (r, ρ) cannot exceed 1 in absolute value, that is, –1 ≤ r (ρ) ≤ 1. If a value of 1 is obtained, this means that the dependence under consideration is not statistical but functional; if 0 is obtained, there is no correlation at all.

The sign of the correlation coefficient determines the direction of the relationship: the "+" sign (or no sign) means that the relationship is direct (positive); the "–" sign means that the connection is inverse (negative). The sign has nothing to do with the closeness of the connection.

The correlation coefficient characterizes the statistical relationship. But often it is necessary to determine another type of dependence, namely: what is the contribution of a certain factor to the formation of another factor associated with it. This kind of dependence is, with some degree of convention, characterized by the coefficient of determination (D), determined by the formula D = r²·100% (where r is the Bravais–Pearson correlation coefficient, see 7.3.2). If measurements were carried out on an order (rank) scale, then, with some loss of reliability, the value ρ (the Spearman correlation coefficient, see 7.3.3) can be substituted into the formula instead of r.

For example, if we obtained, as a characteristic of the dependence of factor B on factor A, the correlation coefficient r = 0.8 or r = –0.8, then D = 0.8²·100% = 64%, that is, about 2/3. Consequently, the contribution of factor A and its changes to the formation of factor B is approximately 2/3 of the total contribution of all factors in general.

7.3.2. Bravais–Pearson correlation coefficient. The procedure for calculating the Bravais–Pearson correlation coefficient (r) can be used only in cases where the relationship is considered on the basis of samples having a normal frequency distribution (normal distribution) and obtained by measurements on an interval or ratio scale. The calculation formula for this correlation coefficient is:



r = Σ(x_i – x̄)(y_i – ȳ) / (n·σ_x·σ_y).

What does the correlation coefficient show? Firstly, the sign of the correlation coefficient shows the direction of the relationship: the "–" sign indicates that the relationship is inverse, or negative (there is a tendency: as the values of one factor decrease, the corresponding values of the other factor increase, and as they increase, they decrease), while the absence of a sign or the "+" sign indicates a direct, or positive, connection (there is a tendency: as the values of one factor increase, the values of the other increase, and as they decrease, they decrease). Secondly, the absolute (sign-independent) value of the correlation coefficient indicates the closeness (strength) of the connection. It is generally accepted (rather arbitrarily): for r < 0.3 the correlation is very weak, and often it is simply not taken into account; at 0.3 ≤ r < 0.5 the correlation is weak; at 0.5 ≤ r < 0.7, average; at 0.7 ≤ r ≤ 0.9, strong; and finally, at r > 0.9, very strong. In our case (r ≈ –0.83) the relationship is inverse (negative) and strong.

Let us remind you: the values of the correlation coefficient can lie in the range from –1 to +1. If the value of r goes beyond these limits, it indicates that a mistake was made in the calculations. If |r| = 1, this means that the connection is not statistical but functional, which practically never happens in sports, biology, or medicine. Although, with a small number of measurements, a random selection of values giving a picture of a functional connection is possible, such a case is the less likely, the larger the volume of the compared samples (n), that is, the number of pairs of compared measurements.

The calculation table (Table 7.1) is constructed according to the formula.

Table 7.1.

Calculation table for Bravais–Pearson calculations

| x_i | y_i | (x_i – x̄) | (x_i – x̄)² | (y_i – ȳ) | (y_i – ȳ)² | (x_i – x̄)(y_i – ȳ) |
| 13.2 | 4.75 | 0.2 | 0.04 | –0.35 | 0.1225 | –0.07 |
| 13.5 | 4.70 | 0.5 | 0.25 | –0.40 | 0.1600 | –0.20 |
| 12.7 | 5.10 | –0.3 | 0.09 | 0.00 | 0.0000 | 0.00 |
| 12.5 | 5.40 | –0.5 | 0.25 | 0.30 | 0.0900 | –0.15 |
| 13.0 | 5.10 | 0.0 | 0.00 | 0.00 | 0.0000 | 0.00 |
| 13.2 | 5.00 | 0.2 | 0.04 | –0.10 | 0.0100 | –0.02 |
| 13.1 | 5.00 | 0.1 | 0.01 | –0.10 | 0.0100 | –0.01 |
| 13.4 | 4.65 | 0.4 | 0.16 | –0.45 | 0.2025 | –0.18 |
| 12.4 | 5.60 | –0.6 | 0.36 | 0.50 | 0.2500 | –0.30 |
| 12.3 | 5.50 | –0.7 | 0.49 | 0.40 | 0.1600 | –0.28 |
| 12.7 | 5.20 | –0.3 | 0.09 | 0.10 | 0.0100 | –0.03 |

Totals: Σx_i = 137, x̄ = 13.00; Σy_i = 56.1, ȳ = 5.1; Σ(x_i – x̄)² = 1.78; Σ(y_i – ȳ)² = 1.015; Σ(x_i – x̄)(y_i – ȳ) = –1.24.

Since σ_x = √(Σ(x_i – x̄)²/(n – 1)) = √(1.78/10) ≈ 0.42 and σ_y = √(1.015/10) ≈ 0.32,

r = –1.24 / (11·0.42·0.32) ≈ –1.24/1.48 ≈ –0.83.
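To cross-check the arithmetic, here is a minimal Python sketch (the variable names are ours) that reproduces the chapter's recipe from the sums in Table 7.1. Note that the chapter takes σ over n – 1 but divides by n, so a general-purpose routine such as numpy.corrcoef applied to the raw data would give a slightly different value:

```python
import math

# Sums taken from Table 7.1 (Bravais–Pearson example)
n = 11
sum_dxdy = -1.24   # Σ(xi - x̄)(yi - ȳ)
sum_dx2 = 1.78     # Σ(xi - x̄)²
sum_dy2 = 1.015    # Σ(yi - ȳ)²

sigma_x = math.sqrt(sum_dx2 / (n - 1))   # ≈ 0.42
sigma_y = math.sqrt(sum_dy2 / (n - 1))   # ≈ 0.32
r = sum_dxdy / (n * sigma_x * sigma_y)
print(round(r, 2))  # -0.84; the text's -0.83 comes from rounding σx, σy first
```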

In other words, you need to know very firmly that the correlation coefficient cannot exceed 1.0 in absolute value. This often allows you to avoid gross mistakes or, more precisely, to find and correct errors made during calculations.

7.3.3. Spearman correlation coefficient. As already mentioned, the Bravais–Pearson correlation coefficient (r) can be used only in cases where the analyzed factors are close to a normal frequency distribution and the variant values are obtained by measurements necessarily on a ratio scale or an interval scale, which happens if they are expressed in physical units. In other cases, the Spearman correlation coefficient (ρ) is found. However, this coefficient can also be applied in cases where it is permitted (and desirable!) to apply the Bravais–Pearson correlation coefficient. But it should be borne in mind that the Bravais–Pearson procedure has higher power ("resolving ability"), so r is more informative than ρ. Even with large n, the deviation of ρ may be on the order of ±10%.

The calculation formula for the Spearman correlation coefficient is:

ρ = 1 – 6·Σd_R² / (n·(n² – 1)).

Let us use our example to calculate ρ, but build another table (Table 7.2).

Table 7.2. Calculation table for the Spearman correlation coefficient

| x_i | y_i | R_x | R_y | |d_R| | d_R² |
| 13.2 | 4.75 | 8.5 | 3.0 | 5.5 | 30.25 |
| 13.5 | 4.70 | 11.0 | 2.0 | 9.0 | 81.00 |
| 12.7 | 5.10 | 4.5 | 6.5 | 2.0 | 4.00 |
| 12.5 | 5.40 | 3.0 | 9.0 | 6.0 | 36.00 |
| 13.0 | 5.10 | 6.0 | 6.5 | 0.5 | 0.25 |
| 13.2 | 5.00 | 8.5 | 4.5 | 4.0 | 16.00 |
| 13.1 | 5.00 | 7.0 | 4.5 | 2.5 | 6.25 |
| 13.4 | 4.65 | 10.0 | 1.0 | 9.0 | 81.00 |
| 12.4 | 5.60 | 2.0 | 11.0 | 9.0 | 81.00 |
| 12.3 | 5.50 | 1.0 | 10.0 | 9.0 | 81.00 |
| 12.7 | 5.20 | 4.5 | 8.0 | 3.5 | 12.25 |

Σd_R² = 423. Substituting the values:

ρ = 1 – 6·423 / (11·(121 – 1)) = 1 – 2538/1320 ≈ 1 – 1.9 ≈ –0.9.

We see that ρ turned out to be a little larger in absolute value than r, but the difference is not great. After all, with such a small n the values of r and ρ are very approximate and not very reliable; their actual value can vary widely, so a difference between r and ρ of 0.1 is insignificant. Usually ρ is considered an analogue of r, only less accurate. The signs of r and ρ show the direction of the connection.
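For comparison, here is a small Python sketch (the helper name average_ranks is ours) that recomputes the ranks and applies the simplified formula. With tied ranks the simplified formula is only approximate, and this recomputation yields Σd² = 429 and ρ ≈ –0.95, in the same region as the chapter's ≈ –0.9:

```python
def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend the group while the next value is tied with the current one
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1          # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

x = [13.2, 13.5, 12.7, 12.5, 13.0, 13.2, 13.1, 13.4, 12.4, 12.3, 12.7]
y = [4.75, 4.70, 5.10, 5.40, 5.10, 5.00, 5.00, 4.65, 5.60, 5.50, 5.20]

n = len(x)
d2 = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
rho = 1 - 6 * d2 / (n * (n * n - 1))
print(d2, rho)
```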

7.3.4. Application and verification of the reliability of correlation coefficients. Determining the degree of correlation between factors is necessary to control the development of the factor we need: to do this, we have to influence other factors that significantly influence it, and we need to know the extent of their effectiveness. It is necessary to know about the relationship between factors in order to develop or select ready-made tests: the informativeness of a test is determined by the correlation of its results with the manifestations of the characteristic or property that interests us. Without knowledge of correlations, any form of selection is impossible.

It was noted above that in sports and in general pedagogical, medical and even economic and sociological practice, there is often a need to determine the contribution that one factor makes to the formation of another. This is due to the fact that, in addition to the factor-cause under consideration, other factors also act on the target factor (the one that interests us), each making one contribution or another to it.

It is believed that the measure of the contribution of each factor-cause can be the coefficient of determination D_i = r²·100%. So, for example, if r = 0.6, i.e., the relationship between factors A and B is average, then D = 0.6²·100% = 36%. Knowing, therefore, that the contribution of factor A to the formation of factor B is approximately 1/3, you can, for example, devote approximately 1/3 of training time to the targeted development of this factor. If the correlation coefficient is r = 0.4, then D = r²·100% = 16%, or approximately 1/6, which is more than two times less, and by this logic only about 1/6 of training time should be devoted to its development.

The values of D_i for various significant factors give an approximate idea of the quantitative relationship of their influences on the target factor of interest to us, for the sake of improving which we, in fact, work on the other factors (for example, a running long jumper works to increase the speed of his sprint, since it is the factor that makes the most significant contribution to the formation of results in jumping).

Recall that when determining D, ρ may be substituted for r, although, of course, the accuracy of the determination then turns out to be lower.

Based on the sample correlation coefficient (calculated from sample data), one cannot draw a conclusion about the reliability of the fact that there is a connection between the factors under consideration in general. In order to make such a conclusion with some degree of validity, standard correlation significance criteria are used. Their use assumes a linear relationship between the factors and a normal frequency distribution in each of them (meaning not the sample but the general representation).

You can, for example, use Student's t-test. Its formula is:

t_p = r·√(n – 2) / √(1 – r²),

where r is the sample correlation coefficient under study and n is the volume of the compared samples. The resulting calculated value of the t-criterion (t_p) is compared with the table value at the significance level we have chosen and the number of degrees of freedom ν = n – 2. To avoid the computational work, you can use a special table of critical values of sample correlation coefficients (see Table 7.3) corresponding to the presence of a reliable connection between the factors (taking into account n and α).
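A hedged sketch of this check in Python (the helper name is ours; SciPy is assumed to be available for the critical value):

```python
import math
from scipy import stats  # used only for the Student's t critical value

def r_significant(r, n, alpha=0.05):
    """Test H0: no correlation, using t = r*sqrt(n-2)/sqrt(1-r^2), df = n-2."""
    t_calc = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # two-sided critical value
    return t_calc, t_crit, abs(t_calc) > t_crit

# The chapter's running example: r ≈ -0.83 with n = 11 pairs
print(r_significant(-0.83, 11))  # |t| ≈ 4.5 > 2.26, so the link is reliable
```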

Table 7.3.

Boundary values ​​for the reliability of the sample correlation coefficient

The number of degrees of freedom when determining correlation coefficients is taken equal to n – 2 (i.e., ν = n – 2). The values indicated in Table 7.3 are those at which the lower limit of the confidence interval of the true correlation coefficient is 0; that is, at such values it cannot be argued that a correlation occurs at all. If the value of the sample correlation coefficient is higher than the one indicated in the table, it can be assumed, at the appropriate level of significance, that the true correlation coefficient is not equal to zero.

But the answer to the question of whether a real connection exists between the factors under consideration leaves room for another question: within what interval does the true value of the correlation coefficient lie, as it might actually be for infinitely large n? This interval for any particular values of r and n of the compared factors can be calculated, but it is more convenient to use a system of graphs (a nomogram), where each pair of curves, constructed for some n specified above them, corresponds to the boundaries of the interval.

Fig. 7.4. Confidence limits of the sample correlation coefficient (α = 0.05). Each curve corresponds to the n indicated above it.

Referring to the nomogram in Fig. 7.4, one can determine the interval of values of the true correlation coefficient for the calculated values of the sample correlation coefficient at α = 0.05.

7.3.5. Correlation ratios. If the pairwise correlation is nonlinear, the correlation coefficient cannot be calculated; instead, correlation ratios are determined. Mandatory requirement: the characteristics must be measured on a ratio scale or an interval scale. One can calculate the correlation dependence of factor X on factor Y and the correlation dependence of factor Y on factor X; they differ. For a small volume n of the samples representing the factors, the correlation ratios can be calculated using the formulas:

correlation ratio η_x|y = √(1 – Σ(x_i – x̄_y)² / Σ(x_i – x̄)²);

correlation ratio η_y|x = √(1 – Σ(y_i – ȳ_x)² / Σ(y_i – ȳ)²).

Here x̄ and ȳ are the arithmetic means of samples X and Y, and x̄_y and ȳ_x are the intraclass arithmetic means. That is, x̄_y is the arithmetic mean of those values in the sample of factor X with which identical values are conjugate in the sample of factor Y (for example, if in factor X there are values 4, 6 and 5, with which in the sample of factor Y three variants with the same value 9 are associated, then x̄_y = (4 + 6 + 5)/3 = 5). Accordingly, ȳ_x is the arithmetic mean of those values in the sample of factor Y which are associated with identical values in the sample of factor X. Let us give an example and carry out the calculation:

X: 75 77 78 76 80 79 83 82 ; Y: 42 42 43 43 43 44 44 45 .

Table 7.4

Calculation table

| x_i | y_i | x̄_y | x_i – x̄ | (x_i – x̄)² | x_i – x̄_y | (x_i – x̄_y)² |
| 75 | 42 | 76 | –4 | 16 | –1 | 1 |
| 77 | 42 | 76 | –2 | 4 | 1 | 1 |
| 78 | 43 | 78 | –1 | 1 | 0 | 0 |
| 76 | 43 | 78 | –3 | 9 | –2 | 4 |
| 80 | 43 | 78 | 1 | 1 | 2 | 4 |
| 79 | 44 | 81 | 0 | 0 | –2 | 4 |
| 83 | 44 | 81 | 4 | 16 | 2 | 4 |
| 82 | 45 | 82 | 3 | 9 | 0 | 0 |
| x̄ = 79 | ȳ = 43 | | | Σ = 56 | | Σ = 18 |

Therefore, η_x|y = √(1 – 18/56) ≈ 0.82.
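The same computation in a short Python sketch (the names are ours); it groups the x values by the paired y value and uses exact, unrounded means, which is why the result matches the hand calculation above only to the first decimal:

```python
import math
from collections import defaultdict

X = [75, 77, 78, 76, 80, 79, 83, 82]
Y = [42, 42, 43, 43, 43, 44, 44, 45]

groups = defaultdict(list)      # x values grouped by the conjugate y value
for xi, yi in zip(X, Y):
    groups[yi].append(xi)

mean_x = sum(X) / len(X)
ss_total = sum((xi - mean_x) ** 2 for xi in X)
ss_within = sum(sum((xi - sum(g) / len(g)) ** 2 for xi in g)
                for g in groups.values())

eta_x_y = math.sqrt(1 - ss_within / ss_total)
print(eta_x_y)   # ≈ 0.82
```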

7.3.6. Partial and multiple correlation coefficients. When assessing the dependence between two factors by calculating correlation coefficients, we assume by default that no other factors affect this dependence. In reality this is not so. Thus, the relationship between weight and height is very significantly influenced by caloric intake, the amount of systematic physical activity, heredity, etc. When it is necessary, in assessing the relationship between two factors, to take into account the significant influence of other factors and at the same time, as it were, isolate ourselves from them by considering them unchanged, partial (otherwise called private) correlation coefficients are calculated.

Example: we need to evaluate the pairwise dependencies between three significantly acting factors X, Y and Z. Let us denote by r_XY(Z) the partial correlation coefficient between factors X and Y (with the value of factor Z considered unchanged), by r_ZX(Y) the partial correlation coefficient between factors Z and X (with a constant value of factor Y), and by r_YZ(X) the partial correlation coefficient between factors Y and Z (with a constant value of factor X). Using the calculated simple paired (Bravais–Pearson) correlation coefficients r_XY, r_XZ and r_YZ, one can calculate the partial correlation coefficients using the formulas:

r_XY(Z) = (r_XY – r_XZ·r_YZ) / √((1 – r²_XZ)·(1 – r²_YZ));

r_XZ(Y) = (r_XZ – r_XY·r_ZY) / √((1 – r²_XY)·(1 – r²_ZY));

r_ZY(X) = (r_ZY – r_ZX·r_YX) / √((1 – r²_ZX)·(1 – r²_YX)).

Partial correlation coefficients can also take values from –1 to +1. By squaring them, we obtain the corresponding partial coefficients of determination, also called partial measures of certainty (multiplied by 100 and expressed in %). Partial correlation coefficients differ more or less from the simple (full) paired coefficients, depending on the strength of the influence on them of the third factor (treated as unchanged). The null hypothesis H0, that is, the hypothesis about the absence of a connection (dependence) between factors X and Y, is tested (with a total number of factors k) by calculating the t-test using the formula: t_P = r_XY(Z)·(n – k)^(1/2)·(1 – r²_XY(Z))^(–1/2).

If t_P < t_αν, the hypothesis is accepted (we assume there is no dependence); if t_P ≥ t_αν, the hypothesis is rejected, that is, the dependence is considered real. t_αν is taken from the table of Student's t-test, where k is the number of factors taken into account (in our example 3) and the number of degrees of freedom is ν = n – 3. The other partial correlation coefficients are checked similarly (substituting r_XZ(Y) or r_ZY(X) into the formula instead of r_XY(Z)).


To assess the dependence of factor X on the combined action of several factors (here, factors Y and Z), calculate the values of the simple paired correlation coefficients and, using them, calculate the multiple correlation coefficient r_X(YZ):

r_X(YZ) = √((r²_XY + r²_XZ – 2·r_XY·r_XZ·r_YZ) / (1 – r²_YZ)).
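Both formulas are easy to put into a small Python sketch (the function names are ours; the input values below are purely illustrative):

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """r_XY(Z): correlation of X and Y with factor Z held constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def multiple_r(r_xy, r_xz, r_yz):
    """r_X(YZ): dependence of X on the combined action of Y and Z."""
    return math.sqrt((r_xy ** 2 + r_xz ** 2 - 2 * r_xy * r_xz * r_yz)
                     / (1 - r_yz ** 2))

print(partial_r(0.71, 0.50, 0.50))   # ≈ 0.61
print(multiple_r(0.71, 0.50, 0.50))  # ≈ 0.73
```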

7.3.7. Association coefficient. It is often necessary to quantify the relationship between qualitative characteristics, i.e., characteristics that cannot be represented (characterized) quantitatively, that are immeasurable. For example, the task is to find out whether there is a relationship between the sports specialization of those involved and such personal properties as introversion (the personality's focus on the phenomena of its own subjective world) and extraversion (the personality's focus on the world of external objects). We present the notation in Table 7.6.

Table 7.6.

| | Introversion | Extraversion |
| Sports games | a | b |
| Gymnastics | c | d |

Obviously, the numbers at our disposal here can only be distribution frequencies. In this case, the association coefficient (another name: the contingency coefficient) is calculated. Let us consider the simplest case: a relationship between two pairs of features; the contingency coefficient calculated here is called tetrachoric (see Table 7.7).

Table 7.7.

| a = 20 | b = 15 | a + b = 35 |
| c = 15 | d = 5 | c + d = 20 |
| a + c = 35 | b + d = 20 | n = 55 |

We make the calculation using the formula:

φ = (a·d – b·c) / √((a + b)·(c + d)·(a + c)·(b + d)) = (100 – 225) / √(35·20·35·20) = –125/700 ≈ –0.18.

The calculation of association coefficients (conjugation coefficients) with a larger number of characteristics involves calculations using a similar matrix of the appropriate order.
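For the 2×2 case, the calculation above fits in a few lines of Python (phi is the usual name for the four-cell coefficient):

```python
import math

# Frequencies from Table 7.7
a, b, c, d = 20, 15, 15, 5

phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(phi)   # (100 - 225) / 700 ≈ -0.18
```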

The Pearson coefficient can also be written as r_xy = (mean(x·y) – x̄·ȳ) / (σ(x)·σ(y)), where mean(x·y) is the average of the products x_i·y_i, x̄ and ȳ are the mean values of the samples, and σ(x), σ(y) are the standard deviations.
Besides, the Pearson linear pair correlation coefficient can be determined through the regression coefficient b: r_xy = b·σ(x)/σ(y), where σ(x) = S(x) and σ(y) = S(y) are the standard deviations and b is the coefficient of x in the regression equation y = a + bx.

Another form: r_xy = K_xy / (σ(x)·σ(y)), where K_xy is the correlation moment (covariance) of x and y.

To find the linear Pearson correlation coefficient, it is necessary to find the sample means x̄ and ȳ and their standard deviations σ_x = S(x), σ_y = S(y).

The linear correlation coefficient indicates the presence of a relationship and takes values ​​from –1 to +1 (see Chaddock scale). For example, when analyzing the closeness of the linear correlation between two variables, a paired linear correlation coefficient equal to –1 was obtained. This means that there is an exact inverse linear relationship between the variables.

You can calculate the value of the correlation coefficient using the given sample averages, or directly.


Geometric meaning of the correlation coefficient: r_xy shows how much the slopes of the two regression lines, y(x) and x(y), differ, that is, how much the results of minimizing the deviations in x and in y differ. The closer the two lines are to each other (the smaller the angle between them), the greater the absolute value of r_xy.
The sign of the correlation coefficient coincides with the sign of the regression coefficient and determines the slope of the regression line, i.e. general direction of dependence (increasing or decreasing). The absolute value of the correlation coefficient is determined by the degree of proximity of the points to the regression line.

Properties of the correlation coefficient

  1. |r_xy| ≤ 1;
  2. if X and Y are independent, then r_xy = 0; the converse is not always true;
  3. if |r_xy| = 1, then Y = aX + b; conversely, |r_xy(X, aX + b)| = 1, where a and b are constants, a ≠ 0;
  4. |r_xy(X, Y)| = |r_xy(a1·X + b1, a2·Y + b2)|, where a1, a2, b1, b2 are constants, a1·a2 ≠ 0.

Therefore, to check the direction of a connection, a hypothesis test based on the Pearson correlation coefficient is chosen, with a further reliability check using the t-test (see the example below).

Typical tasks (see also nonlinear regression)
The dependence of labor productivity y on the level of mechanization of work x (%) is studied according to data from 14 industrial enterprises. Statistical data is shown in the table.
Required:
1) Find estimates of the linear regression parameters y on x. Construct a scatterplot and plot the regression line on the scatterplot.
2) At the significance level α=0.05, test the hypothesis about the agreement of linear regression with the observation results.
3) With reliability γ=0.95, find confidence intervals for linear regression parameters.


Example. Based on the data given in Appendix 1 and corresponding to your option (Table 2), the following is required:

  1. Calculate the linear pair correlation coefficient and construct an equation for linear pair regression of one characteristic from another. One of the characteristics corresponding to your option will play the role of a factor (x), the other will play the role of a resultant (y). Establish cause-and-effect relationships between characteristics yourself based on economic analysis. Explain the meaning of the parameters of the equation.
  2. Determine the theoretical coefficient of determination and residual (unexplained by the regression equation) variance. Draw a conclusion.
  3. Assess the statistical significance of the regression equation as a whole at the five percent level using Fisher's F test. Draw a conclusion.
  4. Make a forecast of the expected value of the result trait y with the predicted value of the factor trait x being 105% of the average level x. Assess the accuracy of the forecast by calculating the forecast error and its confidence interval with a probability of 0.95.
Solution. The equation is y = a·x + b.
Average values: x̄ = Σx_i/n = 78/12 = 6.5; ȳ = Σy_i/n = 1503/12 = 125.25.
Dispersion: σ²(x) = Σx²_i/n – x̄² = 650/12 – 6.5² ≈ 11.92; σ²(y) = Σy²_i/n – ȳ² = 190617/12 – 125.25² ≈ 197.19.
Standard deviation: σ(x) ≈ 3.45; σ(y) ≈ 14.04.
Correlation coefficient: r_xy = (Σx_i·y_i/n – x̄·ȳ) / (σ(x)·σ(y)) = (10343/12 – 6.5·125.25) / (3.45·14.04) ≈ 0.99.
The connection between trait Y and factor X is strong and direct (determined by the Chaddock scale).
Regression equation: y(x) = 4.01·x + 99.18.
Regression coefficient: k = a = 4.01.
Determination coefficient: R² = 0.99² = 0.97, i.e., in 97% of cases changes in x lead to changes in y; in other words, the accuracy of the fit of the regression equation is high. Residual variance: 3%.
| x | y | x² | y² | x·y | y(x) | (y_i – ȳ)² | (y – y(x))² | (x – x̄)² |
| 1 | 107 | 1 | 11449 | 107 | 103.19 | 333.06 | 14.5 | 30.25 |
| 2 | 109 | 4 | 11881 | 218 | 107.2 | 264.06 | 3.23 | 20.25 |
| 3 | 110 | 9 | 12100 | 330 | 111.21 | 232.56 | 1.47 | 12.25 |
| 4 | 113 | 16 | 12769 | 452 | 115.22 | 150.06 | 4.95 | 6.25 |
| 5 | 120 | 25 | 14400 | 600 | 119.23 | 27.56 | 0.59 | 2.25 |
| 6 | 122 | 36 | 14884 | 732 | 123.24 | 10.56 | 1.55 | 0.25 |
| 7 | 123 | 49 | 15129 | 861 | 127.26 | 5.06 | 18.11 | 0.25 |
| 8 | 128 | 64 | 16384 | 1024 | 131.27 | 7.56 | 10.67 | 2.25 |
| 9 | 136 | 81 | 18496 | 1224 | 135.28 | 115.56 | 0.52 | 6.25 |
| 10 | 140 | 100 | 19600 | 1400 | 139.29 | 217.56 | 0.51 | 12.25 |
| 11 | 145 | 121 | 21025 | 1595 | 143.3 | 390.06 | 2.9 | 20.25 |
| 12 | 150 | 144 | 22500 | 1800 | 147.31 | 612.56 | 7.25 | 30.25 |
| Σ 78 | 1503 | 650 | 190617 | 10343 | 1503 | 2366.25 | 66.23 | 143 |

Note: the values ​​of y(x) are found from the resulting regression equation:
y(1) = 4.01*1 + 99.18 = 103.19
y(2) = 4.01*2 + 99.18 = 107.2
... ... ...
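The fitted coefficients can be cross-checked with a short Python sketch over the table's data (the variable names are ours):

```python
# Least-squares fit y = a*x + b for the worked example
x = list(range(1, 13))
y = [107, 109, 110, 113, 120, 122, 123, 128, 136, 140, 145, 150]
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
sxy = sum(u * v for u, v in zip(x, y))

a = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope ≈ 4.01
b = (sy - a * sx) / n                          # intercept ≈ 99.18
print(a, b)
```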

Significance of the correlation coefficient

We put forward hypotheses:
H 0: r xy = 0, there is no linear relationship between the variables;
H 1: r xy ≠ 0, there is a linear relationship between the variables;
In order to test, at the significance level α, the null hypothesis that the general correlation coefficient of a normal two-dimensional random variable is equal to zero, under the competing hypothesis H1: r ≠ 0, it is necessary to calculate the observed value of the criterion:

t_obs = r_xy·√(n – 2) / √(1 – r²_xy) = 0.986·√10 / √(1 – 0.986²) ≈ 18.7.

Using the Student's table we find t_table(n – m – 1; α/2) = t(10; 0.025) = 2.228.
Since t_obs > t_table, we reject the hypothesis that the correlation coefficient is equal to 0. In other words, the correlation coefficient is statistically significant.
Interval estimate for the correlation coefficient (confidence interval)


r – Δr ≤ r ≤ r + Δr, where m_r = √((1 – r²)/(n – 2)) ≈ 0.0529 is the standard error of the correlation coefficient.
Δr = t_table·m_r = 2.228·0.0529 = 0.118
0.986 – 0.118 ≤ r ≤ 0.986 + 0.118
Since the correlation coefficient cannot exceed 1, the confidence interval for it is: 0.868 ≤ r ≤ 1.
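The interval estimate above amounts to a few lines of Python (a sketch under the same assumptions; the upper bound is clipped at 1):

```python
import math

r, n, t_table = 0.986, 12, 2.228

m_r = math.sqrt((1 - r * r) / (n - 2))   # standard error ≈ 0.0529
delta = t_table * m_r                    # ≈ 0.118
lo, hi = r - delta, min(r + delta, 1.0)
print(lo, hi)                            # ≈ 0.868 ... 1.0
```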

Analysis of the accuracy of determining estimates of the regression coefficients

S_rest = √(Σ(y – y(x))² / (n – 2)) = √(66.23/10) ≈ 2.57

S_a = S_rest / (σ(x)·√n) = 2.57 / (3.45·√12) ≈ 0.2152

S_b = S_rest·√(Σx²/n) / (σ(x)·√n) ≈ 1.58

Confidence intervals for the dependent variable

Let us calculate the boundaries of the interval in which 95% of the possible values of Y will be concentrated with an unlimited number of observations and X = 7: (122.4; 132.11).
Testing hypotheses regarding the coefficients of a linear regression equation

1) t-statistics:

t_a = a/S_a = 4.01/0.2152 ≈ 18.6; t_b = b/S_b = 99.18/1.58 ≈ 62.8.

Since both values exceed t_table, the statistical significance of the regression coefficients is confirmed.
Confidence interval for regression equation coefficients
Let us determine the confidence intervals of the regression coefficients, which with a reliability of 95% will be as follows:
(a - t a S a ; a + t a S a)
(3.6205;4.4005)
(b - t b S b ; b + t b S b)
(96.3117;102.0519)


Correlation coefficients

Until now, we have only clarified the fact of the existence of a statistical relationship between two characteristics. Next, we will try to find out what conclusions can be drawn about the strength or weakness of this dependence, as well as about its type and direction. Criteria for quantifying the relationship between variables are called correlation coefficients or measures of connectivity. Two variables are positively correlated if there is a direct, unidirectional relationship between them. In a unidirectional relationship, small values ​​of one variable correspond to small values ​​of another variable, and large values ​​correspond to large values. Two variables correlate negatively with each other if there is an inverse, multidirectional relationship between them. With a multidirectional relationship, small values ​​of one variable correspond to large values ​​of another variable and vice versa. The values ​​of correlation coefficients always lie in the range from -1 to +1.

As the correlation coefficient between variables belonging to an ordinal scale, the Spearman coefficient is used, and for variables belonging to an interval scale, the Pearson (product-moment) correlation coefficient. It should be taken into account that each dichotomous variable, that is, a variable belonging to a nominal scale and having two categories, can be considered ordinal.

First, we will check if there is a correlation between the sex and psyche variables from the studium.sav file. In this case, the dichotomous variable sex can be considered ordinal. Follow these steps:

    Select Analyze → Descriptive Statistics → Crosstabs... from the command menu.

    Move the variable sex to the list of rows and the variable psyche to the list of columns.

    Click the Statistics... button. In the Crosstabs: Statistics dialog, select the Correlations checkbox. Confirm your selection with the Continue button.

    In the Crosstabs dialog, disable the display of tables by checking the Suppress tables checkbox. Click OK.

Spearman and Pearson correlation coefficients will be calculated and their significance tested:

Symmetric Measures

| | Value | Asymptotic Std. Error (a) | Approx. T (b) | Approx. Sig. |
| Interval by Interval: Pearson's R | .441 | .081 | 5.006 | .000 (c) |
| Ordinal by Ordinal: Spearman Correlation | .439 | .083 | 4.987 | .000 (c) |
| N of Valid Cases | 106 | | | |

Since there are no interval scale variables here, we will look at the Spearman correlation coefficient. It is 0.439 and is maximally significant (p<0,001).

For a verbal description of the correlation coefficient values, the following table is used:

Based on the table above, we can draw the following conclusions: there is a weak correlation between the sex and psyche variables (a conclusion about the strength of the dependence), and the variables correlate positively (a conclusion about the direction of the dependence).

In the psyche variable, smaller values ​​correspond to a negative mental state, and larger values ​​correspond to a positive one. In the sex variable, in turn, the value “1” corresponds to the female gender, and “2” to the male gender.

Consequently, the unidirectionality of the relationship can be interpreted as follows: female students assess their mental state more negatively than their male colleagues or, more likely, are more inclined to agree to such an assessment when surveyed. When constructing such interpretations, it must be borne in mind that a correlation between two traits does not necessarily equate to their functional or causal dependence. See Section 15.3 for more on this.

Now let's check the correlation between the alter and semester variables. Let's apply the method described above. We will get the following coefficients:

Symmetric Measures

| | Value |
| Interval by Interval: Pearson's R | .807 |
| Ordinal by Ordinal: Spearman Correlation | |
| N of Valid Cases | |

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Based on normal approximation.

Since the variables alter and semester are metric, we will consider the Pearson coefficient (moment of products). It is 0.807. There is a strong correlation between the alter and semester variables. The variables are positively correlated. Consequently, older students study in senior years, which, in fact, is not an unexpected conclusion.

Let's check the variables sozial (assessment of social status) and psyche for correlation. We will get the following coefficients:

Symmetric Measures

| | Value |
| Interval by Interval: Pearson's R | |
| Ordinal by Ordinal: Spearman Correlation | –.703 |
| N of Valid Cases | |

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Based on normal approximation.

In this case, we will look at the Spearman correlation coefficient; it is -0.703. There is a medium to strong correlation between the sozial and psyche variables (cutoff value 0.7). The variables correlate negatively, that is, the higher the value of the first variable, the lower the value of the second and vice versa. Since small values ​​of the sozial variable characterize a positive state (1 = very good, 2 = good), and large values ​​of psyche characterize a negative state (1 = extremely unstable, 2 = unstable), therefore, psychological difficulties are largely due to social problems.

The correlation coefficient is a value that can vary from +1 to –1. In the case of a complete positive correlation this coefficient is equal to plus 1 (one says that as the value of one variable increases, the value of the other variable increases too), and in the case of a completely negative correlation it is minus 1 (indicating an inverse relationship: as the values of one variable increase, the values of the other decrease).

Example 1:

A graph of the relationship between shyness and depression. As you can see, the points (subjects) are not located chaotically but line up around one line, and, looking at this line, we can say that the higher a person's shyness, the greater the depression, i.e., these phenomena are interconnected.

Example 2: a graph of shyness and sociability. We see that as shyness increases, sociability decreases. Their correlation coefficient is –0.43. Thus, a correlation coefficient from 0 to +1 indicates a directly proportional relationship (the more..., the more...), and a coefficient from –1 to 0 an inversely proportional one (the more..., the less...).

If the correlation coefficient is 0, the variables are uncorrelated (linearly independent); this by itself does not prove that they are completely independent of each other.

Correlation is a relationship in which the impact of individual factors appears only as a tendency (on average) in the mass observation of actual data. Examples of correlation dependencies are the dependence between the size of a bank's assets and the amount of the bank's profit, or between the growth of labor productivity and the length of service of employees.

Two systems are used to classify correlations according to their strength: general and specific.

General classification of correlations: 1) strong, or close, with correlation coefficient r > 0.70; 2) medium, at 0.50 < r < 0.69; 3) moderate, at 0.30 < r < 0.49; 4) weak, at 0.20 < r < 0.29; 5) very weak, at r < 0.19. The specific classification rests on the significance level reached rather than on the absolute value of r, so the researcher can judge the strength of the connection, and not just a correlation of a high level of significance.

The following table shows the names of the correlation coefficients for the various types of scales.

| | Dichotomous scale (1/0) | Rank (ordinal) scale | Interval and absolute scale |
| Dichotomous scale (1/0) | Pearson's association coefficient, Pearson's four-cell contingency coefficient | Rank-biserial correlation | Biserial correlation |
| Rank (ordinal) scale | Rank-biserial correlation | Spearman or Kendall rank correlation coefficient | The interval values are converted into ranks and a rank coefficient is used |
| Interval and absolute scale | Biserial correlation | The interval values are converted into ranks and a rank coefficient is used | Pearson correlation coefficient (linear correlation coefficient) |

At r = 0 there is no linear correlation. In this case, the group means of the variables coincide with their overall means, and the regression lines are parallel to the coordinate axes.

The equality r = 0 speaks only of the absence of a linear correlation dependence (uncorrelated variables), not of the absence of correlation in general, much less of the absence of statistical dependence.

Sometimes a finding of no correlation is more important than the presence of a strong correlation. A zero correlation between two variables may indicate that there is no influence of one variable on the other, provided we trust the measurement results.


Task No. 10 Correlation analysis

Concept of correlation

Correlation, or the correlation coefficient, is a statistical indicator of a probabilistic relationship between two variables measured on quantitative scales. Unlike a functional relationship, in which each value of one variable corresponds to a strictly defined value of another variable, a probabilistic connection is characterized by the fact that each value of one variable corresponds to a multitude of values of another variable. An example of a probabilistic relationship is the relationship between people's height and weight: it is clear that people of different weights can have the same height, and vice versa.

Correlation is a value ranging from –1 to +1 and is denoted by the letter r. If the value is closer to 1, there is a strong connection; if closer to 0, a weak one. A correlation value of less than 0.2 is considered weak, and a value greater than 0.5 high. If the correlation coefficient is negative, the relationship is inverse: the higher the value of one variable, the lower the value of the other.

Depending on the values taken by the coefficient r, various types of correlation can be distinguished:

A strict positive correlation is determined by the value r = 1. The term "strict" means that the value of one variable is uniquely determined by the values of the other variable, and the term "positive" means that as the values of one variable increase, the values of the other variable also increase.

Strict correlation is a mathematical abstraction and practically never occurs in real research.

A positive correlation corresponds to values 0 < r < 1.

The absence of correlation is determined by the value r = 0. A zero correlation coefficient indicates that the values of the variables are not linearly related to each other.

The absence of correlation, H0: r_xy = 0, is the formulation of the null hypothesis in correlation analysis.

A negative correlation corresponds to values –1 < r < 0.

A strict negative correlation is determined by the value r = –1. Like a strict positive correlation, it is an abstraction and does not find expression in practical research.


The method for calculating the correlation coefficient depends on the type of scale on which the variable values ​​are measured.

The Pearson correlation coefficient r is the basic one and can be used for variables measured on interval scales whose distribution of values corresponds to the normal distribution (product-moment correlation). The Pearson correlation coefficient gives fairly accurate results even in cases of non-normal distributions.

For distributions that are not normal, it is preferable to use the Spearman and Kendall rank correlation coefficients. They are called rank coefficients because the program pre-ranks the correlated variables.

SPSS calculates the Spearman correlation as follows: first the variables are converted to ranks, and then the Pearson formula is applied to the ranks.

The coefficient proposed by M. Kendall is based on the idea that the direction of the connection can be judged by comparing subjects in pairs: if for a pair of subjects the change in X coincides in direction with the change in Y, this indicates a positive connection; if it does not coincide, a negative one. This coefficient is used primarily by psychologists working with small samples. Since sociologists work with large amounts of data, enumerating the pairs and identifying the difference between the relative frequencies of agreements and inversions over all pairs of subjects in a sample is difficult. The most common coefficient is Pearson's.

Since the Pearson correlation coefficient r is basic and can be used (with some error depending on the type of scale and the degree of non-normality of the distribution) for all variables measured on quantitative scales, we will consider examples of its use and compare the results obtained with the results of measurements using other correlation coefficients.

Formula for calculating the Pearson coefficient r:

r_xy = Σ(X_i – X̄)·(Y_i – Ȳ) / ((N – 1)·σ_x·σ_y),

where X_i, Y_i are the values of the two variables; X̄, Ȳ are the average values of the two variables; σ_x, σ_y are the standard deviations; N is the number of observations.

Pairwise correlations

For example, suppose we want to find out how the answers correlate between different types of traditional values in students' ideas about an ideal place to work (variables a9.1, a9.3, a9.5, a9.7), and then between liberal values (a9.2, a9.4, a9.6, a9.8). These variables are measured on 5-item ordered scales.

We use the procedure "Analysis" → "Correlations" → "Pairwise". The Pearson coefficient is set by default in the dialog box; we use the Pearson coefficient.

The variables to be tested are transferred to the selection window: a9.1, a9.3, a9.5, a9.7.

By clicking OK we obtain the calculation:

Correlations (output fragment; for each pair of the variables below, SPSS reports the Pearson correlation and its two-sided significance):

a9.1.t. How important is it to have enough time for family and personal life?
a9.3.t. How important is it not to be afraid of losing your job?
a9.5.t. How important is it to have a boss who will consult with you when making this or that decision?
a9.7.t. How important is it to work in a well-coordinated team and feel like part of it?

** Correlation is significant at the 0.01 level (2-sided).

Table of quantitative values ​​of the constructed correlation matrix

Partial correlations:

First, let's build a pairwise correlation between these two variables:

Correlations (output fragment): the pairwise Pearson correlation between s8 ("Feel close to those who live next to you, neighbors") and s12 ("Feel close to their family"), with its two-sided significance.

**. The correlation is significant at the 0.01 level (2-sided).

Then we use the procedure for constructing a partial correlation: "Analysis" → "Correlations" → "Partial".

Let us assume that the value “It is important to independently determine and change the order of your work” in relation to the specified variables turns out to be the decisive factor under the influence of which the previously identified relationship will disappear or turn out to be insignificant.

Correlations (output fragment): the correlation between s8 ("Feel close to those who live next to you, neighbors") and s12 ("Feel close to their family"), with its two-sided significance, controlling for the excluded variable p16 ("Feel close to people who have the same income as you").

As can be seen from the table, under the influence of the control variable the relationship decreased slightly: from 0.120 to 0.102. However, this slight decrease does not allow us to state that the previously identified relationship is a reflection of a false correlation, because it remains quite high and allows us to reject the null hypothesis with a near-zero probability of error.

Correlation coefficient

The most accurate way to determine the closeness and nature of the correlation is to find the correlation coefficient. The correlation coefficient is a number determined by the formula:


r_xy = Σ(x_i – x̄)·(y_i – ȳ) / √(Σ(x_i – x̄)²·Σ(y_i – ȳ)²)   (32)

where r_xy is the correlation coefficient; x_i are the values of the first characteristic; y_i are the values of the second characteristic; x̄ is the arithmetic mean of the values of the first characteristic; ȳ is the arithmetic mean of the values of the second characteristic.

To use formula (32), we will build a table that will provide the necessary consistency in preparing numbers to find the numerator and denominator of the correlation coefficient.

As can be seen from formula (32), the sequence of actions is as follows: we find the arithmetic means of both characteristics, x̄ and ȳ; we find the differences between the values of each characteristic and its mean, (x_i – x̄) and (y_i – ȳ); then we find their products (x_i – x̄)(y_i – ȳ), whose sum gives the numerator of the correlation coefficient. To find its denominator, the differences (x_i – x̄) and (y_i – ȳ) must be squared, their sums found, and the square root taken of the product of these sums.

So for example 31, finding the correlation coefficient in accordance with formula (32) can be represented as follows (Table 50).

The resulting number of the correlation coefficient makes it possible to establish the presence, closeness and nature of the connection.

1. If the correlation coefficient is zero, there is no connection between the characteristics.

2. If the correlation coefficient is equal to one, the connection between the characteristics is so great that it turns into a functional one.

  3. The absolute value of the correlation coefficient does not go beyond the interval from zero to one: 0 ≤ |r_xy| ≤ 1. This makes it possible to judge the closeness of the connection: the closer the coefficient is to zero, the weaker the connection, and the closer it is to one, the closer the connection.

4. The “plus” sign of the correlation coefficient means direct correlation, the “minus” sign means inverse correlation.

Table 50

| x_i | y_i | (x_i – x̄) | (y_i – ȳ) | (x_i – x̄)(y_i – ȳ) | (x_i – x̄)² | (y_i – ȳ)² |
| 14.00 | 12.10 | –1.70 | –2.30 | +3.91 | 2.89 | 5.29 |
| 14.20 | 13.80 | –1.50 | –0.60 | +0.90 | 2.25 | 0.36 |
| 14.90 | 14.20 | –0.80 | –0.20 | +0.16 | 0.64 | 0.04 |
| 15.40 | 13.00 | –0.30 | –1.40 | +0.42 | 0.09 | 1.96 |
| 16.00 | 14.60 | +0.30 | +0.20 | +0.06 | 0.09 | 0.04 |
| 17.20 | 15.90 | +1.50 | +1.50 | +2.25 | 2.25 | 2.25 |
| 18.10 | 17.40 | +2.40 | +2.00 | +4.80 | 5.76 | 4.00 |
| Σ 109.80 | 101.00 | | | 12.50 | 13.97 | 13.94 |


Thus, the correlation coefficient calculated in example 31, r_xy = 12.50/√(13.97·13.94) ≈ +0.9, allows us to draw the following conclusions: there is a correlation between the magnitude of muscle strength of the right and left hands in the studied schoolchildren (the coefficient r_xy = +0.9 differs from zero); the relationship is very close (the coefficient is close to one); and the correlation is direct (the coefficient is positive), i.e., with an increase in the muscle strength of one hand, the strength of the other hand increases.
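A quick cross-check of Table 50 in Python, using formula (32) with exact rather than rounded means (hence agreement to the first decimal):

```python
import math

x = [14.0, 14.2, 14.9, 15.4, 16.0, 17.2, 18.1]  # right-hand strength
y = [12.1, 13.8, 14.2, 13.0, 14.6, 15.9, 17.4]  # left-hand strength

mx, my = sum(x) / len(x), sum(y) / len(y)
num = sum((a - mx) * (b - my) for a, b in zip(x, y))
den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                sum((b - my) ** 2 for b in y))
print(num / den)   # ≈ +0.9
```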

When calculating the correlation coefficient and using its properties, it should be taken into account that the conclusions give correct results when the characteristics are normally distributed and when the relationship between a large number of values ​​of both characteristics is considered.

In the considered example 31, only 7 values ​​of both characteristics were analyzed, which, of course, is not enough for such studies. We remind you here once again that the examples in this book in general and in this chapter in particular are in the nature of illustrating methods, and not a detailed presentation of any scientific experiments. As a result, a small number of feature values ​​were considered, measurements were rounded - all this was done so that cumbersome calculations did not obscure the idea of ​​the method.

Particular attention should be paid to the essence of the relationship under consideration. The correlation coefficient cannot lead to correct research results if the relationship between characteristics is analyzed formally. Let us return once again to example 31. Both considered characteristics were the values of muscle strength of the right and left hands. But let's imagine that by the characteristic x_i in example 31 (14.0; 14.2; 14.9; ...; 18.1) we mean the length of accidentally caught fish in centimeters, and by the characteristic y_i (12.1; 13.8; 14.2; ...; 17.4) the weight of instruments in the laboratory in kilograms. Having formally used the calculation apparatus to find the correlation coefficient and having obtained r_xy = +0.9 in this case as well, we would have to conclude that there is a close direct relationship between the length of the fish and the weight of the instruments. The meaninglessness of such a conclusion is obvious.

To avoid a formal approach to using the correlation coefficient, one should use any other method - mathematical, logical, experimental, theoretical - to identify the possibility of the existence of a correlation between characteristics, that is, to discover the organic unity of characteristics. Only after this can one begin to use correlation analysis and establish the magnitude and nature of the relationship.

In mathematical statistics there is also the concept of multiple correlation, a relationship between three or more characteristics. In these cases, a multiple correlation coefficient is used, composed of the paired correlation coefficients described above.

For example, the multiple correlation coefficient of three characteristics x_i, y_i, z_i is:

R_x(yz) = √((r²_xy + r²_xz – 2·r_xy·r_xz·r_yz) / (1 – r²_yz)),

where R_x(yz) is the multiple correlation coefficient, expressing how characteristic x_i depends on characteristics y_i and z_i;

r_xy is the correlation coefficient between characteristics x_i and y_i;

r_xz is the correlation coefficient between characteristics x_i and z_i;

r_yz is the correlation coefficient between characteristics y_i and z_i.

Correlation analysis

Correlation is a statistical relationship between two or more random variables (or variables that can be considered as such with some acceptable degree of accuracy), in which changes in one or more of these quantities lead to a systematic change in the other quantity or quantities. The mathematical measure of the correlation between two random variables is the correlation coefficient.

The correlation can be positive and negative (it is also possible that there is no statistical relationship - for example, for independent random variables). Negative correlation - correlation, in which an increase in one variable is associated with a decrease in another variable, and the correlation coefficient is negative. Positive correlation - correlation, in which an increase in one variable is associated with an increase in another variable, and the correlation coefficient is positive.

Autocorrelation - statistical relationship between random variables from the same series, but taken with a shift, for example, for a random process - with a time shift.

The method of processing statistical data, which consists in studying the coefficients (correlation) between variables, is called correlation analysis.

Correlation coefficient

The correlation coefficient, or pair correlation coefficient, in probability theory and statistics is an indicator of the nature of the change of two random variables. The correlation coefficient is denoted by the Latin letter R and can take values between –1 and +1. If the absolute value is closer to 1, there is a strong connection (if the correlation coefficient is equal to one, we speak of a functional connection), and if it is closer to 0, a weak one.

Pearson correlation coefficient

For metric quantities, the Pearson correlation coefficient is used, the exact formula of which was introduced by Francis Galton:

Let X, Y be two random variables defined on the same probability space. Then their correlation coefficient is given by the formula

R(X, Y) = cov(X, Y) / √(D[X]·D[Y]),

where cov denotes covariance and D variance, or, equivalently,

R(X, Y) = (M[XY] – M[X]·M[Y]) / √((M[X²] – M[X]²)·(M[Y²] – M[Y]²)),

where the symbol M denotes the mathematical expectation.

To graphically represent such a relationship, you can use a rectangular coordinate system with axes that correspond to both variables. Each pair of values ​​is marked with a specific symbol. This graph is called a “scatterplot.”

The method for calculating the correlation coefficient depends on the type of scale to which the variables belong. Thus, to measure variables with interval and quantitative scales, the Pearson correlation coefficient (product-moment correlation) must be used. If at least one of the two variables is on an ordinal scale or is not normally distributed, Spearman's rank correlation or Kendall's τ (tau) must be used. In the case where one of the two variables is dichotomous, a point-biserial correlation is used, and if both variables are dichotomous, a four-field correlation. Calculating the correlation coefficient between two non-dichotomous variables makes sense only when the relationship between them is linear (unidirectional).

Kendell correlation coefficient

Used to measure mutual disorder.

Spearman correlation coefficient

Properties of the correlation coefficient

  • Cauchy–Bunyakovsky inequality: if we take the covariance as the scalar product of two random variables, then the norm of a random variable X equals √D[X], and a consequence of the Cauchy–Bunyakovsky inequality is |R(X,Y)| ≤ 1.
  • The correlation coefficient equals ±1 if and only if X and Y are linearly dependent, Y = kX + b with k ≠ 0; moreover, in this case the signs of R(X,Y) and k coincide.

Correlation analysis

Correlation analysis- method of processing statistical data, which consists in studying coefficients ( correlations) between variables. In this case, correlation coefficients between one pair or many pairs of characteristics are compared to establish statistical relationships between them.

The goal of correlation analysis is to provide some information about one variable with the help of another variable. In cases where it is possible to achieve this goal, the variables are said to correlate. In the most general form, accepting the hypothesis of a correlation means that a change in the value of variable A will occur simultaneously with a proportional change in the value of B: if both variables increase, the correlation is positive; if one variable increases while the other decreases, the correlation is negative.

Correlation reflects only the linear dependence of quantities but does not reflect their functional connectedness. For example, if we calculate the correlation coefficient between the quantities A = sin(x) and B = cos(x), it will be close to zero, i.e., there is no linear dependence between the quantities. Meanwhile, the quantities A and B are obviously related functionally by the law sin²(x) + cos²(x) = 1.
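This remark is easy to verify numerically; a tiny Python sketch over one full period:

```python
import math

xs = [2 * math.pi * k / 1000 for k in range(1000)]
a = [math.sin(v) for v in xs]
b = [math.cos(v) for v in xs]

ma, mb = sum(a) / len(a), sum(b) / len(b)
num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
den = math.sqrt(sum((u - ma) ** 2 for u in a) *
                sum((v - mb) ** 2 for v in b))
print(num / den)   # ≈ 0: uncorrelated, though functionally dependent
```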

Limitations of Correlation Analysis



Graphs of distributions of pairs (x, y) with the corresponding correlation coefficients for each of them. Note that the correlation coefficient reflects a linear relationship (top row) but does not describe a curvilinear relationship (middle row) and is not at all suitable for describing complex nonlinear relationships (bottom row).
  1. Application is possible if there is a sufficient number of cases for study: the required number of observations for a particular type of correlation coefficient ranges from 25 to 100 pairs.
  2. The second limitation follows from the correlation analysis hypothesis, which includes linear dependence of variables. In many cases, when it is reliably known that a relationship exists, correlation analysis may not yield results simply because the relationship is nonlinear (expressed, for example, as a parabola).
  3. The mere fact of correlation does not provide grounds for asserting which of the variables precedes or causes changes, or that the variables are generally causally related to each other, for example, due to the action of a third factor.

Application area

This method of processing statistical data is very popular in economics and social sciences (in particular in psychology and sociology), although the scope of application of correlation coefficients is extensive: quality control of industrial products, metallurgy, agrochemistry, hydrobiology, biometrics and others.

The popularity of the method is due to two factors: correlation coefficients are relatively easy to calculate, and their use does not require special mathematical training. Combined with its ease of interpretation, the ease of application of the coefficient has led to its widespread use in the field of statistical data analysis.

False correlation

Often, the tempting simplicity of correlation research encourages the researcher to make false intuitive conclusions about the presence of a cause-and-effect relationship between pairs of characteristics, while correlation coefficients establish only statistical relationships.

Modern quantitative social science methodology has, in fact, abandoned attempts to establish cause-and-effect relationships between observed variables using empirical methods. Therefore, when researchers in the social sciences talk about establishing relationships between the variables being studied, either a general theoretical assumption or a statistical dependence is implied.

see also

  • Autocorrelation function
  • Cross-correlation function
  • Covariance
  • Determination coefficient
  • Regression analysis


Regression analysis allows you to evaluate how one variable depends on another and what is the spread of values ​​of the dependent variable around the straight line that defines the dependence. These estimates and corresponding confidence intervals predict the value of the dependent variable and determine the accuracy of that prediction.

The results of regression analysis can only be presented in a fairly complex digital or graphical form. However, we are often not interested in predicting the value of one variable based on the value of another, but simply in characterizing the closeness (strength) of the connection between them, expressed in one number.

This characteristic is called the correlation coefficient, usually denoted by the letter r. The correlation coefficient can take values from -1 to +1. The sign of the correlation coefficient shows the direction of the connection (direct or inverse), and the absolute value indicates the closeness of the connection. A coefficient equal to -1 defines as strong a connection as one equal to +1. In the absence of a connection, the correlation coefficient is zero.

Fig. 8.10 shows examples of dependencies and the corresponding values of r. We will consider two correlation coefficients.

The Pearson correlation coefficient is intended to describe the linear relationship of quantitative traits; like regression analysis, it requires a normal distribution. When people simply talk about the “correlation coefficient,” they almost always mean the Pearson correlation coefficient, and that is exactly what we will do.

Spearman's rank correlation coefficient can be used when the relationship is nonlinear, and not only for quantitative but also for ordinal characteristics. This is a nonparametric method; it does not require any particular type of distribution.

We have already talked about quantitative, qualitative and ordinal characteristics in Chapter 5. Quantitative characteristics are ordinary numerical data, such as height, weight, temperature. The values of a quantitative characteristic can be compared with each other, and it can be said which of them is greater, by how much and by what factor. For example, if one Martian weighs 15 g and another 10 g, then the first is heavier than the second by 5 g, or one and a half times. The values of an ordinal characteristic can also be compared by saying which one is greater, but it is impossible to say by how much or by what factor. In medicine, ordinal characteristics are quite common. For example, the results of a vaginal Pap smear are assessed on the following scale: 1) normal, 2) mild dysplasia, 3) moderate dysplasia, 4) severe dysplasia, 5) cancer in situ. Both quantitative and ordinal characteristics can be arranged in order; this common property underlies a large group of nonparametric criteria, which includes the Spearman rank correlation coefficient. We will get acquainted with other nonparametric criteria in Chapter 10.
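To make the rank idea concrete, here is a small sketch (our illustration; the helpers ranks and spearman are hypothetical names, not a library API) showing that Spearman's coefficient is simply the Pearson coefficient computed on the ranks of the data:

```python
import numpy as np

def ranks(a):
    """Rank values from 1..n, averaging the ranks of tied values."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a)
    r = np.empty(len(a))
    r[order] = np.arange(1, len(a) + 1)
    for v in np.unique(a):          # average ranks for ties
        mask = a == v
        r[mask] = r[mask].mean()
    return r

def spearman(x, y):
    """Spearman's coefficient = Pearson's coefficient on the ranks."""
    return np.corrcoef(ranks(x), ranks(y))[0, 1]

# Monotonic but nonlinear: Spearman sees a perfect relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(spearman(x, np.exp(x)))  # 1.0
```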

Pearson correlation coefficient

And yet, why can’t regression analysis be used to describe the closeness of the connection? One could use the residual standard deviation as a measure of the strength of the relationship. However, if you swap the dependent and independent variables, the residual standard deviation, like other indicators of regression analysis, will be different.

Let's take a look at Fig. 8.11. Based on the sample of 10 Martians known to us, two regression lines were constructed. In one case, weight is the dependent variable; in the second, the independent one. The regression lines differ noticeably.



Figure 8.11. If you swap x and y, the regression equation will be different, but the correlation coefficient will remain the same.

It turns out that the relationship of height to weight is one thing, and of weight to height another. This asymmetry is what prevents regression analysis from being used directly to characterize the strength of a connection. The correlation coefficient, although its idea stems from regression analysis, is free from this drawback.
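A quick numerical sketch of this asymmetry (our illustration with simulated data, not the Martian sample): the two regression slopes differ, while r is the same in both directions.

```python
import numpy as np

# Swapping the dependent and independent variables changes the
# regression slope, but the correlation coefficient is unchanged.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

b_yx = np.polyfit(x, y, 1)[0]      # slope of the y-on-x regression
b_xy = np.polyfit(y, x, 1)[0]      # slope of the x-on-y regression
r = np.corrcoef(x, y)[0, 1]

print(b_yx, b_xy)                  # two different slopes
print(r, np.corrcoef(y, x)[0, 1])  # the same r either way
```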

Here is the formula:

r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}},

where \bar{X} and \bar{Y} are the mean values of the variables X and Y. The expression for r is “symmetric”: by swapping X and Y, we get the same value. The correlation coefficient takes values from -1 to +1. The closer the connection, the greater the absolute value of the correlation coefficient. The sign shows the direction of the connection: when r > 0 we speak of a direct correlation (with an increase in one variable, the other also increases), when r < 0 of an inverse one (with an increase in one variable, the other decreases).

Let's take the example with the 10 Martians, which we have already considered from the point of view of regression analysis, and calculate the correlation coefficient. The initial data and intermediate results of the calculations are given in Table 8.3. The sample size is n = 10; the average height is

\bar{X} = \sum X / n = 369/10 = 36.9 and the average weight \bar{Y} = \sum Y / n = 103.8/10 = 10.38.

We find \sum (X - \bar{X})(Y - \bar{Y}) = 99.9, \sum (X - \bar{X})^2 = 224.8, \sum (Y - \bar{Y})^2 = 51.9.

Let's substitute the obtained values ​​into the formula for the correlation coefficient:

r = \frac{99.9}{\sqrt{224.8 \times 51.9}} \approx 0.92.

The value of r is close to 1, which indicates a close relationship between height and weight. To better understand which correlation coefficient should be considered large and which insignificant, take a look at Table 8.4, which shows the correlation coefficients for the examples that we examined earlier.

Table 8.3. Calculation of the correlation coefficient

  X      Y     X - X̄   Y - Ȳ   (X - X̄)(Y - Ȳ)   (X - X̄)²   (Y - Ȳ)²
  31     7.8   -5.9    -2.6         15.3            34.8       6.8
  32     8.3   -4.9    -2.1         10.3            24.0       4.4
  33     7.6   -3.9    -2.8         10.9            15.2       7.8
  34     9.1   -2.9    -1.3          3.8             8.4       1.7
  35     9.6   -1.9    -0.8          1.5             3.6       0.6
  35     9.8   -1.9    -0.6          1.1             3.6       0.4
  40    11.8    3.1     1.4          4.3             9.6       2.0
  41    12.1    4.1     1.7          7.0            16.8       2.9
  42    14.7    5.1     4.3         22.0            26.0      18.5
  46    13.0    9.1     2.6         23.7            82.8       6.8
 369   103.8    0.0     0.2         99.9           224.8      51.9
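As a check, here is a short sketch (ours, not from the text) that reproduces the Table 8.3 computation:

```python
import numpy as np

# Heights X and weights Y of the 10 Martians from the text.
X = np.array([31, 32, 33, 34, 35, 35, 40, 41, 42, 46], dtype=float)
Y = np.array([7.8, 8.3, 7.6, 9.1, 9.6, 9.8, 11.8, 12.1, 14.7, 13.0])

dx, dy = X - X.mean(), Y - Y.mean()
print((dx * dy).sum())   # ~99.9
print((dx ** 2).sum())   # ~224.8
print((dy ** 2).sum())   # ~51.9

r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(f"r = {r:.2f}")    # ~0.92
```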



Relationship between regression and correlation

All the examples of correlation coefficients (Table 8.4) were originally used by us to construct regression lines. Indeed, there is a close relationship between the correlation coefficient and the parameters of regression analysis, which we will now demonstrate. The different ways of presenting the correlation coefficient that we will obtain will allow us to better understand the meaning of this indicator.

Recall that the regression equation is constructed in such a way as to minimize the sum of squared deviations from the regression line.


Let us denote this minimal sum of squares S_res (this quantity is called the residual sum of squares), and the sum of squared deviations of the values of the dependent variable Y from its mean \bar{Y} by S_tot (the total sum of squares). Then

r^2 = 1 - \frac{S_{\mathrm{res}}}{S_{\mathrm{tot}}}.

The quantity r² is called the coefficient of determination; it is simply the square of the correlation coefficient. The coefficient of determination shows the strength of the connection, but not its direction.

From the above formula it is clear that if the values of the dependent variable lie exactly on the regression line, then S_res = 0, and thus r = +1 or r = -1, that is, there is a linear functional relationship between the dependent and independent variables: for any value of the independent variable, you can accurately predict the value of the dependent variable. On the contrary, if the variables are not related to each other at all, then S_res = S_tot, and then r = 0.

It can also be seen that the coefficient of determination is equal to that share of the total variation S_tot that is caused or, as they say, explained by the linear regression.

The residual sum of squares S_res is related to the residual variance s²_{y|x} by S_res = (n - 2) s²_{y|x}, and the total sum of squares S_tot to the variance s²_y by S_tot = (n - 1) s²_y. In this case

r^2 = 1 - \frac{n-2}{n-1} \cdot \frac{s_{y|x}^2}{s_y^2}.

This formula allows us to judge how the correlation coefficient depends on the share of the residual variance in the total variance, s²_{y|x}/s²_y: the smaller this share, the greater (in absolute value) the correlation coefficient, and vice versa.
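This identity is easy to verify numerically. A minimal sketch (ours), using an ordinary least-squares fit on the Martian data:

```python
import numpy as np

# Verifying r^2 = 1 - S_res/S_tot and its variance form on the 10 Martians.
X = np.array([31, 32, 33, 34, 35, 35, 40, 41, 42, 46], dtype=float)
Y = np.array([7.8, 8.3, 7.6, 9.1, 9.6, 9.8, 11.8, 12.1, 14.7, 13.0])
n = len(X)

b, a = np.polyfit(X, Y, 1)                 # least-squares line Y = a + b*X
S_res = ((Y - (a + b * X)) ** 2).sum()     # residual sum of squares
S_tot = ((Y - Y.mean()) ** 2).sum()        # total sum of squares

r = np.corrcoef(X, Y)[0, 1]
print(r ** 2, 1 - S_res / S_tot)           # the two values coincide

# Variance form: s2_yx = S_res/(n-2), s2_y = S_tot/(n-1).
s2_yx = S_res / (n - 2)
s2_y = S_tot / (n - 1)
print(1 - (n - 2) / (n - 1) * s2_yx / s2_y)  # again r^2
```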

We have seen that the correlation coefficient reflects the closeness of the linear relationship between the variables. However, if we are talking about predicting the value of one variable from the value of another, the correlation coefficient should not be relied upon too heavily. For example, the data in Fig. 8.7 correspond to a very high correlation coefficient (r = 0.92), yet the width of the confidence region shows that the prediction uncertainty is quite significant. Therefore, even with a large correlation coefficient, be sure to calculate the confidence region.


Finally, let us present the relationship between the correlation coefficient and the slope of the regression line b:

r = b \frac{s_X}{s_Y},

where b is the slope of the regression line, and s_X and s_Y are the standard deviations of the variables.
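A quick check of this relation on the Martian data (our sketch; ddof=1 gives the sample standard deviations, and the ddof factor cancels in the ratio anyway):

```python
import numpy as np

# Checking r = b * s_X / s_Y on the 10 Martians.
X = np.array([31, 32, 33, 34, 35, 35, 40, 41, 42, 46], dtype=float)
Y = np.array([7.8, 8.3, 7.6, 9.1, 9.6, 9.8, 11.8, 12.1, 14.7, 13.0])

b = np.polyfit(X, Y, 1)[0]                 # slope of the regression line
r_from_b = b * X.std(ddof=1) / Y.std(ddof=1)
print(r_from_b, np.corrcoef(X, Y)[0, 1])   # the two values coincide
```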

If we do not take into account the case s_X = 0, then the correlation coefficient is zero if and only if b = 0. We will now use this fact to assess the statistical significance of the correlation.

Statistical significance of correlation

Since b = 0 implies r = 0, the hypothesis of no correlation is equivalent to the hypothesis of a zero slope of the regression line. Therefore, to assess the statistical significance of the correlation, we can use the already familiar formula for assessing the significance of the difference of b from zero:

t = \frac{b}{s_b},

where s_b is the standard error of the slope.

Here the number of degrees of freedom is ν = n - 2. However, if the correlation coefficient has already been calculated, it is more convenient to use the formula

t = \frac{r}{\sqrt{\dfrac{1-r^2}{n-2}}}.

The number of degrees of freedom here is also ν = n - 2.
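A minimal sketch of this test (ours; t_for_correlation is a hypothetical helper name, not from the text):

```python
import math

def t_for_correlation(r, n):
    """t statistic for a correlation coefficient, nu = n - 2."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# The Martian example: r ~ 0.92, n = 10 gives t ~ 6.64 with nu = 8.
print(round(t_for_correlation(0.92, 10), 2))
```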

Despite the outward dissimilarity of the two formulas for t, they are identical. Indeed, since

r^2 = 1 - \frac{n-2}{n-1} \cdot \frac{s_{y|x}^2}{s_y^2},

we can express s_{y|x} in terms of r; substituting this value of s_{y|x} into the formula for the standard error of the slope turns the first formula for t into the second.

Animal fat and breast cancer

Experiments on laboratory animals have shown that a high content of animal fat in the diet increases the risk of breast cancer. Is this dependence observed in people? K. Carroll collected data on animal fat consumption and breast cancer mortality for 39 countries. The result is shown in Fig. 8.12A. The correlation coefficient between animal fat consumption and breast cancer mortality was found to be 0.90. Let us evaluate the statistical significance of the correlation.

t = \frac{0.90}{\sqrt{\dfrac{1-0.90^2}{39-2}}} \approx 12.6.

The critical value of t for ν = 39 - 2 = 37 degrees of freedom at the 0.001 significance level is 3.574, which is less than the value we obtained. Thus, at a significance level of 0.001, it can be stated that there is a correlation between the consumption of animal fats and mortality from breast cancer.

Now let's check whether mortality is associated with the consumption of vegetable fats. The corresponding data are shown in Fig. 8.12B. The correlation coefficient is 0.15. Then

t = \frac{0.15}{\sqrt{\dfrac{1-0.15^2}{39-2}}} \approx 0.9.

Even at a significance level of 0.10, the calculated t value is less than the critical value. The correlation is not statistically significant.
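The same arithmetic in code (our sketch; the helper is redefined here so the snippet is self-contained):

```python
import math

# Reproducing both t values from Carroll's data (n = 39 countries).
def t_for_correlation(r, n):
    return r / math.sqrt((1 - r ** 2) / (n - 2))

print(round(t_for_correlation(0.90, 39), 2))  # ~12.56, above the critical 3.574
print(round(t_for_correlation(0.15, 39), 2))  # ~0.92, not significant even at 0.10
```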