Spearman correlation calculations. Spearman and Kendall rank correlation coefficients

Spearman rank correlation (rank correlation). Spearman's rank correlation is the simplest way to determine the degree of relationship between factors. The name of the method indicates that the relationship is determined between ranks, that is, series of obtained quantitative values ranked in descending or ascending order. Bear in mind that, first, rank correlation is not recommended when there are fewer than four or more than twenty pairs; second, rank correlation also makes it possible to determine a relationship when the values are semi-quantitative in nature, that is, they have no numerical expression but reflect a clear order of occurrence; third, rank correlation is advisable in cases where approximate data are sufficient. An example of calculating the rank correlation coefficient: two questionnaires measure similar personal qualities X and Y of the subjects. Using the two questionnaires (X and Y), which require alternative "yes" or "no" answers, the primary results were obtained: the answers of 15 subjects (N = 15). The results were presented as the sum of affirmative answers separately for questionnaire X and for questionnaire Y. These results are summarized in Table 5.19.

Table 5.19. Tabulation of primary results for calculating the Spearman rank correlation coefficient (ρ)

Analysis of the summary correlation matrix. The method of correlation pleiades (galaxies).

Example. Table 6.18 shows the intercorrelations of eleven variables that are tested using the Wechsler method. Data were obtained from a homogeneous sample aged 18 to 25 years (n = 800).

Before stratification, it is advisable to rank the correlation matrix. To do this, the average value of the correlation coefficients of each variable with all the others is calculated in the original matrix.
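As a sketch of this averaging-and-ranking step (using a small hypothetical 4-variable matrix rather than the full 11-variable matrix of Table 6.20), the mean correlations and the resulting ranks could be computed like this:

```python
# Hypothetical symmetric correlation matrix for 4 variables
# (in practice the full 11-variable matrix from Table 6.20 would be used)
R = [
    [1.000, 0.637, 0.488, 0.623],
    [0.637, 1.000, 0.810, 0.557],
    [0.488, 0.810, 1.000, 0.346],
    [0.623, 0.557, 0.346, 1.000],
]

n = len(R)
# Mean correlation of each variable with all the others (diagonal excluded)
mean_r = [(sum(row) - 1.0) / (n - 1) for row in R]

# Rank 1 goes to the variable with the largest mean correlation
order = sorted(range(n), key=lambda i: -mean_r[i])
rank = [0] * n
for pos, i in enumerate(order, start=1):
    rank[i] = pos

for i in range(n):
    print(f"variable {i + 1}: M(rij) = {mean_r[i]:.3f}, rank = {rank[i]}")
```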

Then, using Table 5.20, the admissible levels of stratification of the correlation matrix are determined for a given confidence probability of 0.95 and sample size n.

Table 6.20. Ascending correlation matrix

Variables   1      2      3      4      5      6      7      8      9      10     11     M(rij)  Rank
1           1      0.637  0.488  0.623  0.282  0.647  0.371  0.485  0.371  0.365  0.336  0.454   1
2                  1      0.810  0.557  0.291  0.508  0.173  0.486  0.371  0.273  0.273  0.363   4
3                         1      0.346  0.291  0.406  0.360  0.818  0.346  0.291  0.282  0.336   7
4                                1      0.273  0.572  0.318  0.442  0.310  0.318  0.291  0.414   3
5                                       1      0.354  0.254  0.216  0.236  0.207  0.149  0.264   11
6                                              1      0.365  0.405  0.336  0.345  0.282  0.430   2
7                                                     1      0.310  0.388  0.264  0.266  0.310   9
8                                                            1      0.897  0.363  0.388  0.363   5
9                                                                   1      0.388  0.430  0.846   6
10                                                                         1      0.336  0.310   8
11                                                                                1      0.300   10

Designations: 1 - general awareness; 2 - conceptuality; 3 - attentiveness; 4 - capacity for generalization; 5 - direct memorization (of numbers); 6 - level of mastery of the native language; 7 - speed of mastering sensorimotor skills (symbol coding); 8 - observation; 9 - combinatorial abilities (analysis and synthesis); 10 - ability to organize parts into a meaningful whole; 11 - ability for heuristic synthesis; M(rij) - the average value of the correlation coefficients of a variable with the other variables (in our case n = 800); r(0) - the value of the zero "dissecting" plane, i.e., the minimum significant absolute value of the correlation coefficient (n = 120, r(0) = 0.236; n = 40, r(0) = 0.407); |Δr| - permissible stratification step (n = 40, |Δr| = 0.558); s - permissible number of stratification levels (n = 40, s = 1; n = 120, s = 2); r(1), r(2), ..., r(s) - absolute values of the cutting planes (n = 40, r(1) = 0.965).

For n = 800, we find the value of r(0) and the boundaries r(i), after which we stratify the correlation matrix, highlighting correlation pleiades within the layers, or stratify separate parts of the correlation matrix, drawing the associations of correlation pleiades for the overlying layers (Fig. 5.5).

A meaningful analysis of the resulting pleiades goes beyond the limits of mathematical statistics. Two formal indicators, however, help with the meaningful interpretation of the pleiades. One is the degree of a vertex, that is, the number of edges adjacent to it. The variable with the largest number of edges is the "core" of the pleiad and can be considered an indicator of the remaining variables of that pleiad. The other is connection density: a variable may have fewer but closer connections in one pleiad, and more but less close connections in another.

Predictions and estimates. The equation y = b1x + b0 is called the general equation of a straight line. It indicates that pairs of points (x, y) which

Fig. 5.5. Correlation pleiades obtained by stratifying the matrix

lie on a certain line are connected in such a way that for any value of x, the value of y paired with it can be found by multiplying x by a certain number b1 and then adding the number b0 to this product.

The regression coefficient allows you to determine the degree of change in the dependent factor when the causal factor changes by one unit. Its absolute value characterizes the relationship between the variable factors in their absolute values. The regression coefficient b1 is calculated using the formula:

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², with the intercept b0 = ȳ − b1x̄.
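A minimal sketch of estimating b1 and b0 by least squares; the data pairs below are illustrative, not taken from the text:

```python
# Least-squares estimates of the regression line y = b1*x + b0.
# The data pairs here are illustrative.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# b1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
b0 = mean_y - b1 * mean_x  # the line passes through the point (x̄, ȳ)

print(f"y = {b1:.3f}*x + {b0:.3f}")
```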

Design and analysis of experiments. Design and analysis of experiments is the third important branch of statistical methods developed to find and test causal relationships between variables.

To study multifactorial dependencies, methods of mathematical experimental design have recently been increasingly used.

The ability to simultaneously vary all factors allows you to: a) reduce the number of experiments;

b) reduce experimental error to a minimum;

c) simplify the processing of received data;

d) ensure clarity and ease of comparison of results.

Each factor can take on a certain number of different values, which are called levels and are denoted −1, 0 and +1. A fixed set of factor levels determines the conditions of one of the possible experiments.

The total number of all possible combinations is calculated using the formula: N = p^k, where p is the number of levels and k is the number of factors.

A complete factorial experiment is an experiment in which all possible combinations of factor levels are implemented. Full factorial experiments can have the property of orthogonality. With orthogonal planning, the factors in the experiment are uncorrelated; the regression coefficients that are ultimately calculated are determined independently of each other.

An important advantage of the method of mathematical experimental planning is its versatility and suitability in many areas of research.

Let's consider an example of comparing the influence of some factors on the formation of the level of mental stress in color TV controllers.

The experiment is based on an orthogonal 2³ design (three factors varied at two levels each).

The experiment was carried out with a complete 2³ plan with three repetitions.

Orthogonal planning is based on the construction of a regression equation. For three factors it has the form:

y = b0 + b1x1 + b2x2 + b3x3 + b12x1x2 + b13x1x3 + b23x2x3 + b123x1x2x3.

Processing of the results in this example includes:

a) construction of a calculation table for the orthogonal 2³ plan;

b) calculation of regression coefficients;

c) checking their significance;

d) interpretation of the obtained data.

To estimate the regression coefficients of the above equation and assess their significance, it was necessary to run N = 2³ = 8 variants, with the number of repetitions K equal to 3.

The matrix for planning the experiment looked like this:
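The planning matrix itself did not survive in the text; as a sketch, a standard 2³ full factorial matrix in coded levels (−1, +1) can be generated as follows, together with a check of the column orthogonality that makes the coefficients independently estimable:

```python
from itertools import product

# All 2^3 = 8 combinations of three factors at two coded levels (-1, +1)
design = [list(levels) for levels in product((-1, 1), repeat=3)]

for row in design:
    print(row)

# Orthogonality check: every pair of factor columns has zero dot product,
# so the regression coefficients can be estimated independently of each other.
k = 3
for i in range(k):
    for j in range(i + 1, k):
        dot = sum(row[i] * row[j] for row in design)
        assert dot == 0
```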

In cases where the characteristics under study are measured on an ordinal scale, or the form of the relationship differs from linear, the relationship between two random variables is studied using rank correlation coefficients. Consider the Spearman rank correlation coefficient. To calculate it, the sample variants must be ranked (ordered). Ranking is the grouping of experimental data in a certain order, either ascending or descending.

The ranking operation is carried out according to the following algorithm:

1. The lower value is assigned the lower rank: the smallest value is assigned rank 1, and the highest value is assigned a rank equal to the number of ranked values. For example, if n = 7, then the largest value receives rank 7, except in the cases provided for by the second rule.

2. If several values are equal, they are assigned a rank that is the average of the ranks they would have received if they were not equal. As an example, consider an ascending-ordered sample of 7 elements: 22, 23, 25, 25, 25, 28, 30. The values 22 and 23 occur once each, so their ranks are R22 = 1 and R23 = 2, respectively. The value 25 occurs 3 times. If these values were not repeated, their ranks would be 3, 4 and 5; therefore, their rank R25 equals the arithmetic mean of 3, 4 and 5: (3 + 4 + 5)/3 = 4. The values 28 and 30 are not repeated, so their ranks are R28 = 6 and R30 = 7, respectively. Finally, we have the following correspondence:

value: 22 23 25 25 25 28 30
rank:   1  2  4  4  4  6  7

3. The total sum of the ranks must coincide with the calculated sum, which is determined by the formula:

ΣRi = n(n + 1)/2,

where n is the total number of ranked values.

A discrepancy between the actual and calculated rank sums will indicate an error made when calculating ranks or summing them up. In this case, you need to find and fix the error.
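The ranking algorithm above, together with the control sum n(n + 1)/2, can be sketched in Python (average ranks for tied values):

```python
def rank_with_ties(values):
    """Assign ascending ranks; tied values share the average of their ranks."""
    n = len(values)
    # Indices of the elements in ascending order of value
    indexed = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of equal values starting at position i
        j = i
        while j + 1 < n and values[indexed[j + 1]] == values[indexed[i]]:
            j += 1
        avg_rank = (i + 1 + j + 1) / 2.0  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[indexed[k]] = avg_rank
        i = j + 1
    return ranks

sample = [22, 23, 25, 25, 25, 28, 30]
ranks = rank_with_ties(sample)
print(ranks)  # [1.0, 2.0, 4.0, 4.0, 4.0, 6.0, 7.0]

# Control: the sum of the ranks must equal n(n + 1)/2
n = len(sample)
assert sum(ranks) == n * (n + 1) / 2
```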

Spearman's rank correlation coefficient is a method that allows one to determine the strength and direction of the relationship between two traits or two hierarchies of traits. The use of the rank correlation coefficient has a number of limitations:

  • a) The assumed correlation dependence must be monotonic.
  • b) The volume of each sample must be greater than or equal to 5. To determine the upper limit of the sample, use tables of critical values ​​(Table 3 of the Appendix). The maximum value of n in the table is 40.
  • c) During the analysis, a large number of identical (tied) ranks may arise; in this case a correction for ties must be introduced. The most favorable case is when both samples under study represent two sequences of non-coinciding values.

To conduct a correlation analysis, the researcher must have two samples that can be ranked, for example:

  • - two characteristics measured in the same group of subjects;
  • - two individual hierarchies of traits identified in two subjects using the same set of traits;
  • - two group hierarchies of characteristics;
  • - individual and group hierarchies of characteristics.

We begin the calculation by ranking the studied indicators separately for each of the characteristics.

Let us analyze a case with two signs measured in the same group of subjects. First, the individual values ​​obtained by different subjects are ranked according to the first characteristic, and then the individual values ​​are ranked according to the second characteristic. If lower ranks of one indicator correspond to lower ranks of another indicator, and higher ranks of one indicator correspond to greater ranks of another indicator, then the two characteristics are positively related. If higher ranks of one indicator correspond to lower ranks of another indicator, then the two characteristics are negatively related. To find rs, we determine the differences between the ranks (d) for each subject. The smaller the difference between the ranks, the closer the rank correlation coefficient rs will be to “+1”. If there is no relationship, then there will be no correspondence between them, hence rs will be close to zero. The greater the difference between the ranks of subjects on two variables, the closer to “-1” the value of the rs coefficient will be. Thus, the Spearman rank correlation coefficient is a measure of any monotonic relationship between the two characteristics under study.

Let us consider the case with two individual hierarchies of traits identified in two subjects using the same set of traits. In this situation, the individual values ​​obtained by each of the two subjects are ranked according to a certain set of characteristics. The feature with the lowest value must be assigned the first rank; the characteristic with a higher value is the second rank, etc. Particular care should be taken to ensure that all attributes are measured in the same units. For example, it is impossible to rank indicators if they are expressed in different “price” points, since it is impossible to determine which of the factors will take first place in terms of severity until all values ​​are brought to a single scale. If features that have low ranks in one of the subjects also have low ranks in another, and vice versa, then the individual hierarchies are positively related.

In the case of two group hierarchies of characteristics, the average group values ​​obtained in two groups of subjects are ranked according to the same set of characteristics for the studied groups. Next, we follow the algorithm given in previous cases.

Let us analyze a case with an individual and group hierarchy of characteristics. They begin by ranking separately the individual values ​​of the subject and the average group values ​​according to the same set of characteristics that were obtained, excluding the subject who does not participate in the average group hierarchy, since his individual hierarchy will be compared with it. Rank correlation allows us to assess the degree of consistency of the individual and group hierarchy of traits.

Let us consider how the significance of the correlation coefficient is determined in the cases listed above. In the case of two characteristics, it will be determined by the sample size. In the case of two individual feature hierarchies, the significance depends on the number of features included in the hierarchy. In the last two cases, significance is determined by the number of characteristics being studied, and not by the number of groups. Thus, the significance of rs in all cases is determined by the number of ranked values ​​n.

When checking the statistical significance of rs, tables of critical values ​​of the rank correlation coefficient are used, compiled for different numbers of ranked values ​​and different levels of significance. If the absolute value of rs reaches or exceeds a critical value, then the correlation is reliable.

When considering the first option (a case with two signs measured in the same group of subjects), the following hypotheses are possible.

H0: The correlation between variables x and y is not different from zero.

H1: The correlation between variables x and y is significantly different from zero.

If we work with any of the three remaining cases, then it is necessary to put forward another pair of hypotheses:

H0: The correlation between hierarchies x and y is not different from zero.

H1: The correlation between hierarchies x and y is significantly different from zero.

The sequence of actions when calculating the Spearman rank correlation coefficient rs is as follows.

  • - Determine which two features or two hierarchies of features will participate in the comparison as variables x and y.
  • - Rank the values ​​of the variable x, assigning rank 1 to the smallest value, in accordance with the ranking rules. Place the ranks in the first column of the table in order of test subjects or characteristics.
  • - Rank the values ​​of the variable y. Place the ranks in the second column of the table in order of test subjects or characteristics.
  • - Calculate the differences d between the ranks x and y for each row of the table. Place the results in the next column of the table.
  • - Calculate the squared differences (d2). Place the resulting values ​​in the fourth column of the table.
  • - Calculate the sum of the squared differences Σd².
  • - If identical ranks occur, calculate the corrections for ties:

Tx = Σ(tx³ − tx)/12; Ty = Σ(ty³ − ty)/12,

where tx is the volume of each group of identical ranks in sample x;

ty is the volume of each group of identical ranks in sample y.

Calculate the rank correlation coefficient depending on the presence or absence of identical ranks. If there are no identical ranks, calculate the rank correlation coefficient rs using the formula:

rs = 1 − 6Σd² / (n(n² − 1)).

If there are identical ranks, calculate the rank correlation coefficient rs using the formula:

rs = 1 − 6(Σd² + Tx + Ty) / (n(n² − 1)),

where Σd² is the sum of squared differences between ranks;

Tx and Ty - corrections for equal ranks;

n is the number of subjects or features participating in the ranking.

Determine the critical values ​​of rs from Appendix Table 3 for a given number of subjects n. A significant difference from zero of the correlation coefficient will be observed provided that rs is not less than the critical value.
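The whole sequence above can be sketched end to end. The paired observations below are illustrative and contain no tied ranks, so the simple formula rs = 1 − 6Σd²/(n(n² − 1)) applies:

```python
def simple_ranks(values):
    """Ranks for data without ties: rank 1 goes to the smallest value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

# Illustrative paired observations (x and y measured on the same subjects)
x = [12, 18, 25, 30, 41, 47]
y = [3, 5, 4, 8, 7, 9]

rx = simple_ranks(x)
ry = simple_ranks(y)

# Differences between ranks and their squared sum
n = len(x)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))

# Spearman coefficient, no-ties formula
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(f"sum d^2 = {sum_d2}, rs = {rs:.3f}")
```

The resulting rs would then be compared against the tabulated critical value for n = 6, as described above.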

A student of psychology (sociology, management, etc.) is often interested in how two or more variables are related to each other in one or more of the groups being studied.

In mathematics, to describe the relationships between variable quantities, the concept of a function F is used, which associates each specific value of the independent variable X with a specific value of the dependent variable Y. The resulting dependence is denoted as Y=F(X).

At the same time, the types of correlations between the measured characteristics can be different: for example, the correlation can be linear and nonlinear, positive and negative. It is linear - if with an increase or decrease in one variable X, the second variable Y, on average, either also increases or decreases. It is nonlinear if, with an increase in one quantity, the nature of the change in the second is not linear, but is described by other laws.

The correlation will be positive if, with an increase in the variable X, the variable Y on average also increases, and if, with an increase in X, the variable Y tends to decrease on average, then we speak of the presence of a negative correlation. It is possible that it is impossible to establish any relationship between variables. In this case, they say there is no correlation.

The task of correlation analysis comes down to establishing the direction (positive or negative) and form (linear, nonlinear) of the relationship between varying characteristics, measuring its closeness, and, finally, checking the level of significance of the obtained correlation coefficients.

The rank correlation coefficient, proposed by Charles Spearman, is a nonparametric measure of the relationship between variables measured on a rank scale. When calculating this coefficient, no assumptions about the nature of the distributions of the characteristics in the population are required. This coefficient determines the degree of closeness of the connection between ordinal characteristics, which in this case represent the ranks of the compared quantities.

Spearman's rank correlation coefficient is calculated using the formula:

rs = 1 − 6ΣD² / (n(n² − 1)),

where n is the number of ranked features (indicators, subjects);
D is the difference between the ranks for two variables for each subject;
ΣD² is the sum of the squared rank differences.

The critical values ​​of the Spearman rank correlation coefficient are presented below:

The value of Spearman's rank correlation coefficient lies between −1 and +1. It can be positive or negative, characterizing the direction of the relationship between two traits measured on a rank scale.

If the correlation coefficient in absolute value is close to 1, then this corresponds to a high level of connection between the variables. So, in particular, when a variable is correlated with itself, the value of the correlation coefficient will be equal to +1. Such a relationship characterizes a directly proportional dependence. If the values ​​of the X variable are arranged in ascending order, and the same values ​​(now designated as the Y variable) are arranged in descending order, then in this case the correlation between the X and Y variables will be exactly -1. This value of the correlation coefficient characterizes an inversely proportional relationship.

The sign of the correlation coefficient is very important for interpreting the resulting relationship. If the sign of the linear correlation coefficient is plus, then the relationship between the correlating features is such that a larger value of one feature (variable) corresponds to a larger value of another feature (another variable). In other words, if one indicator (variable) increases, then the other indicator (variable) increases accordingly. This dependence is called a directly proportional dependence.

If a minus sign is received, then a larger value of one characteristic corresponds to a smaller value of another. In other words, if there is a minus sign, an increase in one variable (sign, value) corresponds to a decrease in another variable. This dependence is called inversely proportional dependence. In this case, the choice of the variable to which the character (tendency) of increase is assigned is arbitrary. It can be either variable X or variable Y. However, if variable X is considered to increase, then variable Y will correspondingly decrease, and vice versa.

Let's look at the example of Spearman correlation.

A psychologist investigates how the individual school-readiness indicators of 11 first-graders, obtained before the start of school, are related to their average academic performance at the end of the school year.

To solve this problem, we ranked, firstly, the values ​​of indicators of school readiness obtained upon admission to school, and, secondly, the final indicators of academic performance at the end of the year for these same students on average. We present the results in the table:

We substitute the obtained data into the above formula and perform the calculation. We get:

To find the level of significance, we refer to the table “Critical values ​​of the Spearman rank correlation coefficient,” which shows the critical values ​​for the rank correlation coefficients.

We construct the corresponding “axis of significance”:

The resulting correlation coefficient coincided with the critical value for the 1% significance level. Consequently, it can be argued that the indicators of school readiness and the final grades of first-graders are connected by a positive correlation: the higher the school-readiness indicator, the better the first-grader studies. In terms of statistical hypotheses, the psychologist must reject the null hypothesis (H0) and accept the alternative (H1), that the relationship between indicators of school readiness and average academic performance differs from zero.

Spearman rank correlation coefficient (online calculator)

The calculator below calculates the Spearman rank correlation coefficient between two random variables. The theoretical part, in order not to be distracted from the calculator, is traditionally placed under it.


The method for calculating the Spearman rank correlation coefficient is actually very simple: it is the same Pearson correlation coefficient, only calculated not from the measured values of the random variables themselves, but from their rank values.

That is, rs(X, Y) = r(rank(X), rank(Y)), where r is the Pearson correlation coefficient and rank(·) replaces each value by its rank.

All that remains is to figure out what rank values ​​are and why all this is needed.

If the elements of a variation series are arranged in ascending or descending order, then the rank of an element is its number in that ordered series.

For example, let us have a variation series (17,26,5,14,21). Let's sort its elements in descending order (26,21,17,14,5). 26 has rank 1, 21 has rank 2, etc. The variation series of rank values ​​will look like this (3,1,5,4,2).

That is, when calculating the Spearman coefficient, the original variation series are transformed into variation series of rank values, after which the Pearson formula is applied to them.

There is one subtlety: the rank of repeated values is taken as the average of their ranks. That is, for the series (17, 15, 14, 15) the series of rank values will look like (1, 2.5, 4, 2.5), since the first element equal to 15 has rank 2 and the second has rank 3, and (2 + 3)/2 = 2.5.

If there are no repeated values, that is, all values of the rank series are numbers from 1 to n, the Pearson formula can be simplified to

rs = 1 − 6Σd² / (n(n² − 1)),

where d is the difference between the ranks of paired values.

Incidentally, this is the formula most often given for calculating the Spearman coefficient.
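The equivalence of the two routes, Pearson applied to ranks versus the simplified formula, can be checked directly on a small illustrative data set without ties:

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Ascending ranks; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

# Illustrative data without ties
x = [17, 26, 5, 14, 21]
y = [40, 55, 10, 50, 30]

rx, ry = ranks(x), ranks(y)

# Pearson applied to the rank values...
rs_pearson = pearson(rx, ry)

# ...equals the simplified formula 1 - 6*sum(d^2) / (n*(n^2 - 1))
n = len(x)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rs_formula = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(rs_pearson, rs_formula)
```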

What is the essence of the transition from the values ​​themselves to their rank values?
The point is that by studying the correlation of rank values, you can determine how well the dependence of two variables is described by a monotonic function.

The sign of the coefficient indicates the direction of the relationship between the variables. If the sign is positive, then Y values ​​tend to increase as X values ​​increase; if the sign is negative, then the Y values ​​tend to decrease as the X values ​​increase. If the coefficient is 0, then there is no trend. If the coefficient is 1 or -1, then the relationship between X and Y has the form of a monotonic function - that is, as X increases, Y also increases, or vice versa, as X increases, Y decreases.

That is, unlike the Pearson correlation coefficient, which can only reveal a linear dependence of one variable on another, the Spearman correlation coefficient can reveal a monotonic dependence where a direct linear relationship is not detected.

Let me explain with an example. Let's assume that we are examining the function y=10/x.
We have the following X and Y measurements
{{1,10}, {5,2}, {10,1}, {20,0.5}, {100,0.1}}
For these data, the Pearson correlation coefficient is -0.4686, that is, the relationship is weak or absent. But the Spearman correlation coefficient is strictly equal to -1, which seems to hint to the researcher that Y has a strict negative monotonic dependence on X.
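Both figures from this y = 10/x example can be reproduced with a short sketch (Pearson on the raw values, then Pearson on the ranks, which is exactly the Spearman coefficient):

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Ascending ranks; no ties in this example."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

# Measurements of y = 10/x from the example
x = [1, 5, 10, 20, 100]
y = [10, 2, 1, 0.5, 0.1]

r_pearson = pearson(x, y)               # weak linear relationship
r_spearman = pearson(ranks(x), ranks(y))  # strict monotonic decrease

print(f"Pearson:  {r_pearson:.4f}")
print(f"Spearman: {r_spearman:.4f}")
```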