Correlation analysis using the Spearman method. Rank correlation and Spearman's rank correlation coefficient

When the characteristics under study are measured on an ordinal scale, or when the form of the relationship is not linear, the relationship between two random variables is studied using rank correlation coefficients. Let us consider Spearman's rank correlation coefficient. To calculate it, the sample values must first be ranked (ordered). Ranking is the arrangement of experimental data in a definite order, either ascending or descending.

The ranking operation is carried out according to the following algorithm:

1. A lower value is assigned a lower rank. The smallest value is assigned rank 1, and the highest value is assigned a rank equal to the number of ranked values. For example, if n = 7, the highest value receives rank 7, except as provided in the second rule.

2. If several values are equal, each of them is assigned the average of the ranks they would have received had they not been equal. As an example, consider a sample of 7 elements in ascending order: 22, 23, 25, 25, 25, 28, 30. The values 22 and 23 each appear once, so their ranks are R22 = 1 and R23 = 2. The value 25 appears 3 times. If these values were distinct, their ranks would be 3, 4 and 5. Therefore the rank R25 is the arithmetic mean of 3, 4 and 5: R25 = (3 + 4 + 5) / 3 = 4. The values 28 and 30 are not repeated, so their ranks are R28 = 6 and R30 = 7. Finally we have the following correspondence:

Value	22	23	25	25	25	28	30
Rank	1	2	4	4	4	6	7

3. The total sum of the ranks must coincide with the calculated one, which is determined by the formula:

$$\sum R_i = \frac{n(n + 1)}{2}$$

where n is the total number of ranked values.

A discrepancy between the actual and calculated rank sums will indicate an error made when calculating ranks or summing them up. In this case, you need to find and fix the error.
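
The ranking rules above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation, and the function name `average_ranks` is hypothetical. It assigns rank 1 to the smallest value, gives tied values the mean of the ranks they would otherwise occupy, and then compares the rank sum with n(n + 1)/2 for the sample 22, 23, 25, 25, 25, 28, 30 from the second rule.

```python
def average_ranks(values):
    """Return ranks (1 = smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Sorted positions pos..end (0-based) would occupy ranks pos+1..end+1.
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


sample = [22, 23, 25, 25, 25, 28, 30]
ranks = average_ranks(sample)
n = len(sample)
expected_sum = n * (n + 1) / 2  # calculated rank sum from rule 3

print(ranks)                     # [1.0, 2.0, 4.0, 4.0, 4.0, 6.0, 7.0]
print(sum(ranks), expected_sum)  # 28.0 28.0
```

If the two printed sums differ, a mistake was made in assigning or adding the ranks.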

Spearman's rank correlation coefficient is a method that allows one to determine the strength and direction of the relationship between two traits or two hierarchies of traits. The use of the rank correlation coefficient has a number of limitations:

a) The assumed correlation dependence must be monotonic.
b) The size of each sample must be greater than or equal to 5. The upper limit of the sample size is set by the table of critical values (Appendix Table 3): the maximum value of n in the table is 40.
c) A large number of identical ranks may arise during the analysis. In this case a correction must be made. The most favorable case is when both samples under study are sequences of non-repeating values.

To conduct a correlation analysis, the researcher must have two samples that can be ranked, for example:

- two characteristics measured in the same group of subjects;
- two individual hierarchies of traits identified in two subjects using the same set of traits;
- two group hierarchies of characteristics;
- individual and group hierarchies of characteristics.

We begin the calculation by ranking the studied indicators separately for each of the characteristics.

Let us analyze the case of two characteristics measured in the same group of subjects. First the individual values obtained by the different subjects for the first characteristic are ranked, and then the individual values for the second characteristic. If low ranks of one indicator correspond to low ranks of the other, and high ranks of one correspond to high ranks of the other, the two characteristics are positively related. If high ranks of one indicator correspond to low ranks of the other, the two characteristics are negatively related. To find rs, we determine the difference between the ranks (d) for each subject. The smaller the differences between the ranks, the closer the rank correlation coefficient rs is to +1. If there is no relationship, there is no correspondence between the ranks, and rs is close to zero. The greater the differences between the subjects' ranks on the two variables, the closer rs is to -1. Thus, the Spearman rank correlation coefficient is a measure of any monotonic relationship between the two characteristics under study.

Let us consider the case of two individual hierarchies of traits identified in two subjects using the same set of traits. In this situation, the individual values obtained by each of the two subjects on a given set of characteristics are ranked. The characteristic with the lowest value is assigned the first rank, the characteristic with the next higher value the second rank, and so on. Special attention should be paid to ensuring that all characteristics are measured in the same units. For example, indicators cannot be ranked if they are expressed in points of different “weight”, since it is impossible to determine which factor takes first place in terms of severity until all values are brought to a single scale. If characteristics that have low ranks in one subject also have low ranks in the other, and vice versa, the individual hierarchies are positively related.

In the case of two group hierarchies of characteristics, the average group values obtained in two groups of subjects are ranked according to the same set of characteristics for the studied groups. Next, we follow the algorithm given in previous cases.

Let us analyze the case of an individual and a group hierarchy of characteristics. We begin by ranking separately the subject's individual values and the group mean values on the same set of characteristics. The group means are computed without the subject in question: he does not participate in the group hierarchy, since his individual hierarchy will be compared with it. Rank correlation allows us to assess the degree of consistency between the individual and group hierarchies of traits.

Let us consider how the significance of the correlation coefficient is determined in the cases listed above. In the case of two characteristics, it is determined by the sample size. In the case of two individual hierarchies of characteristics, the significance depends on the number of characteristics included in the hierarchy. In the last two cases, significance is determined by the number of characteristics studied, not by the number of groups. Thus, the significance of rs in all cases is determined by the number of ranked values n.

When checking the statistical significance of rs, tables of critical values of the rank correlation coefficient are used, compiled for various numbers of ranked values and various significance levels. If the absolute value of rs reaches or exceeds the critical value, the correlation is significant.

When considering the first option (the case of two characteristics measured in the same group of subjects), the following hypotheses are possible.

H0: The correlation between variables x and y is not different from zero.

H1: The correlation between variables x and y is significantly different from zero.

If we work with any of the three remaining cases, then it is necessary to put forward another pair of hypotheses:

H0: The correlation between hierarchies x and y is not different from zero.

H1: The correlation between hierarchies x and y is significantly different from zero.

The sequence of actions when calculating the Spearman rank correlation coefficient rs is as follows.

- Determine which two features or two hierarchies of features will participate in the comparison as variables x and y.
- Rank the values of the variable x in accordance with the ranking rules, assigning rank 1 to the lowest value. Place the ranks in the first column of the table in order of test subjects or characteristics.
- Rank the values of the variable y. Place the ranks in the second column of the table in order of test subjects or characteristics.
- Calculate the differences d between the ranks x and y for each row of the table. Place the results in the next column of the table.
- Calculate the squared differences (d²). Place the resulting values in the fourth column of the table.
- Calculate the sum of the squared differences, Σd².
- If identical ranks occur, calculate the corrections:

$$T_x = \frac{\sum (t_x^3 - t_x)}{12}, \qquad T_y = \frac{\sum (t_y^3 - t_y)}{12}$$

where tx is the size of each group of identical ranks in sample x;

ty is the size of each group of identical ranks in sample y.

Calculate the rank correlation coefficient depending on the presence or absence of identical ranks. If there are no identical ranks, calculate the rank correlation coefficient rs using the formula:

$$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$

If there are identical ranks, calculate the rank correlation coefficient rs using the formula:

$$r_s = 1 - \frac{6 \left( \sum d^2 + T_x + T_y \right)}{n(n^2 - 1)}$$

where Σd² is the sum of squared differences between ranks;

Tx and Ty are the corrections for equal ranks;

n is the number of subjects or features participating in the ranking.

Determine the critical values of rs from Appendix Table 3 for the given number of subjects n. The correlation coefficient differs significantly from zero if rs is not less than the critical value.
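
The calculation sequence can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the corrections are taken in the usual form T = Σ(t³ - t)/12, the coefficient as rs = 1 - 6(Σd² + Tx + Ty) / (n(n² - 1)), and the function names and the small data set are hypothetical. When there are no ties both corrections are zero, so the same function covers both formulas.

```python
from collections import Counter


def average_ranks(values):
    """Ranks with 1 = smallest; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end + 2) / 2
        pos = end + 1
    return ranks


def tie_correction(ranks):
    """T = sum of (t**3 - t) / 12 over every group of t identical ranks."""
    return sum((t ** 3 - t) / 12 for t in Counter(ranks).values())


def spearman_rs(x, y):
    """Spearman's rs with the corrections Tx and Ty for identical ranks."""
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    t_x, t_y = tie_correction(rank_x), tie_correction(rank_y)
    return 1 - 6 * (sum_d2 + t_x + t_y) / (n * (n ** 2 - 1))


# Hypothetical data: x contains one group of two identical values.
x = [10, 20, 20, 30, 40]
y = [1, 2, 3, 4, 5]
rs = spearman_rs(x, y)
print(round(rs, 3))  # 0.95
```

The resulting value is then compared with the tabulated critical value for the same n, as described above.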

Pearson correlation coefficient

Pearson's r coefficient is used to study the relationship between two metric variables measured on the same sample. There are many situations in which its use is appropriate. Does intelligence affect academic performance in the senior years of university? Is the size of an employee's salary related to his friendliness towards colleagues? Does a student's mood affect success in solving a complex arithmetic problem? To answer such questions, the researcher must measure the two indicators of interest for each member of the sample.

The value of the correlation coefficient is not affected by the units of measurement in which the characteristics are presented. Consequently, any linear transformations of features (multiplying by a constant, adding a constant) do not change the value of the correlation coefficient. An exception is the multiplication of one of the signs by a negative constant: the correlation coefficient changes its sign to the opposite.
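
This invariance is easy to check numerically. The sketch below uses a hypothetical helper `pearson_r` and made-up data; it rescales and shifts x and then multiplies it by a negative constant.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical measurements.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r_original = pearson_r(x, y)
r_rescaled = pearson_r([2 * a + 3 for a in x], y)  # multiply by 2, add 3
r_negated = pearson_r([-a for a in x], y)          # multiply by -1

print(round(r_original, 3), round(r_rescaled, 3), round(r_negated, 3))
# 0.8 0.8 -0.8
```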

Application of Spearman and Pearson correlation.

Pearson correlation is a measure of the linear relationship between two variables. It shows how proportional the variability of the two variables is. If the variables are proportional to each other, the relationship between them can be represented graphically as a straight line with a positive (direct proportion) or negative (inverse proportion) slope.

In practice, the relationship between two variables, if there is one, is probabilistic and looks graphically like an ellipsoidal scatter cloud. This ellipsoid, however, can be represented (approximated) by a straight line, the regression line. The regression line is the straight line constructed by the method of least squares: the sum of the squared distances (measured along the Y axis) from each point on the scatter plot to the line is minimal.

The variance of the estimates of the dependent variable is of particular importance for assessing the accuracy of prediction. Essentially, the variance of the estimates of the dependent variable Y is the portion of its total variance that is due to the influence of the independent variable X. In other words, the ratio of the variance of the estimates of the dependent variable to its true variance is equal to the square of the correlation coefficient.

The square of the correlation coefficient between the dependent and independent variables represents the proportion of variance in the dependent variable that is due to the influence of the independent variable and is called the coefficient of determination. The coefficient of determination thus shows the extent to which the variability of one variable is caused (determined) by the influence of another variable.
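
A small numerical check may make this concrete. The sketch below fits a least-squares line to hypothetical data and compares the ratio of the variance of the fitted values to the variance of Y with the squared correlation coefficient; all names and numbers are illustrative.

```python
from math import sqrt

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
s_xx = sum((a - mean_x) ** 2 for a in x)
s_yy = sum((b - mean_y) ** 2 for b in y)

# Least-squares regression line y_hat = intercept + slope * x.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
y_hat = [intercept + slope * a for a in x]

# The fitted values have the same mean as y, so both sums use mean_y.
explained = sum((v - mean_y) ** 2 for v in y_hat)
ratio = explained / s_yy          # variance of estimates / variance of Y
r = s_xy / sqrt(s_xx * s_yy)      # Pearson correlation coefficient

print(round(ratio, 3), round(r ** 2, 3))  # 0.64 0.64
```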

The determination coefficient has an important advantage over the correlation coefficient. Correlation is not a linear function of the relationship between two variables. Therefore, the arithmetic mean of the correlation coefficients for several samples does not coincide with the correlation calculated immediately for all subjects from these samples (i.e., the correlation coefficient is not additive). On the contrary, the coefficient of determination reflects the relationship linearly and is therefore additive: it can be averaged over several samples.

Additional information about the strength of the connection is given by the squared correlation coefficient, the coefficient of determination: this is the part of the variance of one variable that can be explained by the influence of the other variable. Unlike the correlation coefficient, the coefficient of determination increases linearly with increasing strength of the connection.

Spearman's and Kendall's τ correlation coefficients (rank correlations)

If both variables between which the relationship is being studied are presented on an ordinal scale, or one of them is on an ordinal scale and the other on a metric scale, then rank correlation coefficients are used: Spearman's or Kendall's τ. Both coefficients require a preliminary ranking of both variables.

Spearman's rank correlation coefficient is a non-parametric method used for the statistical study of connections between phenomena. It determines the actual degree of parallelism between two quantitative series of the studied characteristics and expresses the closeness of the established connection as a numerical coefficient.

If the members of a group are ranked first on the x variable and then on the y variable, the correlation between x and y can be obtained simply by calculating the Pearson coefficient for the two series of ranks. Provided there are no tied ranks (i.e., no repeated ranks) for either variable, the Pearson formula can be greatly simplified computationally and converted into what is known as the Spearman formula.
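
The equivalence can be checked on a small example. The sketch below assumes the Spearman formula in its usual no-ties form, rs = 1 - 6Σd² / (n(n² - 1)), and uses hypothetical data without repeated values; the helper names are illustrative.

```python
from math import sqrt


def simple_ranks(values):
    """Ranks 1..n for a sequence without repeated values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical measurements with no ties in either variable.
x = [28, 30, 36, 40, 46]
y = [25, 21, 29, 35, 31]
n = len(x)

rank_x, rank_y = simple_ranks(x), simple_ranks(y)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))

rs_formula = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))  # Spearman formula
rs_pearson = pearson_r(rank_x, rank_y)            # Pearson on the ranks

print(round(rs_formula, 3), round(rs_pearson, 3))  # 0.8 0.8
```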

The power of the Spearman rank correlation coefficient is somewhat inferior to the power of the parametric correlation coefficient.

It is advisable to use the rank correlation coefficient when there are a small number of observations. This method can be used not only for quantitative data, but also in cases where the recorded values are determined by descriptive features of varying intensity.

With a large number of equal ranks in one or both of the compared variables, Spearman's rank correlation coefficient gives coarsened values. Ideally, both correlated series should be sequences of non-repeating values.

An alternative to the Spearman correlation for ranks is Kendall's τ correlation. The correlation proposed by M. Kendall is based on the idea that the direction of the connection can be judged by comparing subjects in pairs: if, for a pair of subjects, the change in x coincides in direction with the change in y, this indicates a positive connection; if it does not coincide, a negative connection.
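
The pairwise idea can be sketched as follows. The text does not give a formula, so this example assumes the simplest variant (often called τ-a): the number of coinciding pairs minus the number of non-coinciding pairs, divided by the total number of pairs, with no ties in the data. The function name and data are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant - discordant) / number of pairs; assumes no tied values."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # Positive product: x and y change in the same direction for this pair.
        direction = (x[j] - x[i]) * (y[j] - y[i])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ranks of five subjects on two variables.
x = [1, 2, 3, 4, 5]
y = [2, 1, 3, 5, 4]
tau = kendall_tau_a(x, y)
print(round(tau, 3))  # 0.6  (8 concordant pairs, 2 discordant, 10 pairs in all)
```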

Correlation coefficients were specifically designed to quantify the strength and direction of the relationship between two properties measured on numerical scales (metric or rank). As already mentioned, the maximum strength of the connection corresponds to the correlation values +1 (strict direct or directly proportional connection) and -1 (strict inverse or inversely proportional connection); the absence of connection corresponds to a correlation equal to zero. Additional information about the strength of the relationship is provided by the coefficient of determination: this is the portion of the variance in one variable that can be explained by the influence of another variable.

9. Parametric methods of data comparison

Parametric comparison methods are used if your variables were measured on a metric scale.

Comparison of the variances of two samples using Fisher's test.

This method tests the hypothesis that the variances of the two general populations from which the compared samples are drawn differ from each other. Limitation of the method: the distribution of the characteristic in both samples should not differ from normal.

An alternative for comparing variances is the Levene test, which does not require checking the normality of the distribution. This method can be used to check the assumption of equality (homogeneity) of variances before testing the significance of differences in means using Student's t-test for independent samples of different sizes.

Spearman's rank correlation coefficient is a quantitative measure used in the statistical study of the relationship between phenomena by nonparametric methods.

The indicator shows how the sum of squared differences between ranks obtained during observation differs from the case of no connection.

Spearman's rank correlation coefficient is one of the indicators for assessing the closeness of a relationship. The qualitative characteristic of the closeness of the relationship for the rank correlation coefficient, as for other correlation coefficients, can be assessed using the Chaddock scale.

Calculation of coefficient consists of the following steps:

Properties of Spearman's rank correlation coefficient

Application area. The rank correlation coefficient is used to assess the closeness of the relationship between two populations. In addition, its statistical significance is used when analyzing data for heteroskedasticity.

Example. Based on a sample of observed variables X and Y:

- create a ranking table;
- find Spearman's rank correlation coefficient and check its significance at level 2α;
- assess the nature of the dependence.

Solution. Let's assign ranks to feature Y and factor X.

X	Y	rank X, d_x	rank Y, d_y
28	21	1	1
30	25	2	2
36	29	4	3
40	31	5	4
30	32	3	5
46	34	6	6
56	35	8	7
54	38	7	8
60	39	10	9
56	41	9	10
60	42	11	11
68	44	12	12
70	46	13	13
76	50	14	14

Rank matrix.

rank X, d_x	rank Y, d_y	(d_x - d_y)²
1	1	0
2	2	0
4	3	1
5	4	1
3	5	4
6	6	0
8	7	1
7	8	1
10	9	1
9	10	1
11	11	0
12	12	0
13	13	0
14	14	0
105	105	10

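Checking the correctness of the matrix based on the checksum calculation:

$$\sum x_{ij} = \frac{(1 + n)\, n}{2} = \frac{(1 + 14) \cdot 14}{2} = 105$$

The sums of the columns of the matrix are equal to each other and to the checksum, which means that the matrix is composed correctly.
Using the formula, we calculate the Spearman rank correlation coefficient:

$$\rho = 1 - \frac{6 \sum d^2}{n^3 - n} = 1 - \frac{6 \cdot 10}{14^3 - 14} \approx 0.978$$

The relationship between trait Y and factor X is strong and direct.
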
Significance of Spearman's rank correlation coefficient
In order to test, at the significance level α, the null hypothesis that the general Spearman rank correlation coefficient is equal to zero against the competing hypothesis H1: ρ ≠ 0, we need to calculate the critical point:

$$T_{kp} = t(\alpha, k) \sqrt{\frac{1 - \rho^2}{n - 2}}$$

where n is the sample size; ρ is the sample Spearman rank correlation coefficient; t(α, k) is the critical point of the two-sided critical region, found from the table of critical points of Student's distribution for the significance level α and the number of degrees of freedom k = n - 2.
If |ρ| < T_kp, there is no reason to reject the null hypothesis: the rank correlation between the qualitative characteristics is not significant. If |ρ| > T_kp, the null hypothesis is rejected: there is a significant rank correlation between the qualitative characteristics.
Using Student's table we find t(α/2, k) = t(0.1/2; 12) = 1.782, so that

$$T_{kp} = 1.782 \sqrt{\frac{1 - 0.978^2}{12}} \approx 0.11$$

Since T_kp < ρ, we reject the hypothesis that Spearman's rank correlation coefficient is equal to 0. In other words, the rank correlation coefficient is statistically significant, and the rank correlation between the scores on the two tests is significant.
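
The arithmetic of this example can be reproduced with a short script. The sketch takes the ranks exactly as they appear in the rank matrix (the table gives tied X values consecutive ranks, and the script does not re-rank them), uses the tabulated value 1.782 quoted in the text, and assumes the critical point in the form T_kp = t(α, k)·sqrt((1 - ρ²)/(n - 2)). Variable names are illustrative.

```python
from math import sqrt

# Ranks copied from the rank matrix of the example.
rank_x = [1, 2, 4, 5, 3, 6, 8, 7, 10, 9, 11, 12, 13, 14]
rank_y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
n = len(rank_x)

# Checksum: each column of ranks must sum to (1 + n) * n / 2.
checksum = (1 + n) * n // 2
assert sum(rank_x) == sum(rank_y) == checksum

sum_d2 = sum((dx - dy) ** 2 for dx, dy in zip(rank_x, rank_y))
rho = 1 - 6 * sum_d2 / (n ** 3 - n)

t_table = 1.782  # Student's table value quoted in the text for k = n - 2 = 12
t_kp = t_table * sqrt((1 - rho ** 2) / (n - 2))

print(checksum, sum_d2)               # 105 10
print(round(rho, 3), round(t_kp, 3))  # 0.978 0.107
print(abs(rho) > t_kp)                # True: the null hypothesis is rejected
```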

Correlation analysis is a method for detecting dependencies between a number of random variables. Its purpose is to assess the strength of the connections between such random variables or between features that characterize certain real processes.

Today we propose to consider how Spearman correlation analysis is used to display the forms of relationship visually in practical trading.

Spearman correlation or basis of correlation analysis

In order to understand what correlation analysis is, you first need to understand the concept of correlation.

At the same time, if the price starts to move in the direction you need, you need to unlock your positions in time.

This strategy, which is based on correlation analysis, is best suited to trading instruments with a high degree of correlation (EUR/USD and GBP/USD, EUR/AUD and EUR/NZD, AUD/USD and NZD/USD, CFD contracts and the like).

A student of psychology (or a sociologist, manager, etc.) is often interested in how two or more variables are related to each other in one or more study groups.

In mathematics, the relationship between variable quantities is described using the concept of a function F, which associates each specific value of the independent variable X with a specific value of the dependent variable Y. The resulting dependence is denoted Y = F(X).

At the same time, the types of correlations between the measured characteristics can be different: for example, the correlation can be linear and nonlinear, positive and negative. It is linear - if with an increase or decrease in one variable X, the second variable Y, on average, either also increases or decreases. It is nonlinear if, with an increase in one quantity, the nature of the change in the second is not linear, but is described by other laws.

The correlation will be positive if, with an increase in the variable X, the variable Y on average also increases, and if, with an increase in X, the variable Y tends to decrease on average, then we speak of the presence of a negative correlation. It is possible that it is impossible to establish any relationship between variables. In this case, they say there is no correlation.

The task of correlation analysis comes down to establishing the direction (positive or negative) and form (linear, nonlinear) of the relationship between varying characteristics, measuring its closeness, and, finally, checking the level of significance of the obtained correlation coefficients.

The rank correlation coefficient proposed by K. Spearman is a nonparametric measure of the relationship between variables measured on a rank scale. Calculating this coefficient requires no assumptions about the nature of the distributions of the characteristics in the population. The coefficient determines the degree of closeness of the connection between ordinal characteristics, which in this case are the ranks of the compared quantities.

Spearman's rank coefficient of linear correlation is calculated using the formula:

$$r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$$

where n is the number of ranked features (indicators, subjects);
D is the difference between the ranks on the two variables for each subject;
ΣD² is the sum of the squared rank differences.

The critical values of the Spearman rank correlation coefficient are presented below:

The value of Spearman's linear correlation coefficient lies between -1 and +1. It can be positive or negative, characterizing the direction of the relationship between two traits measured on a rank scale.

If the absolute value of the correlation coefficient is close to 1, this corresponds to a high level of connection between the variables. In particular, when a variable is correlated with itself, the correlation coefficient equals +1. Such a relationship characterizes a directly proportional dependence. If the values of the X variable are arranged in ascending order and the same values (now designated as the Y variable) are arranged in descending order, the correlation between X and Y is exactly -1. This value of the correlation coefficient characterizes an inversely proportional relationship.

The sign of the correlation coefficient is very important for interpreting the resulting relationship. If the sign of the linear correlation coefficient is plus, then a larger value of one characteristic (variable) corresponds to a larger value of the other characteristic (variable). In other words, if one indicator (variable) increases, the other indicator (variable) increases accordingly. This dependence is called directly proportional.

If the sign is minus, a larger value of one characteristic corresponds to a smaller value of the other. In other words, with a minus sign, an increase in one variable (characteristic, value) corresponds to a decrease in the other. This dependence is called inversely proportional. The choice of the variable to which the tendency to increase is attributed is arbitrary: it can be either X or Y. However, if X is considered to increase, Y correspondingly decreases, and vice versa.

Let's look at an example of Spearman correlation.

A psychologist wants to find out how individual indicators of school readiness, obtained before the start of school for 11 first-graders, are related to their average academic performance at the end of the school year.

To solve this problem, we ranked, first, the values of the school readiness indicators obtained upon admission to school and, second, the average final performance indicators of the same students at the end of the year. We present the results in the table:

We substitute the obtained data into the above formula and perform the calculation. We get:

To find the level of significance, we refer to the table “Critical values of the Spearman rank correlation coefficient,” which shows the critical values for the rank correlation coefficients.

We construct the corresponding “axis of significance”:

The resulting correlation coefficient coincided with the critical value for the 1% significance level. Consequently, it can be argued that the indicators of school readiness and the final grades of first-graders are connected by a positive correlation: the higher the indicator of school readiness, the better the first-grader studies. In terms of statistical hypotheses, the psychologist must reject the null hypothesis (H0) that the correlation does not differ from zero and accept the alternative (H1), which states that the relationship between indicators of school readiness and average academic performance is different from zero.
