Validity of qualitative methods. Method quality criteria: validity and reliability. Assessment of the validity of methods in psychology

After reliability, another key criterion for assessing the quality of methods is validity. The question of a method's validity is addressed only after its reliability has been sufficiently established, since an unreliable method cannot be valid, and even a reliable method is practically useless if its validity is unknown.

It should be noted that the question of validity has until recently seemed one of the most difficult. The most established definition of this concept is the one given in A. Anastasi's book: “Test validity is a concept that tells us what the test measures and how well it does it” (1982, p. 126). Validity is at its core a complex characteristic that includes, on the one hand, information about whether the technique is suitable for measuring what it was created for and, on the other, about its effectiveness and efficiency. For this reason, there is no single universal approach to determining validity. Depending on which aspect of validity the researcher wants to consider, different methods of evidence are used. In other words, the concept of validity includes different types, each with its own special meaning. Checking the validity of a technique is called validation.

Validity in its first understanding is related to the methodology itself, that is, it is the validity of the measuring instrument. This type of testing is called theoretical validation. Validity in the second understanding refers not so much to the methodology as to the purpose of its use. This is pragmatic validation.

So, during theoretical validation the researcher is interested in the property itself that the technique measures. This essentially means that psychological validation proper is being carried out. With pragmatic validation, the essence of the subject of measurement (the psychological property) is left out of consideration; the main emphasis is on proving that the “something” measured by the technique is connected with certain areas of practice.

Conducting theoretical validation, as opposed to pragmatic validation, sometimes turns out to be much more difficult. Without going into specific details for now, let us dwell in general terms on how pragmatic validity is checked: some external criterion, independent of the technique, is selected that determines success in a particular activity (educational, professional, etc.), and the results of the diagnostic technique are compared with it. If the connection between them is considered satisfactory, a conclusion is drawn about the practical effectiveness and efficiency of the diagnostic technique.

To determine theoretical validity, it is much more difficult to find an independent criterion that lies outside the technique. Therefore, in the early stages of the development of testology, when the concept of validity was just taking shape, there were only intuitive ideas of what a test measures:

1) the technique was recognized as valid, since what it measures is simply “obvious”;

2) the proof of validity was based on the researcher's confidence that his method allows him to “understand the subject”;

3) the technique was considered valid (i.e., the statement was accepted that such-and-such a test measures such-and-such a quality) only because the theory on which the technique was based was “very good.”

Acceptance of unfounded statements about the validity of the methodology could not continue for a long time. The first manifestations of truly scientific criticism debunked this approach: the search for scientifically based evidence began.

As already mentioned, to carry out theoretical validation of a technique is to show whether the technique really measures exactly the property, the quality, that, according to the researcher, it should measure. For example, if a test was developed to diagnose the mental development of schoolchildren, it is necessary to analyze whether it really measures this development and not some other characteristics (for example, personality, character, etc.). Thus, for theoretical validation the cardinal problem is the relationship between mental phenomena and the indicators through which we attempt to know them. Theoretical validation shows to what extent the author's intention and the results of the technique coincide.

It is not so difficult to carry out theoretical validation of a new technique if there is already a technique with known, proven validity for measuring a given property. The presence of a correlation between a new and a similar old technique indicates that the developed technique measures the same psychological quality as the reference one. And if the new method at the same time turns out to be more compact and economical in carrying out and processing the results, then psychodiagnosticians have the opportunity to use a new tool instead of the old one. This technique is especially often used in differential psychophysiology when creating methods for diagnosing the basic properties of the human nervous system (see Chapter VII).

But theoretical validity is proven not only by comparison with related indicators, but also by comparison with indicators with which, according to the hypothesis, there should be no significant connections. Thus, to check theoretical validity it is important to establish, on the one hand, the degree of connection with a related technique (convergent validity) and, on the other, the absence of such a connection with techniques that have a different theoretical basis (discriminant validity).

It is much more difficult to carry out theoretical validation of a technique when such a path is impossible. Most often, this is the situation a researcher faces. In such circumstances, only the gradual accumulation of various information about the property being studied, the analysis of theoretical premises and experimental data, and significant experience in working with the technique make it possible to reveal its psychological meaning.

An important role in understanding what a technique measures is played by comparing its indicators with practical forms of activity. But here it is especially important that the technique be carefully worked out theoretically, i.e., that it have a solid, well-founded scientific basis. Then, by comparing the technique with an external criterion taken from everyday practice that corresponds to what it measures, information can be obtained that supports theoretical ideas about its essence.

It is important to remember that if theoretical validity is proven, then the interpretation of the obtained indicators becomes clearer and more unambiguous, and the name of the technique corresponds to the scope of its application.

As for pragmatic validation, it involves testing a methodology in terms of its practical effectiveness, significance, and usefulness. It is given great importance, especially where the question of selection arises. The development and use of diagnostic techniques makes sense only when there is a reasonable assumption that the quality being measured is manifested in certain life situations, in certain types of activities.

If we again turn to the history of the development of testology (A. Anastasi, 1982; V.S. Avanesov, 1982; K.M. Gurevich, 1970; “General Psychodiagnostics”, 1987; B.M. Teplov, 1985, etc.), we can highlight a period (the 1920s-30s) when the scientific content of tests and their theoretical “baggage” were of less interest. It was important that the test “work” and help quickly select the most prepared people. The empirical criterion for assessing test tasks was considered the only correct guideline in solving scientific and applied problems.

The use of diagnostic techniques with purely empirical justification, without a clear theoretical basis, often led to pseudoscientific conclusions and unjustified practical recommendations. It was impossible to name exactly the abilities and qualities that the tests revealed. B.M. Teplov, analyzing the tests of that period, called them “blind tests” (1985).

This approach to the problem of test validity was typical until the early 1950s, not only in the USA but also in other countries. The theoretical weakness of empirical validation methods could not but arouse criticism from those scientists who, in the development of tests, called for relying not only on “bare” empirics and practice, but also on a theoretical concept. Practice without theory, as we know, is blind, and theory without practice is dead. Currently, the combined theoretical and pragmatic assessment of the validity of methods is perceived as the most productive.

To carry out pragmatic validation of a technique, i.e., to assess its effectiveness, efficiency, and practical significance, an independent external criterion is usually used: an indicator of the manifestation of the property being studied in everyday life. Such a criterion can be academic performance (for tests of learning abilities, achievement tests, intelligence tests), production achievements (for professionally oriented methods), the effectiveness of real activities such as drawing or modeling (for special ability tests), or subjective assessments (for personality tests).

American researchers Tiffin and McCormick (1968), after analyzing the external criteria used to prove validity, identified four types:

1) performance criteria (these may include the amount of work completed, academic performance, time spent on training, the rate of growth of qualifications, etc.);

2) subjective criteria (these include various types of answers reflecting a person's attitude towards something or someone, his opinions, views, preferences; subjective criteria are usually obtained using interviews, questionnaires, and surveys);

3) physiological criteria (they are used to study the influence of the environment and other situational variables on the human body and psyche; pulse rate, blood pressure, electrical resistance of the skin, symptoms of fatigue, etc. are measured);

4) accident criteria (applied when the purpose of the study concerns, for example, the problem of selecting for work those persons who are less susceptible to accidents).

The external criterion must meet three basic requirements:

it must be relevant, free from contamination and reliable.

Relevance refers to the semantic correspondence of a diagnostic tool to an independent, vitally important criterion. In other words, there must be confidence that the criterion involves precisely those features of the individual psyche that are measured by the diagnostic technique. The external criterion and the diagnostic technique must be in internal semantic correspondence with each other and be qualitatively homogeneous in psychological essence (K.M. Gurevich, 1985). If, for example, a test measures individual characteristics of thinking, the ability to perform logical actions with certain objects and concepts, then the criterion should also reveal the manifestation of precisely these skills. This applies equally to professional activity. It has not one but several goals and objectives, each of which is specific and imposes its own conditions for implementation. This implies the existence of several criteria for the performance of professional activity. Therefore, success on diagnostic techniques should not be compared with production efficiency in general. It is necessary to find a criterion that, based on the nature of the operations performed, is correlated with the technique.

If it is not known whether an external criterion is relevant to the property being measured, then comparing the results of a psychodiagnostic technique with it becomes practically useless: it does not allow one to come to any conclusions that could assess the validity of the technique.

The requirement of freedom from contamination is caused by the fact that, for example, educational or industrial success depends on two variables: on the person himself, his individual characteristics measured by the technique, and on the situation, the study and work conditions, which can introduce interference and “contaminate” the applied criterion. To avoid this to some extent, groups of people who are in more or less identical conditions should be selected for research. Another method can also be used: correcting for the influence of interference. This adjustment is usually statistical in nature. For example, productivity should be taken not in absolute terms but in relation to the average productivity of workers working under similar conditions.
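As an illustration of this kind of statistical correction, here is a minimal sketch in Python with made-up productivity figures (the numbers, group sizes, and the simple divide-by-group-mean adjustment are illustrative assumptions, not a prescribed procedure):

```python
import numpy as np

# Hypothetical output figures for workers from two shops with different
# equipment; comparing absolute output would "contaminate" the criterion.
shop_a = np.array([52.0, 48.0, 61.0, 55.0])
shop_b = np.array([34.0, 40.0, 31.0, 37.0])

# Correction: express each worker's output relative to the average
# of workers operating under the same conditions.
rel_a = shop_a / shop_a.mean()
rel_b = shop_b / shop_b.mean()

# The adjusted values are now comparable across shops and can serve
# as the external criterion.
criterion = np.concatenate([rel_a, rel_b])
print(criterion.round(2))
```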

When it is said that a criterion must itself be reliable, this means that it must reflect the constancy and stability of the function being studied.

The search for an adequate and easily identified criterion is a very important and complex task of validation. In Western testing, many methods are disqualified only because it was not possible to find a suitable criterion for testing them. For example, most questionnaires have questionable validity data because it is difficult to find an adequate external criterion that corresponds to what they measure.

Assessment of the validity of the methodology can be quantitative and qualitative.

To calculate a quantitative indicator (the validity coefficient), the results obtained when applying the diagnostic technique are compared with the data obtained using an external criterion for the same individuals. Various correlation coefficients are used (Spearman's rank correlation, Pearson's linear correlation).

How many subjects are needed to calculate validity? Practice has shown that there should be no fewer than 50, and preferably more than 200. The question often arises: what must the value of the validity coefficient be for it to be considered acceptable? In general, it is sufficient for the validity coefficient to be statistically significant. A validity coefficient of about 0.20-0.30 is considered low, 0.30-0.50 average, and over 0.60 high.
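A minimal sketch of how such a validity coefficient might be computed (the data are simulated; the sample size of 60 and the interpretation bands simply follow the figures quoted above):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
test = rng.normal(100, 15, size=60)                         # test scores
criterion = 0.4 * (test - 100) / 15 + rng.normal(0, 1, 60)  # external criterion

r, p = pearsonr(test, criterion)         # linear (Pearson) correlation
rho, p_rho = spearmanr(test, criterion)  # rank (Spearman) correlation

def interpret(v: float) -> str:
    # Bands from the text: ~0.20-0.30 low, 0.30-0.50 average, >0.60 high
    a = abs(v)
    if a > 0.60:
        return "high"
    if a >= 0.30:
        return "average"
    if a >= 0.20:
        return "low"
    return "negligible"

print(f"Pearson r = {r:.2f} ({interpret(r)}), p = {p:.3f}")
print(f"Spearman rho = {rho:.2f} ({interpret(rho)}), p = {p_rho:.3f}")
```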

But, as A. Anastasi (1982), K.M. Gurevich (1970), and others emphasize, it is not always legitimate to use linear correlation to calculate the validity coefficient. This technique is justified only when it has been proven that success in some activity is directly proportional to success in performing the diagnostic test. The position of foreign testologists, especially those involved in professional suitability and selection, most often comes down to the unconditional recognition that whoever has completed more tasks in the test is more suitable for the profession. But it may also be that, to succeed in an activity, one needs to possess the property at a level corresponding to solving 40% of the test; further success on the test no longer has any significance for the profession. A clear example from K.M. Gurevich's monograph: a postman must be able to read, but whether he reads at normal speed or at very high speed no longer has professional significance. With such a relationship between the indicators of the technique and the external criterion, the most adequate way to establish validity may be the criterion of differences.
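A sketch of what the criterion of differences might look like in practice: instead of correlating scores with the criterion, the test scores of two criterion groups (“successful” and “unsuccessful” by the external criterion) are compared. The data and the choice of Welch's t-test are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical test scores of workers judged successful vs. unsuccessful
# by the external criterion; beyond some threshold, extra points are
# assumed irrelevant, so a plain correlation would be misleading.
successful = np.array([18, 22, 25, 30, 35, 38, 20, 27])
unsuccessful = np.array([10, 12, 15, 9, 14, 11, 16, 13])

t, p = ttest_ind(successful, unsuccessful, equal_var=False)  # Welch's t-test
print(f"t = {t:.2f}, p = {p:.4f}")  # a significant difference supports validity
```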

Another case is also possible: a level of the property higher than the profession requires interferes with professional success. Thus, F. Taylor found that the most intellectually developed female production workers had low labor productivity; that is, their high level of mental development prevented them from working highly productively. In this case, analysis of variance or calculation of the correlation ratio would be more suitable for estimating the validity coefficient.
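For the nonmonotonic case just described, here is a hedged sketch of analysis of variance together with the correlation ratio (eta squared), on invented data where the middle test-score group is the most productive:

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical productivity for low / medium / high test-score groups.
low = np.array([40.0, 42.0, 38.0, 41.0])
medium = np.array([55.0, 58.0, 60.0, 57.0])
high = np.array([44.0, 43.0, 47.0, 45.0])

F, p = f_oneway(low, medium, high)  # one-way analysis of variance

# Correlation ratio (eta squared): between-group SS / total SS.
groups = [low, medium, high]
pooled = np.concatenate(groups)
grand = pooled.mean()
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_total = ((pooled - grand) ** 2).sum()
eta_sq = ss_between / ss_total

print(f"F = {F:.2f}, p = {p:.4f}, eta^2 = {eta_sq:.2f}")
```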

As the experience of foreign testologists has shown, no single statistical procedure is able to fully reflect the diversity of individual assessments. Therefore, another model is often used to prove the validity of techniques: clinical assessments. This is nothing more than a qualitative description of the essence of the property being studied. In this case, we are talking about the use of techniques that do not rely on statistical processing.

There are several types of validity, determined by the characteristics of diagnostic techniques as well as by the temporal status of the external criterion. In many works (A. Anastasi, 1982; L.F. Burlachuk, S.M. Morozov, 1989; K.M. Gurevich, 1970; B.V. Kulagin, 1984; V. Cherny, 1983; “General Psychodiagnostics”, 1987, etc.) the following are most often named:

1. Content validity. This type of validation is used primarily in achievement tests. Typically, achievement tests include not all the material that students have covered but some small part of it (3-4 questions). Can one be sure that correct answers to these few questions indicate mastery of all the material? This is what a content validity check should answer. To do this, success on the test is compared with expert assessments of teachers (on this material). Content validity also applies to criterion-referenced tests. This type is sometimes called logical validity.

2. Concurrent validity, or current validity, is determined by an external criterion on which information is collected at the same time as the experiments with the technique being tested. In other words, data relating to present performance are collected during the test period, and the results of success on the test are correlated with them.

3. Predictive validity (another name is “prognostic” validity). It is also determined by a fairly reliable external criterion, but information on it is collected some time after the test. The external criterion is usually a person's capacity, expressed in some kind of assessment, for the type of activity for which he was selected on the basis of the diagnostic test results. Although this type best corresponds to the task of diagnostic techniques (predicting future success), it is very difficult to apply. The accuracy of the forecast is inversely related to the time allotted for such forecasting: the more time passes after the measurement, the greater the number of factors that must be taken into account when assessing the prognostic significance of the technique. It is, however, almost impossible to take into account all the factors influencing the prediction.

Natural science and humanities paradigm in psychology

The entire history of the development of psychology can be characterized as the relationship between two opposing approaches, the natural-science approach and the humanitarian one, with the second gradually displacing the first in recent decades. As early as Aristotle it was argued that the study of the soul is the business of the natural scientist. The current state of affairs can be characterized as a crisis of attempts to build psychology on the model of natural science. The presence of whole branches of psychology that cannot be assigned to the natural-science line (psychoanalysis, humanistic psychology, logotherapy) only aggravates the crisis.

But in Russian psychology at present, according to V.I. Slobodchikov and E.I. Isaev, the prevailing orientation is still towards natural science, towards objectivity, towards measurement and experiment as the ideal of scientificity. Soviet psychology developed as an academic, scientistic discipline. In recent years, humanistic psychology has begun to take shape within the framework of psychological practice. The need to create a special psychotechnical theory has been recognized, i.e., a theory that grounds human-centered science and psychological practice. In essence, this means the creation of a humanistic psychology as an alternative to natural-science academic psychology.

V.N. Surkov notes that attempts by psychologists to meet natural-science standards in the field of interaction between theory and experiment have led to a “positivist overstrain” in psychology. Psychologists' defensive reaction to the pressure of “positivist rituals” is the widespread use of “shadow methodology” (the tradition of formulating hypotheses after conducting research, deriving them from the data obtained rather than from theories, selecting only “convenient” empirical data, etc.).

The main reasons preventing the establishment of psychology as a natural science are:

- the spiritual nature of man, which does not allow him to be considered as an object of “first nature” or a mechanism;

- human reflexivity and activity;

- the impossibility of merely controlling a person: an understanding position, love, help, and support are organic in relation to a person.

These reasons echo the specific characteristics of humanitarian knowledge, for a person here acts as a spiritual value and not merely as an “object of research.” The main goal of psychology is to understand another person, to explain a certain spiritual or cultural phenomenon, to endow it with meaning. The reflexive nature of psychological knowledge is manifested in the mutual influence of the subject and object of knowledge; the focus of psychology presupposes not just understanding but an active dialogue between the researcher and the object being studied.

Thus, the application of the requirements of the natural science standard in psychology is limited. According to numerous authors, even a psychological experiment, not to mention the entire complex of psychological knowledge, should be built according to humanitarian canons.

Conclusion

A significant part of the discussion about the scientific status of psychology is connected not so much with a discussion of the question of whether psychology is a science, but with the question of what standard (natural science or humanitarian) it should be guided by (and what criteria of scientific character it should meet).

Foreign psychologists are more inclined to perceive psychology within the framework of humanitarian psychology, while domestic psychologists still attach less importance to humanitarian knowledge in psychology compared to natural science. But the trend in recent years is still the humanization of knowledge of psychic reality. As many authors rightly point out, the acquisition of psychological knowledge should be based on a humanitarian paradigm, but to prove indisputable facts, the natural science paradigm is used, i.e. both paradigms in the study of psychic reality are necessary.

But, according to most psychologists, from the point of view of the prospects for research activity, it is within the framework of the humanitarization of knowledge that truly complex tasks are determined, which are a worthy challenge for the scientific community.

Validity criteria applied to qualitative research.

Validity of psychodiagnostic techniques

After reliability, another key criterion for assessing the quality of methods is validity. The question of the validity of a technique is resolved only after its sufficient reliability has been established, since an unreliable technique cannot be valid. But the most reliable technique without knowledge of its validity is practically useless.

It should be noted that the question of validity has until recently seemed one of the most difficult. The most established definition of this concept is the one given in A. Anastasi's book: “Test validity is a concept that tells us what the test measures and how well it does it.”

Validity at its core is a complex characteristic that includes, on the one hand, information about whether the technique is suitable for measuring what it was created for, and on the other hand, what its effectiveness, efficiency, and practical usefulness are.

For this reason, there is no single universal approach to determining validity. Depending on which aspect of validity the researcher wants to consider, different methods of evidence are used. In other words, the concept of validity includes its different types, which have their own special meaning. Checking the validity of a methodology is called validation.

Validity in its first understanding is related to the methodology itself, i.e. it is the validity of the measuring instrument. This type of testing is called theoretical validation. Validity in the second understanding refers not so much to the methodology as to the purpose of its use. This is pragmatic validation.

To summarize, we can say the following:

During theoretical validation, the researcher is interested in the property itself that the technique measures; this essentially means that psychological validation proper is being carried out;

With pragmatic validation, the essence of the subject of measurement (the psychological property) is left out of consideration; the main emphasis is on proving that the “something” measured by the technique is connected with certain areas of practice.

Conducting theoretical validation, as opposed to pragmatic validation, sometimes turns out to be much more difficult. Without going into specific details for now, let us dwell in general terms on how pragmatic validity is checked: some external criterion, independent of the technique, is selected that determines success in a particular activity (educational, professional, etc.), and the results of the diagnostic technique are compared with it. If the connection between them is considered satisfactory, a conclusion is drawn about the practical significance, effectiveness, and efficiency of the diagnostic technique.

To determine theoretical validity, it is much more difficult to find an independent criterion that lies outside the technique. Therefore, in the early stages of the development of testology, when the concept of validity was just taking shape, there were only intuitive ideas of what a test measures:

1) the technique was called valid, since what it measures is simply “obvious”; 2) the proof of validity was based on the researcher's confidence that his method allows him to “understand the subject”; 3) the technique was considered valid (i.e., the statement was accepted that such-and-such a test measures such-and-such a quality) only because the theory on which the technique was based was “very good.”

Acceptance of unfounded statements about the validity of the methodology could not continue for a long time. The first manifestations of truly scientific criticism debunked this approach: the search for scientifically based evidence began.

Thus, to carry out theoretical validation of a methodology is to prove that the methodology measures exactly the property, the quality, which the researcher intended it to measure.

So, for example, if some test was developed in order to diagnose the mental development of children, it is necessary to analyze whether it really measures this development, and not some other characteristics (for example, personality, character, etc.). Thus, for theoretical validation, the cardinal problem is the relationship between psychological phenomena and their indicators through which these psychological phenomena are sought to be known. This shows how much the author’s intentions and the results of the methodology coincide.

It is not so difficult to carry out theoretical validation of a new technique if there is already a technique with proven validity for measuring a given property. The presence of a correlation between a new and a similar already tested method indicates that the developed method measures the same psychological quality as the reference one. And if the new method at the same time turns out to be more compact and economical in carrying out and processing the results, then psychodiagnosticians have the opportunity to use a new tool instead of the old one.

But theoretical validity is proven not only by comparison with related indicators, but also by comparison with indicators with which, according to the hypothesis, there should be no significant connections. Thus, to check theoretical validity it is important to establish, on the one hand, the degree of connection with a related technique (convergent validity) and, on the other, the absence of such a connection with techniques that have a different theoretical basis (discriminant validity).

It is much more difficult to carry out theoretical validation of a method when such a verification method is impossible. Most often, this is the situation a researcher faces. In such circumstances, only the gradual accumulation of various information about the property being studied, the analysis of theoretical premises and experimental data, and significant experience in working with the technique make it possible to reveal its psychological meaning.

An important role in understanding what the methodology measures is played by comparing its indicators with practical forms of activity. But here it is especially important that the methodology be carefully worked out theoretically, that is, that there is a solid, well-founded scientific basis. Then, by comparing the technique with an external criterion taken from everyday practice that corresponds to what it measures, information can be obtained that supports theoretical ideas about its essence.

It is important to remember that if theoretical validity is proven, then the interpretation of the obtained indicators becomes clearer and more unambiguous, and the name of the technique corresponds to the scope of its application. As for pragmatic validation, it involves testing a technique from the point of view of its practical effectiveness, significance, and usefulness, since it makes sense to use a diagnostic technique only when it is proven that the property being measured is manifested in certain life situations, in certain types of activities. It is given great importance especially where the question of selection arises.

If we again turn to the history of the development of testology, we can highlight a period (the 1920s-30s) when the scientific content of tests and their theoretical “baggage” were of less interest. It was important that the test “work” and help quickly select the most prepared people. The empirical criterion for assessing test tasks was considered the only correct guideline in solving scientific and applied problems.

The use of diagnostic techniques with purely empirical justification, without a clear theoretical basis, often led to pseudoscientific conclusions and unjustified practical recommendations. It was impossible to accurately name the features and qualities that the tests revealed. They were essentially blind tests.

This approach to the problem of test validity was typical until the early 1950s, not only in the USA but also in other countries. The theoretical weakness of empirical validation methods could not but arouse criticism from those scientists who, in the development of tests, called for relying not only on bare empirics and practice, but also on a theoretical concept. Practice without theory, as we know, is blind, and theory without practice is dead. Currently, the combined theoretical and practical assessment of the validity of methods is perceived as the most productive.

To conduct pragmatic validation of a technique, i.e., to assess its effectiveness, efficiency, and practical significance, an independent external criterion is usually used: an indicator of the manifestation of the property being studied in everyday life. Such a criterion can be academic performance (for tests of learning abilities, achievement tests, intelligence tests), production achievements (for professionally oriented methods), the effectiveness of real activities such as drawing or modeling (for tests of special abilities), or subjective assessments (for personality tests).

American researchers J. Tiffin and E. McCormick, having analyzed the external criteria used to prove validity, identified four types:

1) performance criteria (these may include the amount of work completed, academic performance, time spent on training, the rate of growth of qualifications, etc.);

2) subjective criteria (these include various types of answers reflecting a person's attitude towards something or someone, his opinions, views, preferences; subjective criteria are usually obtained using interviews, questionnaires, and surveys);

3) physiological criteria (used to study the influence of the environment and other situational variables on the human body and psyche; pulse rate, blood pressure, electrical resistance of the skin, symptoms of fatigue, etc. are measured);

4) accident criteria (applied when the purpose of the study concerns, for example, the problem of selecting for work those persons who are less susceptible to accidents).

The external criterion must meet three basic requirements:

- it must be relevant;

- free from interference;

- reliable.

Relevance refers to the semantic correspondence of a diagnostic tool to an independent, vitally important criterion. In other words, there must be confidence that the criterion involves precisely those features of the individual psyche that are measured by the diagnostic technique. The external criterion and the diagnostic technique must be in internal semantic correspondence with each other and be qualitatively homogeneous in psychological essence. If, for example, a test measures individual characteristics of thinking, the ability to perform logical actions with certain objects and concepts, then the criterion should also reveal the manifestation of precisely these skills. This applies equally to professional activity. It has not one but several goals and objectives, each of which is specific and imposes its own conditions for implementation. This implies the existence of several criteria for the performance of professional activity. Therefore, success on diagnostic techniques should not be compared with production efficiency in general. It is necessary to find a criterion that, based on the nature of the operations performed, is correlated with the technique.

If it is not known whether an external criterion is relevant to the property being measured, then comparing the results of a psychodiagnostic technique with it becomes practically useless: it does not allow one to come to any conclusions that could assess the validity of the technique.

The requirement of freedom from interference is caused by the fact that, for example, educational or industrial success depends on two variables: on the person himself, his individual characteristics measured by the technique, and on the situation, the study and work conditions, which can introduce interference and “contaminate” the applied criterion. To avoid this to some extent, groups of people who are in more or less identical conditions should be selected for research. Another method can also be used: correcting for the influence of interference. This adjustment is usually statistical in nature. For example, productivity should be taken not in absolute terms but in relation to the average productivity of workers working under similar conditions.

When it is said that a criterion must itself be reliable, this means that it must reflect the constancy and stability of the function being studied.

The search for an adequate and easily identified criterion is a very important and complex task of validation. In Western testing, many methods are disqualified only because it was not possible to find a suitable criterion for testing them. For example, most questionnaires have questionable validity data because it is difficult to find an adequate external criterion that corresponds to what they measure.

Assessment of the validity of methods can be quantitative and qualitative.

To calculate a quantitative indicator (the validity coefficient), the results obtained when applying the diagnostic technique are compared with the data obtained using an external criterion for the same individuals. Various correlation coefficients are used (Spearman's rank correlation, Pearson's linear correlation).

How many subjects are needed to calculate validity?

Practice has shown that there should be no fewer than 50, and preferably more than 200. The question often arises: what must the value of the validity coefficient be for it to be considered acceptable? In general, it is sufficient for the validity coefficient to be statistically significant. A validity coefficient of about 0.20-0.30 is considered low, 0.30-0.50 average, and over 0.60 high.

But, as A. Anastasi, K.M. Gurevich, and others emphasize, it is not always legitimate to use linear correlation to calculate the validity coefficient. This technique is justified only when it has been proven that success in some activity is directly proportional to success in performing the diagnostic test. The position of foreign testologists, especially those involved in professional suitability and selection, most often comes down to the unconditional recognition that whoever has completed more tasks in the test is more suitable for the profession. But it may also be that, to succeed in an activity, one needs to possess the property at a level corresponding to solving 40% of the test; further success on the test no longer has any significance for the profession. A clear example from K.M. Gurevich's monograph: a postman must be able to read, but whether he reads at normal speed or at very high speed no longer has professional significance. With such a relationship between the indicators of the technique and the external criterion, the most adequate way to establish validity may be the criterion of differences.

Another case is also possible: a level of the property higher than the profession requires interferes with professional success. Thus, as early as the dawn of the 20th century, the American researcher F. Taylor found that the most intellectually developed female production workers had low labor productivity; that is, their high level of mental development prevented them from working highly productively. In this case, analysis of variance or calculation of the correlation ratio would be more suitable for estimating the validity coefficient.

As the experience of foreign testologists has shown, not a single statistical procedure is able to fully reflect the diversity of individual assessments. Therefore, another model is often used to prove the validity of methods - clinical assessments. This is nothing more than a qualitative description of the essence of the property being studied. In this case, we are talking about the use of techniques that do not rely on statistical processing.

7. The concepts of test reliability, validity, and credibility according to A.G. Shmelev.

Test properties

What other important implications can we draw from the test-as-weapon metaphor? This metaphor allows us to understand more accurately and deeply a number of instrumental requirements that tests must meet, as well as the standards for using tests. I am not going to list here all the psychometric properties of tests, but some of the most important ones are worth mentioning, if not strictly, then at least purely metaphorically.

1) Test reliability. Can a weapon made in a makeshift semi-basement workshop, as they say, “on the knee,” be reliable? Such a weapon will shoot anywhere: sometimes at the target, but more often to the side, and sometimes it can simply explode in the shooter's hands. Here it is appropriate to recall the following: reliable tests are not created in tiny laboratories (and especially not at a desk by a lone author). The reliability of a test is not only checked on a representative (mass) sample; it simply cannot be developed without extensive statistics. A representative standardization sample is a kind of testing ground for the new weapon. Only after such field tests can the test designer make targeted (“sighting-in”) adjustments to the original design of his weapon. Thus, already in the example of this one property, reliability, we see what the test-as-weapon metaphor gives us in this context. A bad weapon does not strengthen but, on the contrary, weakens the user and puts him at risk. But is it possible to judge the quality of weapons in general by samples of handicraft weapons? It is not tests in general that are bad, but unreliable tests.

2) Test validity. Let us recall that this is a measure of the test's suitability for the purposes of psychodiagnostics, a measure of its correspondence to the property being measured. Where will the weapon shoot? This depends not only on the reliability of the test itself, but also on the user. An unreliable test cannot be valid. This axiom of measurement theory is easy to understand in this context: if you cannot hit a silhouette from five paces, then what validity, what correspondence of the test to the measured property, can we talk about? With such a “test” you may hit not the enemy but one of your own, the one standing next to him; that is, with the test you “catch” not the target property but some other mental property. And if the shooter himself is blind, if he is colorblind and does not distinguish the colors of the uniforms in which his own and others are dressed, if he is also a panicker, then in a panic he will fire even a reliable small arm at both his own and others. Thus we can easily formulate an important consequence: a test cannot be valid in the hands of a non-professional. Here is another axiom of testology which, alas, can be so difficult to explain not only to a mass audience but also to psychologists themselves, because at the words “reliability” and “validity” terrible and incomprehensible psychometric formulas float into their minds. Therefore, these concepts seem to them more mathematical than psychological, that is, alien to their “humanitarian intellect.”

Again, in this context, let us return to the criticism of tests. Is it possible to judge a test, still less tests in general, if even quite high-quality factory weapons are handed to panicked recruits, who either shoot sparrows with a cannon (for example, use a heavy IQ battery like the Wechsler test to diagnose attention deficit disorder) or vainly fire a pistol at an armored tank (trying to understand the nature and substantive content of an internal conflict from color preferences in the Luscher test, which, in my opinion, is suitable only for a rough assessment of mood background)? Any person more or less knowledgeable in military affairs understands as clearly as two plus two: there is no universal weapon, and in different battle conditions different means must be used. But the human psyche is a subtler reality, invisible to outsiders, than the battlefield. And so we confuse everything in the world: a sluggish positional firefight, an active artillery barrage, and a furious full-length bayonet attack, when it is time to pull the grenades from our belts. When you take a very brief sample of a few tasks (a few hidden figures from the Gottschaldt test, a few Rorschach inkblots), you should be aware that you are about as likely to stumble upon diagnostically valuable information as you are to hit a steel bunker with a light infantry grenade. Most likely there will be no result! But should we then conclude that all tests are ineffective? I would say that many single psychological tests are a very weak weapon against well-camouflaged fortifications, against the defense in depth of the multi-story human psyche, which by the time of social maturity develops many layers of very sophisticated psychological defense mechanisms. Here we come to the problem of credibility: the problem of the relationship between conscious and unconscious mechanisms of psychological defense against testing. R. Cattell once called this the problem of motivational distortions. It sounds beautiful, although we are talking about ugly things: about more or less conscious lying.

3) Credibility. This is the problem of falsification. In this context, let us formulate the following somewhat paradoxical professional-ethical standard: “The subject has the right to lie.” Indeed, if the test is a weapon of penetration into the human psyche, then the subject has the right to self-defense, to resist this penetration. In the end, one can even justify a subject who managed to hide his problems and defects by mobilizing himself to give socially desirable answers: in this way, at the moment of testing, he demonstrates the strength of his compensatory mechanisms, his ability to cope with tasks of moral and intellectual development, etc., although perhaps in everyday life he behaves differently. The strength of the armored hull of his ship, which ensured its unsinkability, turned out to be greater than the blow the psychologist dealt with his weapon. Honor and praise to such a subject. But this thesis has an important consequence: positive test results have less value and less predictive power than negative ones.

Thus, if we finally grasp the basic ideas about the essence of the test, we will learn to apply it adequately in social practice. As long as we misinterpret the essence of the test and fail to see the limitations on the practice of its use, we will make serious mistakes. Is it necessary to ban the proliferation of weapons in a society where no one really knows how to use them competently? Apparently, it would still be wiser not to ban them altogether, but rather to limit them to a narrower circle of trained, certified users, and to provide those users only with certified tools, not random ones. If unfortunate builders erect multi-story buildings on swamps or quicksand without laying a solid foundation, i.e., violate all the rules of safe construction technology, this does not mean that architectural institutes, all factories producing building materials, and construction organizations themselves should be banned. If someone misuses certain medications by turning them into drugs, this does not mean that the pharmaceutical industry should be banned, although the strictness of control over the distribution of dangerous drugs will, of course, have to be increased.

Tests and expert assessments

In my opinion, standardized tests do not provide grounds for a final positive diagnosis (i.e., a conclusion of suitability for a certain activity); for this they must be supplemented by expert assessments (or other, less standardized diagnostic procedures that include an element of expert assessment to one degree or another, as happens, for example, in projective techniques).

Thus, a positive outcome of a test trial is a logically necessary but not sufficient condition for a final positive conclusion. Since I, as a testologist, am unfortunately well aware that our fellow citizens sometimes have serious problems with elementary logic, let us schematize what has been said in the form of the following table:

Positive test result     | no final conclusion yet (admission to further verification)
Negative test result     | conclusion of unsuitability

Let us explain this with a meaningful example. First, let's take the most trivial case, far from psychology - the already mentioned exam on knowledge of traffic rules. If the candidate passes the test according to the rules, then he cannot yet be issued a license - he must then pass a less formal practical driving test. If the candidate fails the test, he is not allowed to take the next test. In this context, it’s time to also make the following disclaimer: a negative test result is not a death sentence. Everyone understands that you can learn the rules, come again and retake the exam.

Let us now take a less obvious procedure (not yet formalized by regulation): testing a job candidate for the level of so-called “corporate loyalty.” Suppose the subject is presented with a completely primitive questionnaire containing straightforward questions like “Have you ever deceived teachers when taking exams at school?” As we said above, the subject in this case exercises his right to falsify and answers, “No, I have not.” And what conclusion do we draw in this case? None! But if the subject suddenly, in a fit of frankness, answers, “Yes, it has happened,” then at the very least one should be wary.

This principle applies to an even greater extent to elementary tests of basic professional knowledge. If a candidate for an accountant position cannot answer a question in a competitive test questionnaire about what a “chart of accounts” is, should we continue working with this candidate? Should the expensive time of qualified experts be spent on a detailed interview with such a candidate? Of course not.

Thus, I propose literally everywhere, in all branches of practice, to use the test as a primary cheap and formalized filter, preceding the use of more complex and expensive expert procedures. To some extent, personnel assessment specialists who use the Assessment Center technology are currently guided by a similar logic.

So the table above should be modified to look like this:

                         | Positive expert assessment   | Negative expert assessment
Positive test result     | conclusion of suitability    | conclusion of unsuitability
Negative test result     | conclusion of unsuitability  | conclusion of unsuitability

As we see, for a positive general conclusion, a conjunction (logical “AND”) of two independent events is required - a positive test outcome and a positive outcome of the expert assessment. The absence of at least one of the positive outcomes does not make it possible to draw a general positive conclusion.

The quality of such a two-filter selection system is in any case higher than that of any single-filter system based only on expert assessments or only on tests. And the talk that in our country test results are very easy to buy (alas, such talk often arose, for example, on the discussion forum of the Unified State Exam portal ege.edu.ru) is either deliberately demagogic or once again reveals a defect in logical thinking. Where test results can be bought, the results of an expert assessment can, as a rule, be bought as well, and one still needs to study specifically which of the filters is actually less susceptible to being bought. Even if the keys are leaked and the test is widely distributed, a negative test outcome still retains its value; and it is especially important that incorruptible experts come into play after a positive outcome. If we connect the results of the two procedures with a logical “AND”, then it is more correct to aggregate the numerical results of the test and the expert assessment not additively but multiplicatively, that is, not to sum them but to multiply them:

O = T × E,

where T is the test result, E is the result of the expert assessment, and O is the overall score. If either factor takes a zero value (falls below the minimum threshold), the overall result is zero regardless of the value of the other factor. With non-zero values of both components, the maximum result for a given sum of scores is achieved when the values of T and E are close to each other. This approach somewhat neutralizes the effect of overestimating one indicator through its “purchase.”
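A minimal sketch of this multiplicative aggregation (the [0, 1] scaling and the 0.3 thresholds are illustrative assumptions, not values from the text):

```python
def overall_score(t: float, e: float,
                  t_min: float = 0.3, e_min: float = 0.3) -> float:
    """Multiplicative aggregation O = T * E of a test score t and an
    expert score e, both assumed normalized to [0, 1]."""
    if t < t_min or e < e_min:
        return 0.0  # logical AND: one failed filter vetoes the outcome
    return t * e

print(overall_score(0.9, 0.0))  # 0.0: a failed (or bought-out) stage vetoes
print(overall_score(0.7, 0.7))  # 0.49
print(overall_score(0.9, 0.5))  # 0.45: lower than the balanced 0.7 * 0.7
```

The last two lines illustrate the remark above: for the same sum of scores, the product is largest when T and E are close to each other.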




In the broad sense of the word, validity, i.e., the soundness of a method, means the connection between the empirical data obtained with its help and the main goals of the study. The question of the validity of qualitative methods was in past years greatly confused by specialists in mathematical statistics, who extended very specific statistical criteria of validity to classes of problems and research situations that have nothing in common with the ideal objects, such as multi-colored balls drawn from a basket, with which probability theory operates.

Before moving on to describing qualitative research, especially group research, it is necessary to describe how it differs from quantitative research. To understand these differences more fully, it is important to understand what, strictly speaking, counts as an “error” of a study.

Quantitative sociological research is a type of research based on the mathematical theory of probability. Among the axiomatic premises of this theory there is a very important one: that the differences between the analyzed objects are limited to a fixed set of discrete characteristics. For example, the balls lying in a basket differ in color, size, and the numbers drawn on them. People can likewise differ in their demographic features, attitudes, etc., but in any given questionnaire the set of features is limited by the number of quantified questions, and all other possible features are assumed to be identical.

The main criterion characterizing a study of the statistical type is reliability, i.e., the reproducibility of the results obtained. If a repeat survey using the same methodology in the same social group yields identical results, those results are reliable. Today no one disputes that a correctly conducted mass representative survey using formalized questionnaires automatically achieves a high degree of reproducibility. The question of validity, however, is far from settled by this.
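One hedged way to operationalize such a reproducibility check is to compare the answer distributions of two survey waves with a chi-square test of homogeneity; the counts below are invented:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts per response option for the same question asked
# in two waves of a survey of the same social group.
wave1 = np.array([120, 340, 540])
wave2 = np.array([130, 325, 545])

chi2, p, dof, _ = chi2_contingency(np.array([wave1, wave2]))
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
# A high p-value gives no evidence against reproducibility of the results.
```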

In mathematical sociology, the validity of a study is usually interpreted as the degree to which the means of measurement corresponds to what was to be measured. The dictionary further explains that, in the strict sense of the word, validation is possible only in the presence of an independent external criterion, and such a situation is rare in sociology. In all other cases, the validity of the results of quantitative surveys is nothing more than a hypothesis, and the assessment of its degree of likelihood has nothing to do with mathematical-statistical procedures. The low credibility of many implicit substantive hypotheses latently embedded by researchers in the wording and structure of formalized questions, and sometimes the complete absence of such credibility, is a very serious and poorly understood problem.

Thus, the statistical reliability of quantitative research results should not be confused with their reliability and validity in the broad sense of the word. Strictly speaking, quantitative research is reliable only to the extent that the problem of reliability itself can be reduced to its statistical interpretation. If such reduction fails or is impossible in principle, quantitative data becomes an extremely unreliable basis for conclusions.

When comparing quantitative and qualitative methods from the standpoint of their validity, it should first of all be noted that the areas of their valid application do not coincide with each other. This makes a generalized comparison of them based on validity criteria meaningless. There are classes of problems in which quantitative methods have high, and qualitative methods have low validity. At the same time, there are - and this aspect is usually poorly emphasized even in the specialized literature - other classes of problems in which the indicated relationship is directly opposite.

It is not the purpose of our textbook to consider the methodology of qualitative methods in general. The specificity of focus groups, as well as of individual in-depth interviews when they are conducted in large series, is that, at least in theory, statistical validity criteria are also applicable to them, although different ones than in quantitative research.

Note that the text transcripts of a series of group interviews conducted on a specific topic form an array of primary data several hundred pages long. This array is quite suitable for analysis by statistical methods, both in size and in heterogeneity. The heterogeneity of the array is ensured by the participation of several dozen respondents, which already gives grounds for approximately distributing similar answers along a three- or five-member scale: a clear minority, a minority, approximately equal, a majority, a clear majority (a toy illustration follows the list below). The main point, however, lies elsewhere. The specificity of the primary data array of group interviews consists in the following:

1. The unit of analysis is not the respondent but the utterance. Since each respondent is the bearer of many utterances, this increases the array of primary analytical units by at least an order of magnitude, making it statistically significant.

2. The task of qualitative research does not include determining the number or proportion of carriers of a particular position in society or its segment. In relation to this class of problems, qualitative methods are invalid.
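The toy illustration promised above: a sketch of mapping the share of respondents voicing a position onto the five-member scale (the thresholds are purely illustrative assumptions):

```python
def scale_label(share: float) -> str:
    # Five-member scale from the text; cutoffs are invented for illustration.
    if share < 0.15:
        return "clear minority"
    if share < 0.40:
        return "minority"
    if share <= 0.60:
        return "approximately equal"
    if share <= 0.85:
        return "majority"
    return "clear majority"

print(scale_label(22 / 40))  # 22 of 40 respondents -> "approximately equal"
```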

The task of qualitative methods is to form a list of so-called “existence hypotheses,” i.e., a list of opinions, assessments, or statements that exist in society and presumably have a non-zero degree of distribution. In this case, as J. Templeton notes, it is preferable to err by identifying a non-existent or insignificant factor than to miss a highly significant one.

The mathematical apparatus adapted for solving problems of this type is, in principle, well known. It is used in linguistics for compiling lists of sounds and syllables, as well as frequency dictionaries of words and phrases. The same apparatus is used in sociological research based on content analysis. For the latter case, the mathematical formulation of the problem looks roughly like this: “There is a presidential candidate A, who is written about in the newspapers. It is required to compile as complete a list as possible of the epithets with which the authors of the articles characterize this candidate. How many newspaper texts should be studied so that, with 95% probability, the number of unidentified epithets does not exceed 5%?”
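
To make this kind of coverage reasoning concrete, here is a minimal Python sketch (our own illustration, not taken from the source; the epithet list is invented). It uses the Good-Turing idea, a common nonparametric device under which the share of occurrences belonging to still-unseen epithets is estimated from the number of epithets observed exactly once:

from collections import Counter

# Hypothetical coding result: every epithet occurrence found in the articles so far.
observed = ["decisive", "experienced", "corrupt", "young", "decisive",
            "honest", "experienced", "decisive", "weak", "honest"]

counts = Counter(observed)
n = sum(counts.values())                         # total epithet occurrences
f1 = sum(1 for c in counts.values() if c == 1)   # epithets seen exactly once

unseen_share = f1 / n  # Good-Turing estimate of the probability mass of unseen epithets
print(f"Distinct epithets so far: {len(counts)}")
print(f"Estimated share of still-unseen epithets: {unseen_share:.0%}")
# One would keep coding further texts until this estimate falls below 5%,
# which loosely corresponds to the 95%/5% formulation quoted above.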

Like the vast majority of applied statistical problems, this problem cannot be solved without some preliminary knowledge about the shape of the frequency distribution of the epithets sought, and without certain a priori assumptions. Depending on which system of assumptions is practically convenient, the formulation of the problem itself may vary. Going deeper into this issue is beyond the scope of our topic, because in applied research carried out by the focus group method a statistical apparatus of this kind is used, if at all, only in highly specialized studies far removed from the domain of marketing focus groups.
There are two main reasons for this. The first is that such an apparatus greatly increases the cost of research, and a commercial client is not inclined to pay for mathematical embellishments that do not affect the final conclusions in any way. For a number of reasons described below, both clients and researchers consider it quite sufficient to rely on the following subjective criterion: if the amount of new information obtained from each subsequent group drops sharply, the study should be stopped.

The second reason is much more fundamental. It is connected with the fact that, today, strictly operational extraction of semantic units from texts, amenable to automation, is possible only at the level of words and fixed phrases. The isolation, grouping and typologization of more complex semantic units, carried out at the analytic stage of qualitative sociological research, can so far be performed only by a human, relying on unconscious intellectual algorithms that have not yet been studied. Rapid progress in computer-assisted translation suggests that automated recognition of increasingly complex units of meaning will become feasible over time, but this work has not yet had any effect on the practice of focus group research. In our study of the literature on marketing focus groups we never came across any mention of the use of content analysis in any form. Such references do exist in academic research, but studying this question requires special work. We note here that in the early 1990s the most up-to-date work on computer-assisted content analysis was considered to be that of Weber.

To summarize, let us turn to the question of delimiting the areas of valid application of quantitative and qualitative research. It was shown above that these areas are fundamentally different, since the classes of problems they solve differ radically. The area of valid application of formalized surveys seems limitless, or very wide, only at first glance. In fact, it is limited to identifying the degree of prevalence of certain knowledge, opinions or attitudes, which:

a) must be known in advance, i.e. before the survey;

b) should not be a fiction imposed on the respondent or pseudo-judgments that are not inherent in his consciousness.

Quantitative methods are not suitable for identifying the very fact of the existence of knowledge, opinions or attitudes, as can be clearly seen from the following comparison of survey results.

A. Quantitative Research

Question: What do you prefer - apple pie or chocolate cupcake? (% of the number of respondents)

Apple Pie - 26%

Chocolate cupcake - 22%

Both - 43%

Difficult to answer - 9%

B. Qualitative Research

Question: What do you prefer - apple pie or chocolate cupcake?

Answer: I don't know. I love both.

Question: And if you had to take just one of them, which would it be? Think about it.

Answer: Of course, pies differ. If I have the chance to take my mother's apple pie, I will prefer it to any chocolate cupcake. But if it is just some apple pie or other, then I really don't know.

Question: What else might your choice depend on?

Answer: For example, on what I have for lunch. If I have a full lunch, I think I'll take the apple pie. Apple pie is a great delicacy in my family. But if I ate something light for lunch, like fish, then it's better to take the cupcake. And if it's cold, I won't refuse a chocolate cupcake.

The above dialogue illustrates well that the simple answer “I choose apple pie” depends on many factors: in this case, on who baked the pie, the degree of hunger, how heavy lunch was, and the ambient temperature. The list could probably be continued. But, as in many other cases, the number of such factors, or at least of the most common ones, does not appear to be very large. The task of qualitative research, as already mentioned, is to identify the list of these factors with a reasonable degree of completeness. In this area qualitative research has a high degree of validity. Determining the frequency distribution of the effects of the identified factors in the population under study is, in turn, a matter for quantitative research. Two caveats, however, are important:

a) from a practical point of view, the cost of conducting a quantitative study may exceed the expected risk of making an executive decision based on less accurate information;

b) adequately transforming the identified factors into questions of a formalized questionnaire is often difficult or impossible, and even the possible degree of inadequacy is often extremely difficult to estimate.

These circumstances often reduce the validity of quantitative research to such an extent that it becomes impractical to conduct it.

Only when the hypothesis that the question wording of a formalized questionnaire is valid seems reasonable or plausible can quantitative research produce a valid result that allows decisions to be made on the basis of more accurate information.

Validity of the method. The validity of a research or diagnostic method (the term literally means “complete, suitable, appropriate”) shows the extent to which it measures the quality (property, characteristic) it is intended to assess. Validity (adequacy) indicates the degree to which the method corresponds to its purpose. The more closely the method captures the diagnostic feature it is designed to detect and measure, the higher its validity.

The concept of validity refers not only to the method itself but also to the criterion for assessing its quality, the validity criterion. This is the main sign by which one can judge in practice whether a given technique is valid.

There are several types of validity of diagnostic methods.

Theoretical (conceptual) validity is determined by the correspondence between the indicators of the quality under study obtained with the given technique and the indicators obtained with other techniques, with whose indicators there should be a theoretically justified relationship. Theoretical validity is tested by correlating indicators of the same property obtained by different methods based on the same theory.

Empirical (pragmatic) validity is checked by the correspondence of diagnostic indicators to real-life behavior, the observed actions and reactions of the subject. If, for example, we use a certain technique to assess the character traits of a subject, the technique is considered practically or empirically valid when we establish that in life this person behaves exactly as the technique predicts, i.e., in accordance with the measured trait.

Internal validity means the correspondence of the tasks, subtests, judgments, etc. contained in the method to the overall goal and intent of the method as a whole. A method is considered internally invalid, or insufficiently internally valid, when all or some of the questions, tasks or subtests included in it do not measure what the method requires.

External validity is approximately the same as empirical validity, with the only difference that here we are talking about the connection between the method's indicators and the most important key external features of the subject's behavior.

Apparent (face) validity describes the subject's impression of the method, i.e., validity from the subject's point of view. The technique should be perceived by the subject as a serious tool for understanding his personality, somewhat akin to medical diagnostic instruments.

Predictive validity is established using a correlation between the indicators of the method and some criterion characterizing the property being measured, but at a later time. L. Cronbach considers predictive validity to be the most convincing evidence that a technique measures exactly what it was intended to measure.
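
A toy illustration of such criterion-based validation in Python (the data and variable names are invented): test scores are correlated with an external criterion measured later, and the resulting Pearson coefficient plays the role of the validity coefficient discussed below.

from math import sqrt

test_scores = [12, 15, 9, 18, 14, 11, 16, 13]            # diagnostic technique, time 1
criterion   = [3.1, 3.8, 2.5, 4.4, 3.5, 2.9, 4.0, 3.2]   # e.g., performance ratings, time 2

def pearson_r(x, y):
    """Plain Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(f"Predictive validity coefficient: r = {pearson_r(test_scores, criterion):.2f}")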



Content validity is determined by confirming that the tasks of the method cover all aspects of the studied area of behavior. It is often called “logical validity” or “validity by definition” and means that the method is valid in the judgment of experts. It is usually assessed for achievement tests. In practice, to determine content validity, experts are asked to indicate which domain (or domains) of behavior is most important.

From the description of the types of validity it follows that there is no single indicator by which the validity of a diagnostic technique is established. However, the developer must provide significant evidence in favor of the validity of the proposed methodology.

It is easy to see the direct connection between validity and reliability. A technique with low reliability cannot have high validity, since the measuring instrument is incorrect and the trait that it measures is unstable. This technique, when compared with an external criterion, can show high agreement in one case, and extremely low agreement in another. It is clear that with such data it is impossible to draw any conclusions about the suitability of the technique for its intended purpose.

Deriving a validity coefficient is a labor-intensive procedure and is unnecessary when a technique is used by a researcher on a limited scale and is not intended for wide use. The validity coefficient is subject to the same requirements as the reliability coefficient: the more methodologically sound the criterion, the higher the validity coefficient should be. A low validity coefficient is most often observed when the criterion reflects only minor aspects of what the technique measures.

Reliability of the research method. Reliability is one of the criteria of the quality of a diagnostic result, referring to the accuracy and stability of the indicators of the feature being diagnosed. The greater the reliability of a technique, the freer it is from measurement errors. In the broadest sense, reliability characterizes the extent to which the differences among subjects revealed by the technique reflect actual differences in the properties being measured, and the extent to which they can be attributed to random error.

In diagnostic theory, the concept of reliability has two meanings: the reliability of the technique as a specific instrument (for example, when using a meter rule we are sure that it remains unchanged no matter what we measure) and the relative constancy of the object of diagnosis (we must be sure that, under normal conditions, the measured quantity remains unchanged).

The concept of reliability is associated with the accuracy of measurements, or rather, with the assessment of error and the determination on this basis of the true value of a quantity.

There are three main ways of assessing the reliability of a diagnostic technique.

The retest technique, or repeated diagnostics, has the same subjects perform the same tasks at different times; the relationship between the two sets of results is then computed and expressed as a self-correlation (test-retest) coefficient.

The split-half technique: the set of tasks, completed once, is divided in half (for example, the first half-test comprises the odd-numbered tasks and the second the even-numbered ones); each subject's results are then determined for both half-tests and the correlation coefficient between them is calculated.

The parallel-test technique: to measure the same knowledge, two different sets of tasks are constructed that resemble twins in content; the two parallel sets are administered either one immediately after the other or at an appropriate interval.

In all cases the technique is considered reliable if the correlation coefficient r > 0.7 (on the correlation coefficient, see Section 4.2).
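
A Python sketch of the split-half computation (the item scores are invented). The step-up from the half-test correlation to an estimate of full-test reliability uses the Spearman-Brown formula, a standard correction that the text itself does not name:

from statistics import correlation  # Python 3.10+

# Hypothetical item scores: one row per subject, ten items each.
items = [
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
]

odd_half  = [sum(row[0::2]) for row in items]   # items 1, 3, 5, ...
even_half = [sum(row[1::2]) for row in items]   # items 2, 4, 6, ...

r_half = correlation(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)              # Spearman-Brown step-up

print(f"Half-test correlation: {r_half:.2f}")
print(f"Estimated full-test reliability: {r_full:.2f}")
print("Reliable by the r > 0.7 rule of thumb:", r_full > 0.7)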

In testing methodology, it is customary to distinguish three reliability coefficients:

1) the stability, or constancy, coefficient: the correlation between the results of a first and a repeated administration of the same test to the same sample of subjects;

2) the equivalence coefficient: the correlation between the results of testing the same contingent of subjects with variants of the same test, or with different tests that are equivalent in form and purpose;

3) the coefficient of internal constancy, or internal homogeneity, which corresponds to the correlation between the results of parts of the test performed by the same subjects.
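
As an illustration of the third coefficient, here is a sketch of Cronbach's alpha, one widely used estimate of internal homogeneity (the text does not prescribe a particular formula; the data are invented):

from statistics import pvariance

items = [  # rows = subjects, columns = test items
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [4, 5, 4, 5],
    [3, 3, 2, 3],
    [5, 4, 5, 4],
]

k = len(items[0])                                    # number of items
item_vars = [pvariance(col) for col in zip(*items)]  # variance of each item
total_var = pvariance([sum(row) for row in items])   # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")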

3. Classifications of pedagogical research methods

There are several classifications of pedagogical research methods. Depending on the basis of classification, research methods in pedagogy are divided into:

· empirical and theoretical;

· ascertaining and transformative;

· qualitative and quantitative;

· private and general;

· methods of collecting empirical data, testing and refuting hypotheses and theories;

· methods of description, explanation and forecast;

· special methods used in individual pedagogical sciences;

· methods for processing research results, etc.

General scientific methods (used by different sciences) include:

· general theoretical (abstraction and concretization, analysis and synthesis, comparison, contrast, induction and deduction, i.e., logical methods);

· sociological (questionnaires, interviews, expert surveys, ratings);

· socio-psychological (sociometry, testing, training);

· mathematical (ranking, scaling, indexing, correlation).

Concrete scientific (specifically pedagogical) methods, in turn, are divided into theoretical and empirical (practical).

Theoretical methods serve to interpret, analyze and generalize theoretical positions and empirical data. They include theoretical analysis of the literature, archival materials and documents; analysis of the basic concepts and terms of the research; the method of analogies; the construction of hypotheses and thought experiments; forecasting; modeling, etc.

Empirical methods are intended for the creation, collection and organization of empirical material - facts of pedagogical content, products of educational activities.

Empirical methods include, for example, observation, conversation, interviewing, questionnaires, methods of studying the products of students' activity and school documentation, assessment methods (rating, pedagogical council, self-assessment, etc.), methods of measurement and control (scaling, cross-sections, testing, etc.), as well as the pedagogical experiment and experimental verification of research findings in a mainstream school setting. Both theoretical and empirical methods are usually used in conjunction with mathematical and statistical methods.

Mathematical methods are used to process data obtained by survey and experimental methods, as well as to establish quantitative relationships between the phenomena being studied.

The most common mathematical methods used in pedagogy are:

· registration – identifying the presence of a certain quality in each group member and counting overall how many have or lack this quality (for example, the number of students working actively in class and the number of passive ones);

· ranking (rank scoring) – arranging the collected data in a certain sequence (in descending or ascending order of some indicator) and, accordingly, determining each studied person's place in this series (for example, compiling a list of the most preferred classmates);

· scaling – introducing numerical indicators into the assessment of individual aspects of pedagogical phenomena; for this purpose subjects are asked questions and must choose one of the specified answers (for example, to a question about pursuing some activity in their free time: pursue it with enthusiasm, pursue it regularly, pursue it irregularly, do not pursue it at all).

Statistical methods are used when processing large amounts of material – determining the mean values of the obtained indicators: the arithmetic mean and the median (the indicator of the middle of the series), and calculating the degree of scatter around these values: the variance, the coefficient of variation, etc.
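
A minimal Python sketch of these summaries, applied to invented scores:

from statistics import mean, median, pstdev

scores = [12, 15, 9, 18, 14, 11, 16, 13, 10, 17]

m  = mean(scores)       # arithmetic mean
md = median(scores)     # the middle of the ordered series
sd = pstdev(scores)     # square root of the variance (dispersion)
cv = sd / m * 100       # coefficient of variation, in percent

print(f"mean = {m:.1f}, median = {md:.1f}")
print(f"standard deviation = {sd:.2f}, coefficient of variation = {cv:.1f}%")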

Research validity was defined by Cook and Campbell in 1979 as the best available approximation to the truth of propositions, including propositions about cause-and-effect relationships. This definition concerns establishing the accuracy of research conclusions and emphasizes the relative nature of the truth attainable in the social sciences. In any scientific study the researcher must be able to answer the following questions:

1) is there a relationship between two variables;

2) whether this dependence is causal in nature;

3) is this relationship significant;

4) whether the measurement and observation procedures actually relate to the constructs being studied;

5) whether the causal dependencies identified during the study can be generalized.

Let us highlight the following types of validity related to these issues.

1. Validity of statistical inferences

This type of validity corresponds to testing the statistical significance of the relationship between two variables. Such conclusions are always probabilistic: one can make two kinds of error, deciding that a relationship is significant when it is not, or deciding that there is no significant relationship between the variables when in fact there is one.

There are some factors that can reduce the validity of statistical conclusions:

1) poor sensitivity of the study, which occurs when the sample size is insufficient or when there is large variability within the groups being compared, that is, when the subjects differ greatly from one another on certain variables;

2) low reliability of measurement techniques or variable manipulation procedures used in the study;

3) interference factors present in the experimental conditions;

4) violation of the accepted rules for applying various statistical methods and processing their results.

A strategy for increasing the validity of statistical inferences is to reduce error variance, for example by using a repeated-measures design or homogeneous groups. The statistical validity of a study can be assessed both at the design stage (for example, by checking the sample-size calculation) and after the study, when evaluating its results.
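
As a rough illustration of the sensitivity point, the following Monte Carlo sketch (our own construction, not from the source) estimates how often a two-sample z-test with known variance detects a fixed true effect at alpha = 0.05, for several sample sizes:

import random
from math import erf, sqrt

def power(n, effect, sigma, trials=2000, alpha=0.05):
    """Share of simulated studies in which the group difference is detected."""
    hits = 0
    for _ in range(trials):
        a = [random.gauss(0.0, sigma) for _ in range(n)]
        b = [random.gauss(effect, sigma) for _ in range(n)]
        z = (sum(b) / n - sum(a) / n) / (sigma * sqrt(2 / n))
        p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided p-value
        hits += p < alpha
    return hits / trials

random.seed(1)
for n in (10, 40, 160):
    print(f"n = {n:3d} per group -> estimated power {power(n, 0.5, 1.0):.2f}")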

2. Internal validity

Internal validity is one of the most important types of validity and concerns the relationship between the dependent and independent variables. It is associated with specific procedures that make it possible to determine how trustworthy the conclusions drawn in a given study are. Once the existence of a relationship between variable X and variable Y has been established, it must be decided which of the variables is the cause and which the effect, that is, the direction of the relationship must be determined. If Y is observed after X, then X can be said to be the cause of Y.


However, it may be that the relationship between X and Y is caused by a third variable, C. To establish internal validity, all possibilities of a third variable C influencing X and Y must be considered and ruled out. A study is considered internally valid if it is demonstrated that a cause-and-effect relationship exists between the dependent and independent variables.

Reasons for reducing the internal validity of the study:

1. Confounding of variables. This is one of the greatest dangers to the validity of an experiment. If, during an experiment, some random factor (a non-experimental variable) interacts with the dependent variable and this interaction cannot be measured separately from the interaction of the dependent and independent variables, then the influences of the random and independent variables are indistinguishable. The problem of confounding is particularly acute in studies where the experimenter cannot control the independent variable.

2. Changes associated with the subjects. When measuring dependent variables, changes that occur between two moments of observation may be caused not by the independent variables but by changes in the subjects themselves (for example, personal life events, changes in certain personality traits, etc.), that is, by the “maturation” and “history” factors.

By “maturation” we mean changes that occur in the subject between the pre-test and the post-test and that are not associated with the influence of the independent variables. For example, in experiments on motor coordination, subjects may improve through practice in the period between sessions; this influence must not be confused with the influence of the independent variable. The “history” factor refers to events that happen to the subjects and that influence the results of the experiment.

3. Pre-test influence. The pre-test itself causes changes in the subjects, so in some cases the results of the experiment may depend mainly on the pre-test rather than on the independent variable.

4. Changes in the researcher's skill. For example, after some time a researcher may become more experienced in observation and therefore interpret the subjects' behavior differently. The researcher may also be affected by factors such as fatigue, which can lead to errors in the experiments.

5. Regression to the mean. This phenomenon occurs when subjects are tested repeatedly on the same variable. It has been established that subjects whose first-test results are close to the top of the scale tend to score lower, closer to the average, on retesting, while subjects whose first-test results are close to the bottom tend to score higher on repeated measurement. Regression to the mean is also observed when a variable is measured with error (see the simulation sketch after this list).

6. Dropout. It is known that in the course of a study some subjects leave the group, and those who remain naturally differ from those who dropped out.

Suppose two behavior-modification techniques for controlling body weight are being investigated. Group 1 is prescribed a diet; in addition, its subjects must record in a diary everything they eat each day, weigh all meals accurately, and count the caloric content of their food. Group 2 is simply prescribed a diet. Obviously, some subjects in the group with the more onerous task will drop out of the experiment, so at the end of the experiment the proportion of highly motivated subjects in this group will be greater. Since subjects with higher motivation are more likely to lose weight, the researcher may come to the erroneous conclusion that the conditions in the first group are more effective for weight loss.
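
Returning to regression to the mean (point 5 above), here is the promised simulation sketch (our own construction, with invented parameters): observed scores are true levels plus measurement error, and subjects selected for extreme first-test scores land closer to the mean on an independent retest.

import random
from statistics import mean

random.seed(0)
true_level = [random.gauss(100, 10) for _ in range(10_000)]
test1 = [t + random.gauss(0, 10) for t in true_level]   # first measurement
test2 = [t + random.gauss(0, 10) for t in true_level]   # independent retest

top = [i for i, s in enumerate(test1) if s > 120]       # extreme high scorers
print(f"top group, test 1: {mean(test1[i] for i in top):.1f}")
print(f"top group, test 2: {mean(test2[i] for i in top):.1f}")  # noticeably closer to 100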

Some authors also talk about construct validity. Construct validity is similar to internal validity and refers to the consistency between the findings and the theory that underlies the study. In order to assess construct validity, it is necessary to rule out other possible theoretical explanations for the results. If there is any doubt about how experimental results compare with theoretical results, it is necessary to design a new experiment that allows one to choose from several theoretical explanations for the results. This type of validity is the most difficult to obtain because there are numerous theories that can be used to explain the relationships of variables obtained in an experiment.

Let us consider two reasons for reduced construct validity. The first is a weak connection between theory and experiment: indeed, many psychological studies give unclear operational definitions of theoretical concepts. The second is that subjects very often begin to play the role of a “good” research object and behave so as to please the experimenter, and also that subjects, especially in experiments measuring mental abilities or emotional stability, develop high anxiety about the expected assessment.

3. Validity of procedures

The third type of validity is the validity of procedures that allow variables to be varied and measured. Even the need to define in operational terms the conceptual variables relevant to the study is a source of risk. Indeed, “translating” the concept to the level of specific operations may inadequately reflect the theoretical principles of the study.

Often the researcher unconsciously prompts the answer he expects to receive. This can be avoided by using unobtrusive research strategies and appropriate measurement methods, in which the subjects do not know that they are being observed; this removes unwanted motivation with respect to the experiment.

4. External validity

External validity refers to the ability to generalize the results of a study, that is, to extend the conclusions obtained from an experimental sample to the entire population. External validity depends significantly on the sampling method. There are three main types of sampling:

1. Random sampling. For example, the results of a study of a randomly selected group of adolescents will be valid with some degree of probability for all Italian adolescents. However, such a study can be very complex and expensive, since the sample must be large and homogeneous.

2. Heterogeneous (diverse) sample. In accordance with the objectives of the study, the various population groups for which the results are expected to hold are identified. A random sample is then analyzed to make sure it contains a sufficient number of representatives of each group.

3. Typical-case sample. For example, a definition of the average young Italian is formulated, and the study uses a sample of individuals who fit this definition. If an experiment on, say, negotiating ability is conducted with university students, one cannot expect the findings to be applicable to heads of state.
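
As a toy illustration of the first two schemes (the sampling frame and the group labels are invented), one might draw a simple random sample and then check that each group of interest is sufficiently represented in it:

import random
from collections import Counter

random.seed(42)
# Hypothetical sampling frame: (person id, population group).
frame = [(i, random.choice(["urban", "rural", "student"])) for i in range(100_000)]

sample = random.sample(frame, 1_000)          # 1. simple random sample
by_group = Counter(group for _, group in sample)

MIN_PER_GROUP = 200                           # 2. heterogeneity check
for group, count in sorted(by_group.items()):
    status = "ok" if count >= MIN_PER_GROUP else "undersampled"
    print(f"{group}: {count} respondents ({status})")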

External validity is also reduced by discrepancies between phenomena observed in the laboratory and phenomena observed in natural settings. It is difficult to determine whether the identified dependence occurs only in the laboratory or whether it is also observed outside the laboratory. External validity is ensured by repeated experimentation in heterogeneous conditions.

It is necessary to decide which type of validity is central to a given study. Indeed, procedures used to enhance one type of validity may reduce other types of validity.

For example, to increase the validity of statistical inferences, a researcher should use subjects that are as homogeneous as possible, thereby reducing error variance; external validity, however, decreases as a result.

The type of priority validity depends on the type of research being conducted. For example, if an experimental study establishes a cause-and-effect relationship between variables, then internal validity is essential. In contrast, when calculating correlations between variables, it is impossible to establish the direction of cause-and-effect relationships, so in this case, internal validity is not of interest compared to other types of validity.

Related to the concept of validity is the concept of control. Control refers to any means used to eliminate factors that could reduce the validity of a study. In practice, the researcher examines which factors may reduce the study's validity and which methods can neutralize them.

There are six main control methods.

1. One of the most commonly used control methods is to conduct the experiment with a group of subjects who are not exposed to the variable under study and to compare them with subjects who are exposed to it. For example, two groups are studied with respect to an independent variable: group 1 receives the intervention and is called the experimental group; group 2 receives no intervention and is called the control group. The results of the experimental group are compared with those of the control group. If the two groups were identical before the experimental intervention, any difference between them observed after the experiment can be attributed to that intervention.
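
A minimal sketch of this comparison (the outcome data are invented; scipy's independent-samples t-test serves as the significance test, which the text itself does not prescribe):

from scipy import stats

experimental = [14, 16, 13, 17, 15, 18, 16, 14]   # received the intervention
control      = [12, 11, 13, 12, 10, 13, 11, 12]   # no intervention

res = stats.ttest_ind(experimental, control)
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.4f}")
if res.pvalue < 0.05:
    print("The groups differ; with equivalent groups, the difference is attributed to the intervention.")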