Basics of test theory. Characteristics of control testing in physical education

A measurement or test performed to determine the condition or ability of an athlete is called a test. Not all measurements can be used as tests, but only those that meet special requirements: standardization, the presence of a rating system, reliability, information content, and objectivity. Tests that meet the requirements of reliability, information content, and objectivity are called sound tests.

The testing process is called testing, and the resulting numerical values are the test results.

Tests based on motor tasks are called motor (motoric) tests. Depending on the task facing the subject, three groups of motor tests are distinguished.

Types of motor tests

| Test name | Task for the athlete | Test result | Example |
|---|---|---|---|
| Control exercise | Show a maximum motor result | Motor achievements | 1500 m run time |
| Standard functional tests | The same for everyone, dosed either (1) by the amount of work performed or (2) by the magnitude of physiological changes | Physiological or biochemical indicators during standard work; motor indicators at a standard magnitude of physiological changes | Heart rate registration during standard work at 1000 kgm/min; running speed at a heart rate of 160 beats/min |
| Maximum functional tests | Show maximum result | Physiological or biochemical indicators | Determination of maximum oxygen debt or maximal oxygen consumption (VO2max) |

Sometimes not one test but several tests with a common final goal are used. Such a group of tests is called a test battery.

It is known that even with the most stringent standardization and precise equipment, test results always vary somewhat. Therefore, one of the important conditions for selecting good tests is their reliability.

The reliability of a test is the degree of agreement between results when the same people are repeatedly tested under the same conditions. There are four main reasons for intra-individual (or intra-group) variation in test results:

    change in the condition of the subjects (fatigue, change in motivation, etc.);

    uncontrolled changes in external conditions and equipment;

    change in the state of the person conducting or evaluating the test (well-being, change of experimenter, etc.);

    imperfection of the test itself (some tests are inherently imperfect and unreliable, for example, free throws into a basketball basket until the first miss).

The reliability criterion of a test can be the reliability coefficient, calculated as the ratio of the true variance to the variance recorded in the experiment: $r = \sigma^2_{\text{true}} / \sigma^2_{\text{recorded}}$, where the true variance is understood as the variance that would be obtained from an infinitely large number of observations under the same conditions, and the recorded variance is derived from the experimental data. In other words, the reliability coefficient is simply the proportion of true variation in the variation recorded in the experiment.
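A minimal simulation sketch (Python with NumPy; all numbers are invented) of how this proportion can be estimated in practice from a test-retest design:

```python
# Sketch: the reliability coefficient as the share of true variance in the
# recorded variance, estimated via the test-retest correlation.
import numpy as np

rng = np.random.default_rng(0)

n_athletes = 200
true_sd, error_sd = 5.0, 2.0
true_scores = rng.normal(50.0, true_sd, n_athletes)   # stable "true" results

# Two administrations of the same test under the same conditions
trial_1 = true_scores + rng.normal(0.0, error_sd, n_athletes)
trial_2 = true_scores + rng.normal(0.0, error_sd, n_athletes)

r_theory = true_sd**2 / (true_sd**2 + error_sd**2)    # true / recorded variance
r_retest = np.corrcoef(trial_1, trial_2)[0, 1]        # empirical estimate

print(f"theoretical r = {r_theory:.3f}, test-retest r = {r_retest:.3f}")
```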

In addition to this coefficient, the reliability index is also used, defined as the theoretical correlation between the recorded and true values of the same test. This is the most common criterion for assessing the quality (reliability) of a test.

One of the characteristics of test reliability is equivalence, which reflects the degree of agreement between the results of testing the same quality (for example, a physical quality) with different tests. Whether equivalence is desirable depends on the specific task. On the one hand, if two or more tests are equivalent, their combined use increases the reliability of the estimates; on the other hand, a single equivalent test may suffice, which simplifies testing.

If all tests included in a battery are highly equivalent, the battery is called homogeneous (for example, in assessing jumping ability, the long jump, high jump, and triple jump can be assumed to be homogeneous). Conversely, if the complex contains no equivalent tests (as in assessing general physical fitness), then all the tests in it measure different properties and the complex is essentially heterogeneous.

The reliability of tests can be increased to a certain extent by:

    more stringent standardization of testing;

    increasing the number of attempts;

    increasing the number of evaluators and increasing the consistency of their opinions;

    increasing the number of equivalent tests;

    better motivation of subjects.

Test objectivity is a special case of reliability: the independence of test results from the person conducting the test.

The information content (informativeness) of a test is the degree of accuracy with which it measures the property (the quality of the athlete) it is used to evaluate. In different cases, the same test may have different information content. The question of a test's informativeness breaks down into two specific questions:

What does this test measure? How accurately does it measure it?

For example, can an indicator such as maximal oxygen consumption (VO2max) be used to judge the preparedness of long-distance runners, and if so, with what degree of accuracy? Can this test be used in the monitoring process?

If the test is used to determine the condition of the athlete at the time of examination, one speaks of the diagnostic information content of the test. If the test results are used to draw conclusions about the athlete's possible future performance, one speaks of prognostic information content. A test can be diagnostically informative but not prognostically, and vice versa.

The degree of information content can be characterized quantitatively, on the basis of experimental data (so-called empirical information content), and qualitatively, on the basis of a meaningful analysis of the situation (logical information content). In practical work, logical (meaningful) analysis should always precede mathematical analysis. The indicator of a test's informativeness is the correlation coefficient between the criterion and the test result (the criterion being an indicator that demonstrably reflects the property to be measured by the test).
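A sketch of how empirical informativeness can be computed, assuming invented data for six runners (the criterion here is 1500 m time; a higher VO2max should go with a faster time):

```python
# Sketch: empirical informativeness as the correlation between test result
# and criterion; all figures are invented for illustration.
import numpy as np

vo2max = np.array([52.1, 55.3, 58.7, 61.2, 63.8, 66.4])            # test: ml/kg/min
time_1500m = np.array([251.0, 246.0, 241.0, 236.0, 232.0, 227.0])  # criterion: s

r = np.corrcoef(vo2max, time_1500m)[0, 1]
print(f"informativeness r = {r:.2f}")  # close to -1: better test result, faster time
```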

In cases where the information content of a single test is insufficient, a battery of tests is used. However, even when each test has a high information content (judging by the correlation coefficients), a battery does not by itself yield a single number. Here a more complex method of mathematical statistics comes to the rescue: factor analysis. It allows one to determine how many and which tests load on a common factor and what the degree of their contribution to each factor is. It is then easy to select the tests (or combinations of them) that most accurately assess individual factors.

1 What is a test called?

2 What is testing?

    Quantifying a quality or condition of an athlete
    A measurement or test conducted to determine the condition or ability of an athlete
    A testing process that quantitatively evaluates a quality or condition of an athlete
    No definition needed

3 What is the test result called?

    Quantifying a quality or condition of an athlete
    A measurement or test conducted to determine the condition or ability of an athlete
    A testing process that quantitatively evaluates a quality or condition of an athlete
    No definition needed

4 What type of test is the 100 m run?

5 What type of test is hand dynamometry?

    Control exercise
    Functional test
    Maximum functional test

6 What type of test is the maximal oxygen consumption (VO2max) test?

    Control exercise
    Functional test
    Maximum functional test

7 What type of test is a three-minute run with a metronome?

    Control exercise
    Functional test
    Maximum functional test

8 What type of test is the maximum number of pull-ups on the bar?

    Control exercise
    Functional test
    Maximum functional test

9 In what cases is a test considered informative?

10 When is a test considered reliable?

    The ability of the test to be reproducible when tested again
    The ability of the test to measure the athlete quality of interest
    The independence of the test results from the person administering the test

11 In what case is the test considered objective?

    The ability of the test to be reproducible when tested again
    The ability of the test to measure the athlete quality of interest
    The independence of the test results from the person administering the test

12 What criterion is necessary when evaluating a test for information content?

13 What criterion is needed when evaluating test reliability?

    Student's t-test
    Fisher's F-test
    Correlation coefficient
    Coefficient of determination
    Variance

14 What criterion is needed when evaluating test objectivity?

    Student's t-test
    Fisher's F-test
    Correlation coefficient
    Coefficient of determination
    Variance

15 What is the information content of a test called if it is used to assess the degree of fitness of an athlete?

16 What type of information content of control exercises guides a coach when selecting children for his sports section?

    Logical
    Predictive
    Empirical
    Diagnostic

17 Is correlation analysis necessary to assess the information content of tests?

18 Is factor analysis necessary to assess the information content of tests?

19 Is it possible to assess the reliability of a test using correlation analysis?

20 Is it possible to assess the objectivity of a test using correlation analysis?

21 Will tests designed to assess general physical fitness be equivalent?

22 When measuring the same quality with different tests, tests are used...

    Designed to measure the same quality
    Having a high correlation with each other
    Having a low correlation with each other

FUNDAMENTALS OF EVALUATION THEORY

To evaluate sports results, special scoring tables are often used. The purpose of such tables is to convert the sports result shown (expressed in objective measures) into conditional points. The rule for converting sports results into points is called a rating scale. A scale can be specified as a mathematical expression, a table, or a graph. There are four main types of scales used in sports and physical education:

Proportional scales

Regressive scales

Progressive scales

Sigmoid scales

Proportional scales award the same number of points for an equal increase in results (for example, 20 points are awarded for every 0.1 s of improvement in the 100 m run). Such scales are used in modern pentathlon, speed skating, ski racing, Nordic combined, biathlon, and other sports.

Regressive scales award progressively fewer points for the same increase in results as achievements rise (for example, an improvement in the 100 m run from 15.0 to 14.9 s adds 20 points, while 0.1 s in the range 10.0-9.9 s adds only 15 points).

Progressive scales work the other way: the higher the athletic result, the greater the increase in points for improving it (for example, improving the running time from 15.0 to 14.9 s adds 10 points, while from 10.0 to 9.9 s adds 100 points). Progressive scales are used in swimming, certain athletics events, and weightlifting.

Sigmoid scales are rarely used in sports but are widely used in assessing physical fitness (for example, this is what the scale of physical fitness standards for the US population looks like). In these scales, improvements in the zones of very low and very high achievements are rewarded sparingly; improvements in the middle achievement zone bring the most points.
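The four scale types can be pictured as point-awarding functions. A sketch in Python (the coefficients are arbitrary; x is the result normalized to 0..1, higher is better):

```python
# Sketch of the four rating-scale types as functions from result to points.
import math

def proportional(x):   # equal points for equal improvement
    return 1000 * x

def regressive(x):     # improvement near the top earns less
    return 1000 * math.sqrt(x)

def progressive(x):    # improvement near the top earns more
    return 1000 * x ** 2

def sigmoid(x):        # the middle zone earns the most per unit of improvement
    return 1000 / (1 + math.exp(-10 * (x - 0.5)))

for x in (0.2, 0.5, 0.8):
    print(f"x={x}: prop={proportional(x):.0f}  regr={regressive(x):.0f}  "
          f"prog={progressive(x):.0f}  sigm={sigmoid(x):.0f}")
```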

The main objectives of assessment are:

    compare different achievements in the same task;

    compare achievements in different tasks;

    establish norms.

In sports metrology, a norm is the boundary value of a result that serves as the basis for assigning an athlete to one of the classification groups. There are three types of norms: comparative, individual, and due.

Comparative norms are based on comparing people belonging to the same population. For example, dividing people into subgroups according to their degree of resistance to hypoxia (high, medium, low) or reactivity (hyperreactive, normoreactive, hyporeactive).

Different gradations of assessments and norms

| Verbal norm | Range (in units of the mean M and standard deviation σ) |
|---|---|
| Very low | Below M - 2σ |
| Low | From M - 2σ to M - 1σ |
| Below average | From M - 1σ to M - 0.5σ |
| Average | From M - 0.5σ to M + 0.5σ |
| Above average | From M + 0.5σ to M + 1σ |
| High | From M + 1σ to M + 2σ |
| Very high | Above M + 2σ |

These norms characterize only the comparative success of subjects in a given population but say nothing about the population as a whole (or on average). Therefore, comparative norms should be compared with data obtained from other populations and used in combination with individual and due norms.
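A sketch of assigning a verbal comparative norm from the group mean M and standard deviation σ, following the table above (the handling of range boundaries is a choice of convention; the example numbers are invented):

```python
# Sketch: verbal comparative norm from mean M and standard deviation s.
def verbal_norm(result: float, M: float, s: float) -> str:
    z = (result - M) / s
    if z < -2.0:   return "very low"
    if z < -1.0:   return "low"
    if z < -0.5:   return "below average"
    if z <= 0.5:   return "average"
    if z <= 1.0:   return "above average"
    if z <= 2.0:   return "high"
    return "very high"

# Invented example: a jump of 2.45 m in a group with M = 2.20 m, s = 0.15 m
print(verbal_norm(2.45, M=2.20, s=0.15))  # "high" (z is about 1.67)
```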

Individual norms are based on comparing the performance of the same athlete under different conditions. For example, in many sports there is no general relationship between body weight and athletic performance: each athlete has an individually optimal weight corresponding to his or her state of fitness. This norm can be monitored at different stages of sports training.

Due norms are based on an analysis of what a person must be able to do in order to cope successfully with the tasks that life sets. Examples include the standards of physical training complexes and the due values of vital capacity, basal metabolic rate, body weight, height, etc.

1 Is it possible to directly measure the quality of endurance?

2 Is it possible to directly measure the quality of speed?

3 Is it possible to directly measure the quality of dexterity?

4 Is it possible to directly measure the quality of flexibility?

5 Is it possible to directly measure the strength of individual muscles?

6 Can the assessment be expressed in a qualitative characteristic (good, satisfactory, bad, pass, etc.)?

7 Is there a difference between a measurement scale and a rating scale?

8 What is a rating scale?

    A system for measuring sports results
    The rule for converting sports results into points
    A system for evaluating norms

9 The scale assumes the awarding of the same number of points for an equal increase in results. This …

10 For the same increase in results, fewer and fewer points are awarded as sporting achievements increase. This …

    Progressive scale
    Regressive scale
    Proportional scale
    Sigmoid scale

11 The higher the sports result, the greater the increase in points awarded for improving it. This is ...

    Progressive scale
    Regressive scale
    Proportional scale
    Sigmoid scale

12 Improvement in the zones of very low and very high achievements is rewarded sparingly, while improvement in the middle achievement zone brings the most points. This is ...

    Progressive scale
    Regressive scale
    Proportional scale
    Sigmoid scale

13 Norms based on the comparison of people belonging to the same population are called...

14 Norms based on comparing the performance of the same athlete in different conditions are called ...

    Individual norms
    Due norms
    Comparative norms

15 Norms based on an analysis of what a person should be able to do in order to cope with the tasks assigned to him are called ...

    Individual norms
    Due norms
    Comparative norms

BASIC CONCEPTS OF QUALIMETRY

Qualimetry (from the Latin qualitas, quality, and the Greek metron, measure) studies and develops quantitative methods for assessing qualitative characteristics.

Qualimetry is based on several starting points:

Any quality can be measured;

Quality depends on a number of properties that form the “quality tree” (for example, the quality tree of exercise performance in figure skating consists of three levels - highest, middle, lowest);

Each property is determined by two numbers: a relative indicator and a weight; the sum of the property weights at each level equals one (or 100%).
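A minimal sketch (Python; all weights and scores are invented) of aggregating one level of such a quality tree as a weighted sum:

```python
# Sketch: aggregating one level of a "quality tree" as a weighted sum.
properties = {
    "difficulty": (0.5, 8.5),   # (weight, relative indicator)
    "variety":    (0.3, 9.0),
    "execution":  (0.2, 7.5),
}

# The weights at one level must sum to one (or 100%)
assert abs(sum(w for w, _ in properties.values()) - 1.0) < 1e-9

quality = sum(w * score for w, score in properties.values())
print(f"aggregate quality = {quality:.2f}")   # 8.45
```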

Methodological techniques of qualimetry are divided into two groups:

Heuristic (intuitive), based on expert assessments and questionnaires;

Instrumental.

An expert assessment is one obtained by eliciting the opinions of experts. Typical examples of expert evaluation: judging in gymnastics and figure skating, competitions for the best scientific work, etc.

Carrying out an expert evaluation includes the following main stages: formulating its purpose, selecting experts, choosing a methodology, conducting the survey, and processing the information received, including assessing the consistency of individual expert assessments. Of great importance is the degree of consistency of expert opinions, assessed (in the case of several experts) by the rank correlation coefficient. Note that rank correlation underlies the solution of many qualimetry problems, since it permits mathematical operations on qualitative characteristics.
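For two experts, the degree of agreement can be sketched with Spearman's rank correlation (Python with SciPy assumed available; the rankings are invented):

```python
# Sketch: consistency of two experts' rankings via Spearman's rank correlation.
from scipy.stats import spearmanr

expert_a = [1, 2, 3, 4, 5, 6]   # ranks assigned to six athletes
expert_b = [2, 1, 3, 5, 4, 6]

rho, p_value = spearmanr(expert_a, expert_b)
print(f"rank correlation rho = {rho:.2f} (p = {p_value:.3f})")
```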

In practice, an indicator of an expert's qualifications is often the deviation of his ratings from the average ratings of a group of experts.

A questionnaire survey is a method of collecting opinions by having people fill out questionnaires. Questionnaires, along with interviews and conversations, are survey methods. Unlike interviews and conversations, a questionnaire survey involves written responses by the person filling out the questionnaire (the respondent) to a system of standardized questions. It allows the study of motives of behavior, intentions, opinions, etc.

Questionnaires can solve many practical problems in sports: assessing an athlete's psychological status; his attitude to the nature and direction of training sessions; interpersonal relationships in the team; self-assessment of technical and tactical readiness; dietary assessment; and many others.

1 What does qualimetry study?

    Studying the quality of tests
    Studying the qualitative properties of a trait
    Studying and developing quantitative methods for assessing quality

2 What mathematical methods are used in qualimetry?

    Pair correlation
    Rank correlation
    Analysis of variance

3 What methods are used to assess the level of performance?

4 What methods are used to evaluate the diversity of technical elements?

    Questionnaire method
    Expert assessment method
    Method not specified

5 What methods are used to assess the complexity of technical elements?

    Questionnaire method
    Expert assessment method
    Method not specified

6 What methods are used to assess the psychological state of an athlete?

    Questionnaire method
    Expert assessment method
    Method not specified

The first component, test theory, contains a description of statistical models for processing diagnostic data: models for analyzing answers to test items and models for calculating total test scores. Mellenberg (1980, 1990) called this "psychometrics." Classical test theory, modern test theory (item response theory, IRT), and the theory of item sampling constitute the three most important types of test theory models. The first two are the subject of psychodiagnostics.

Classical test theory. Most intelligence and personality tests have been developed on the basis of this theory. Its central concept is "reliability," which refers to the consistency of results across repeated assessments. In reference books this concept is usually presented very briefly, followed by a detailed account of the apparatus of mathematical statistics; in this introductory chapter we give a concise description of its basic meaning. In classical test theory, reliability refers to the repeatability of the results of measurement procedures (mainly measurements using tests), and the concept involves the calculation of measurement error. The score obtained during testing can be represented as the sum of the true score and the measurement error:

$$X_i = T_i + E_i$$

where $X_i$ is the observed score, $T_i$ the true score, and $E_i$ the measurement error.

The observed score is, as a rule, the number of correct answers to the test items. The true score can be thought of as a true value in the Platonic sense (Gulliksen, 1950). Also widespread is the concept of expected scores: the score that would be obtained from a large number of repetitions of the measurement procedure (Lord & Novick, 1968). But administering the same procedure to the same person an unlimited number of times is not possible, so other ways of solving the problem must be sought (Wittmann, 1988).

This concept makes certain assumptions about true scores and measurement errors. The errors are taken to be independent of each other, which is a reasonable assumption, since random fluctuations in results do not produce covariance: $r_{E_1E_2} = 0$.

It is further assumed that there is no correlation between true scores and measurement errors: $r_{TE} = 0$.


The mean error is zero, because the arithmetic mean is taken as the true score: $\bar{E} = 0$.

These assumptions lead to the well-known definition of reliability as the ratio of true variance to total variance, or, equivalently, one minus the ratio of error variance to total variance:

$$r_{XX'} = \frac{S^2(T)}{S^2(X)} = 1 - \frac{S^2(E)}{S^2(X)}$$

From this definition of reliability it follows that the error variance $S^2(E)$ equals the total variance multiplied by $(1 - r_{XX'})$; the standard error of measurement is therefore

$$S_E = S_X \sqrt{1 - r_{XX'}}$$
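A one-line numeric sketch of this formula (Python; the numbers are invented):

```python
# Sketch: standard error of measurement, S_E = S_X * sqrt(1 - r_xx).
import math

S_X = 10.0    # SD of observed test scores (invented)
r_xx = 0.84   # reliability coefficient (invented)

S_E = S_X * math.sqrt(1 - r_xx)
print(f"S_E = {S_E:.1f} score points")   # 4.0
```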

After this theoretical justification of reliability and its derivatives, the reliability index of a particular test must be determined. There are practical procedures for assessing test reliability: interchangeable forms (parallel tests), splitting the items into two halves, retesting, and measuring internal consistency. Every reference book gives the consistency index for retest results:

$$r_{XX'} = r(x_1, x_2)$$

where $r_{XX'}$ is the stability coefficient, and $x_1$ and $x_2$ are the results of the two measurements.

The concept of reliability of interchangeable forms was introduced and developed by Gulliksen (1950). The procedure is quite labor-intensive, since it requires creating a parallel series of items:

$$r_{XX'} = r(x_1, x_2)$$

where $r_{XX'}$ is the equivalence coefficient, and $x_1$ and $x_2$ are two parallel tests.

The next procedure, splitting the test into two parts A and B, is easier to use. The scores obtained on the two halves are correlated, and the reliability of the whole test is estimated using the Spearman-Brown formula:

$$r_{XX'} = \frac{2 r_{AB}}{1 + r_{AB}},$$

where A and B are the two parallel halves of the test.
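A sketch of the split-half procedure (Python with NumPy; the item scores are invented, and the odd/even split is one common convention):

```python
# Sketch: split-half reliability with the Spearman-Brown correction.
import numpy as np

scores = np.array([            # rows: test takers, columns: 6 items (1 = correct)
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
])

half_a = scores[:, ::2].sum(axis=1)   # odd-numbered items
half_b = scores[:, 1::2].sum(axis=1)  # even-numbered items

r_ab = np.corrcoef(half_a, half_b)[0, 1]
r_full = 2 * r_ab / (1 + r_ab)        # Spearman-Brown: whole-test reliability
print(f"r_AB = {r_ab:.2f}, corrected r = {r_full:.2f}")
```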

The next method is to determine the internal consistency of the test items. It is based on the covariances of individual items: $S_g$ denotes the variance of a randomly selected item, and $S_{gh}$ the covariance of two randomly selected items. The coefficient most commonly used to determine internal consistency is Cronbach's alpha; the KR-20 formula and λ-2 (lambda-2) are also used.
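A sketch of Cronbach's alpha computed from an item-score matrix (Python with NumPy; the data are invented):

```python
# Sketch: Cronbach's alpha from item scores (rows: test takers, cols: items).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

items = np.array([
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [3, 4, 4, 3],
    [1, 2, 2, 2],
    [5, 4, 5, 5],
])
print(f"alpha = {cronbach_alpha(items):.2f}")
```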

The classical concept of reliability defines the measurement errors that arise both in testing and in observation. The sources of these errors vary: personal characteristics, features of the testing conditions, and the test items themselves. We know that our observations may turn out to be erroneous and our methodological tools imperfect, just as people themselves are imperfect (how not to recall Shakespeare: "Untrustworthy are you, whose name is man"). The fact that classical test theory makes measurement errors explicit and explains them is an important positive feature.

Classical test theory has a number of significant features that can also be considered as its disadvantages. Some of these characteristics are noted in reference books, but their importance (from an everyday point of view) is not often emphasized, nor is it noted that from a theoretical or methodological point of view they should be considered shortcomings.

First. Classical test theory and the concept of reliability are focused on calculating total test scores, which are the result of adding up the scores obtained in individual tasks. Yes, when working


Second. The reliability coefficient involves assessing the variance of the measured indicators. It follows that, other things being equal, the reliability coefficient will be lower in a more homogeneous sample. There is no single coefficient of internal consistency of test items: this coefficient is always "contextual." Crocker and Algina (1986), for example, propose a special "homogeneous sample correction" formula designed for the highest and lowest scores obtained by test takers. The diagnostician must know the variation characteristics of the sample population; otherwise the internal consistency coefficients given in the test manual cannot be used.

Third. The phenomenon of regression to the mean is a logical consequence of the classical concept of reliability. If a test score fluctuates (i.e., is not sufficiently reliable), then on repetition subjects with low scores may receive higher ones, and conversely, subjects with high scores may score lower. This artifact of the measurement procedure must not be mistaken for true change or a manifestation of developmental processes. But differentiating between them is not easy, since the possibility of change during development can never be ruled out. To be completely sure, a comparison with a control group is necessary.

The fourth characteristic of tests developed in accordance with the principles of classical theory is the presence of normative data. Knowledge of the test norms allows the researcher to interpret test takers' results adequately; outside the norms, test scores are meaningless. Developing test norms is a fairly expensive undertaking, since the psychologist must obtain the test results of a representative sample.


Regarding the shortcomings of the classical concept of reliability, it is appropriate to cite Sijtsma (1992, pp. 123-125). He notes that the first and main assumption of classical test theory is that test scores obey the interval principle, yet there are no studies supporting this assumption; in essence, it is "measurement according to an arbitrarily established rule." This feature puts classical test theory at a disadvantage compared with attitude measurement scales and, of course, compared with modern test theory. Many methods of data analysis (analysis of variance, regression analysis, correlation and factor analysis) are based on the assumption of an interval scale, but that assumption has no solid foundation: that the scale of true scores is a scale of values of a psychological characteristic (for example, arithmetic ability, intelligence, neuroticism) can only be assumed.

The second remark concerns the fact that test results are not absolute indicators of some psychological characteristic of the person tested; they should be considered only as the results of a particular test. Two tests may claim to examine the same psychological characteristic (e.g., intelligence, verbal ability, extraversion), but this does not mean that the two tests are equivalent or have the same capabilities. Comparing the results of two people tested with different tests is incorrect, and the same applies to one test taker completing two different tests. The third point concerns the assumption that the standard error of measurement is the same at every level of the ability being measured. There is, however, no empirical test of this assumption. For example, there is no guarantee that a test taker with good mathematical abilities will receive a high score on a relatively simple arithmetic test; in such a case, a person with low or average abilities is more likely to receive a high score.

Within the framework of modern test theory, or item response theory, a large number of models of respondents' possible answers to test items have been described. These models differ in their underlying assumptions and in the requirements they impose on the data. The Rasch model is often treated as synonymous with item response theory (IRT); in fact, it is only one of the models. The formula describing the characteristic curve of item g is:

$$P_g(\theta) = \frac{\exp(\theta - \delta_g)}{1 + \exp(\theta - \delta_g)},$$

where g is an individual test item; exp is the exponential function (a nonlinear dependence); θ ("theta") is the latent ability of the test taker; and δ ("delta") is the difficulty level of the item.

Other test items, e.g. h, have their own characteristic curves. If the condition δ_h > δ_g holds, then h is a more difficult item; therefore, for any value of the ability θ, the probability of successfully completing item h is lower. This model is called strict because, with a low degree of the trait, the probability of completing the item is close to zero: there is no room for guessing in this model, and for multiple-choice items no assumptions about the likelihood of success by guessing are made. In addition, the model is strict in the sense that all test items must have the same discriminative ability (high discriminativeness is reflected in the steepness of the curve; in the limiting case one obtains the Guttman scale, in which at some point of the characteristic curve the probability of completing the item jumps from 0 to 1). Because of this condition, not all items can be included in tests based on the Rasch model.
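A small numerical sketch (Python; the ability and difficulty values are invented) of the characteristic curves of two items g and h with δ_h > δ_g:

```python
# Sketch: Rasch item characteristic curves for two items g and h.
# With delta_h > delta_g, item h is harder at every ability level theta.
import math

def rasch_p(theta: float, delta: float) -> float:
    return math.exp(theta - delta) / (1 + math.exp(theta - delta))

delta_g, delta_h = -0.5, 1.0
for theta in (-2.0, 0.0, 2.0):
    print(f"theta={theta:+.1f}: P_g={rasch_p(theta, delta_g):.2f}  "
          f"P_h={rasch_p(theta, delta_h):.2f}")
```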

There are several variants of this model (e.g., Birnbaum, 1968; see Lord & Novick) that allow items with different discriminative ability.

The Dutch researcher Mokken (1971) developed two models for analyzing item responses that are less stringent than the Rasch model and therefore perhaps more realistic. As a basic condition, Mokken puts forward the proposition that the characteristic curve of an item must be monotone, without breaks, and that all items are aimed at studying the same psychological characteristic that is to be measured. Any form of this dependence is allowed as long as it is not interrupted, so the shape of the characteristic curve is not determined by any specific function. This "freedom" allows more test items to be used, although the level of measurement is no higher than ordinal.

The methodology of item response theory (IRT) differs from that of most experimental and correlational research. The mathematical model is designed to study behavioral, cognitive, and emotional characteristics, as well as developmental phenomena. The phenomena in question are often limited to item responses, which led Mellenberg (1990) to call IRT a "mini-behavior theory." The results of a study can, to a certain extent, be represented as characteristic curves, especially where theoretical understanding of the characteristics under study is lacking. Until now we have had at our disposal only a few intelligence, aptitude, and personality tests created on the basis of the numerous IRT models. Variants of the Rasch model are more often used in developing achievement tests (Verhelst, 1993), while Mokken models are better suited to developmental phenomena (see also Chapter 6).

The test taker's response to a test item is the basic unit of IRT models. The type of response is determined by the degree to which the characteristic under study is expressed in the person. Such a characteristic could be, for example, arithmetic or spatial ability; in most cases it is some aspect of intelligence, achievement, or personality. It is assumed that there is a nonlinear relationship between a person's position in the range of the characteristic under study and the probability of successfully completing a particular item. This nonlinearity is in a certain sense intuitive. The familiar sayings "Every beginning is difficult" (a slow, nonlinear start) and "Becoming a saint is not so easy" mean that further improvement after reaching a certain level is difficult: the curve approaches a 100% success rate slowly and almost never reaches it.

Some models rather contradict our intuitive understanding. Take this example: a person with a trait value of 1.5 has a 60 percent probability of successfully completing a task. At first glance this is counterintuitive, because one either copes with a task or does not. But suppose a person makes 100 attempts to clear a height of 1.50 m and succeeds 60 times: his success rate is 60 percent.

To assess the degree of a characteristic, at least two tasks are required. The Rasch model involves determining the degree of the characteristic regardless of the difficulty of the tasks. This also runs counter to intuition: suppose a person has an 80% probability of jumping higher than 1.30 m; then, according to the item characteristic curves, he has a 60% probability of clearing 1.50 m and a 40% probability of clearing 1.70 m. Thus, regardless of the value of the independent variable (height), it is possible to estimate a person's ability to jump high.

There are about 50 IRT models (Goldstein & Wood, 1989). Many nonlinear functions describe (explain) the probability of success in completing a task or group of tasks. The requirements and limitations of these models differ, as a comparison of the Rasch model and the Mokken scale shows. The requirements of these models include:

1) the need to determine the characteristic under study and assess the person’s position within the range of this trait;

2) assessment of the sequence of tasks;

3) checking specific models. In psychometrics, many procedures have been developed to test the model.

Some reference books discuss IRT as a form of test item analysis (see, for example, Crocker & Algina, 1986). One could, however, argue that IRT is a "mini-theory about mini-behavior." Proponents of IRT note that if even intermediate-level concepts (models) are imperfect, what can be said about the more complex constructs in psychology?

Classical and modern test theories. People cannot help comparing things that look almost the same. (Perhaps the everyday equivalent of psychometrics consists mainly of comparing people on significant characteristics and choosing between them.) Each of the theories presented, the theory of measurement errors and the mathematical modeling of test responses, has its supporters (Goldstein & Wood, 1986).

IRT models have not been accused of being "assessment according to an arbitrarily established rule," as classical test theory has. The IRT model is focused on analyzing the characteristics being assessed: person characteristics and item characteristics are assessed on scales (ordinal or interval), and it becomes possible to compare the results of different tests aimed at similar characteristics. Finally, reliability is not the same at every point of the scale: average scores are generally more reliable than scores at the extremes. Thus IRT models appear theoretically superior. There are also differences in the practical use of modern and classical test theory (Sijtsma, 1992, pp. 127-130). Modern test theory is more complex than the classical one and is therefore less often used by non-specialists. Moreover, IRT imposes specific requirements on items: items that do not meet the requirements of the model must be excluded from the test. This applies even to items that belonged to widely used tests built on the principles of classical theory. The test becomes shorter, and its reliability therefore decreases.

IRT provides mathematical models for studying real-world phenomena. Models should help us understand their key aspects. Here, however, lies the main theoretical question. A model can be considered an approach to studying the complex reality in which we live, but model and reality are not the same thing. On the pessimistic view, only isolated (and not the most interesting) types of behavior can be modeled; one even encounters the claim that reality cannot be modeled at all, because it obeys more than cause-and-effect laws, and that at best individual (idealized) behavioral phenomena can be modeled. There is also another, more optimistic view of the possibilities of modeling, on which the pessimistic position blocks any deep comprehension of the nature of human behavior. The application of any model raises general, fundamental questions. In our opinion, there is no doubt that IRT is theoretically and technically superior to classical test theory.

The practical purpose of tests, whatever their theoretical basis, is to determine significant criteria and to establish from them the characteristics of certain psychological constructs. Does the IRT model have advantages here as well? Tests based on this model may predict no more accurately than tests based on classical theory, and their contribution to the development of psychological constructs may be no more significant. Diagnosticians prefer criteria that are directly relevant to the individual, institution, or community. A scientifically more advanced model does not ipso facto* define a more suitable criterion and is to a certain extent limited in explaining scientific constructs. It is evident that tests based on classical theory will continue to be developed, while new IRT models will also be created and extended to a greater number of psychological phenomena.

In classical test theory, the concepts of "reliability" and "validity" are distinguished. Test results must be reliable, i.e., the results of initial and repeated testing must be consistent. In addition, the results should be as free as possible from measurement error. Validity is a further requirement on the results obtained; reliability is considered a necessary but not sufficient condition for the validity of a test.

* ipso facto (Latin) - by itself (translator's note).

The concept of validity implies that the findings relate to something important in practical or theoretical terms. Conclusions drawn from test scores must be valid. Most often two types of validity are distinguished: predictive (criterion) validity and construct validity. There are also other types (see Chapter 3). In addition, validity can be determined for quasi-experiments (Cook & Campbell, 1976; Cook & Shadish, 1994). The main type, however, remains predictive validity: the ability to predict something significant about future behavior from a test score, as well as the possibility of a deeper understanding of a particular psychological property or quality.

The types of validity presented are discussed in every reference book, together with methods for analyzing test validity. Factor analysis is more appropriate for determining construct validity, while linear regression equations are used to analyze predictive validity. Certain characteristics (academic performance, effectiveness of therapy) can be predicted from one or more indicators obtained with intellectual or personality tests. Data-processing techniques such as correlation, regression, analysis of variance, and analysis of partial correlations and variances are used to determine the predictive validity of a test.

Content validity is also often described. It is assumed that all items and tasks of a test must belong to a specific domain (of mental properties, behavior, etc.); content validity characterizes the correspondence of each test item to the domain being measured. Content validity is sometimes viewed as part of reliability or "generalizability" (Cronbach, Gleser, Nanda & Rajaratnam, 1972). However, when choosing tasks for achievement tests in a specific subject area, it is also important to observe the rules for including items in the test.

In classical test theory, reliability and validity are treated relatively independently of each other. But there is another understanding of the relationship between these concepts. Modern test theory is based on the use of models: parameters are estimated within a certain model, and if an item does not meet the model's requirements, it is considered invalid within that model. Construct validation is part of the verification of the model itself. This validation refers primarily to testing the existence of a unidimensional latent trait with known scale characteristics. Scale scores can certainly be used to determine appropriate criteria, and they can be correlated with measures of other constructs to gather information about the convergent and divergent validity of the construct.

Psychodiagnostics is similar to language, described as the unity of four components presented at three levels. The first component, test theory, is analogous to syntax, the grammar of a language. Generative grammar is, on the one hand, an ingenious model, and on the other, a system that obeys rules. With the help of these rules, complex sentences are built on the basis of simple affirmative sentences. At the same time, however, this model leaves aside a description of how the communication process is organized (what is transmitted and what is perceived), and for what purposes it is carried out. Understanding this requires additional knowledge. The same can be said about test theory: it is necessary in psychodiagnostics, but it is not able to explain what a psychodiagnostician does and what his goals are.

1.3.2. Psychological theories and psychological constructs

Psychodiagnostics is always a diagnosis of something specific: personal characteristics, behavior, thinking, emotions. Tests are designed to assess individual differences. There are several concepts of individual differences, each with its own distinctive characteristics. If it is recognized that psychodiagnostics is not limited to assessing individual differences, then other theories become essential for it; an example is the assessment of differences in mental development processes and of differences in the social environment. Although the assessment of individual differences is not an indispensable attribute of psychodiagnostics, there are nevertheless established research traditions in this area. Psychodiagnostics began with the assessment of differences in intelligence. The main purpose of the early tests was "to determine the hereditary transmission of genius" (Galton) or the selection of children for schooling (Binet, Simon). The measurement of IQ received theoretical grounding and practical development in the work of Spearman (Great Britain) and Thurstone (USA); Raymond B. Cattell did the same for personality characteristics. Psychodiagnostics thus became inextricably linked with theories and ideas about individual differences in achievements (assessment of maximum capabilities) and in forms of behavior (level of typical functioning), a tradition that remains productive today. In textbooks on psychodiagnostics, differences in the social environment are assessed far less often than the characteristics of developmental processes themselves, and there is no good reason for this. On the one hand, diagnostics is not tied to particular theories and concepts; on the other hand, it needs theories, since it is in them that the content being diagnosed is defined (i.e., "what" is being diagnosed). For example, intelligence can be considered both as a general characteristic and as the basis for many mutually independent abilities. If psychodiagnostics tries to "escape" theory, ideas of common sense become the basis of the psychodiagnostic process. Research uses various methods of data analysis, and the general logic of a study determines the choice of mathematical model and the structure of the psychological concepts used. Such methods of mathematical statistics as analysis of variance, regression analysis, factor analysis, and the calculation of correlations assume the existence of linear dependencies. If these methods are used incorrectly, they "introduce" their own structure into the data obtained and the constructs used.

Ideas about differences in the social environment and in personality development have had almost no impact on psychodiagnostics. Textbooks (see, for example, Murphy & Davidshofer, 1988) examine classical test theory, discuss the relevant methods of statistical processing, describe well-known tests, and discuss the use of psychodiagnostics in practice: in management psychology, personnel selection, and the assessment of individual psychological characteristics.

Theories of individual differences (as well as ideas about differences between social environments and mental development) are analogous to the study of the semantics of language. This is the study of essence, content, and meaning. Meanings are structured in a certain way (similar to psychological constructs), for example, by similarity or contrast (analogy, convergence, divergence).

1.3.3. Psychological tests and other methodological tools

The third component of the proposed scheme comprises the tests, procedures, and methodological means with which information about personality characteristics is collected. Drenth and Sijtsma (1990, p. 31) define a test as follows: "A psychological test is considered as a classification according to a certain system or as a measurement procedure that allows a certain judgment to be made about one or more empirically isolated or theoretically based characteristics of a specific aspect of human behavior (within the test situation). In this case, the response of respondents to a certain number of carefully selected stimuli is examined, and the responses obtained are compared with test norms."

Diagnostics requires tests and techniques that collect reliable, accurate, and valid information about the features and characteristic traits of a personality, and about human thinking, emotions, and behavior. In addition to the development of test procedures, this component includes the questions of how tests are created, how items are formulated and selected, how the testing process proceeds, what the requirements for testing conditions are, how measurement errors are taken into account, and how test results are calculated and interpreted.

The test development process distinguishes between rational and empirical strategies. A rational strategy begins by defining basic concepts (for example, the concept of intelligence or extraversion), and test items are formulated in accordance with these concepts. An example of such a strategy is Guttman's facet theory (1957, 1968, 1978): first, the various aspects of the main constructs are determined; then items and tasks are selected so that each of these aspects is taken into account. In the second, empirical strategy, items are selected on an empirical basis. For example, if a researcher were creating a vocational-interest test intended to differentiate doctors from engineers, both groups of respondents would answer all test items, and the items on which statistically significant differences are found would be included in the final test. If, for example, the groups differ in their responses to the statement "I like to fish," that statement becomes an element of the test. The central premise of this book is that a test is linked to a conceptual or taxonomic theory that defines the characteristics in question.

The purpose of a test is usually defined in the instructions for its use. The test must be standardized so that it assesses differences between individuals rather than between test conditions. There are, however, deviations from standardization in procedures called "testing the limits" and "learning potential tests," in which the respondent is assisted during testing and the effect of this assistance on the result is then evaluated. The scoring of answers is objective, i.e., carried out according to a standard procedure. The interpretation of the results obtained is also strictly defined and based on test norms.

The third component of psychodiagnostics (psychological tests, instruments, procedures) contains certain items, which are the smallest units of psychodiagnostics and in this sense resemble the phonemes of a language. The number of possible combinations of phonemes is limited: only certain phonemic structures can form words and sentences that convey information to the listener. The same holds for test items: only in certain combinations with one another can they become an effective means of assessing the corresponding construct.

What is testing

In accordance with IEEE Std 829-1983, testing is a process of software analysis aimed at identifying differences between its actually existing and required properties (defects) and at evaluating the properties of the software.

According to GOST R ISO/IEC 12207-99, the software life cycle includes, among others, the supporting processes of verification, validation (certification), joint review, and audit. Verification is the process of determining that software products function in full accordance with the requirements or conditions established in previous work; it may include analysis, review, and testing. Validation (certification) is the process of determining how completely the created system or software product meets the established requirements and its functional purpose. Joint review is the process of assessing the status and, where necessary, the results of project work (products). Audit is the process of determining compliance with requirements, plans, and contract terms. Together, these processes make up what is usually called testing.

Testing is based on test procedures with specific input data, initial conditions, and expected results, developed for a specific purpose, such as checking a particular program or verifying conformance to a specific requirement. Test procedures can check various aspects of a program's functioning, from the correct operation of an individual function to the adequate fulfillment of business requirements.

When carrying out a project, it is necessary to decide according to which standards and requirements the product will be tested, and which tools (if any) will be used to find and document the defects detected. If testing is kept in mind from the very beginning of the project, testing the product under development will bring no unpleasant surprises, and the quality of the product will most likely be quite high.

Product life cycle and testing

Nowadays, iterative software development processes are increasingly used, in particular the RUP (Rational Unified Process) technology (Fig. 1). With this approach, testing ceases to be an afterthought that takes place once the programmers have written all the necessary code. Work on tests begins at the earliest stage of identifying requirements for the future product and is closely integrated with current tasks. This places new demands on testers: their role is not limited to identifying errors as fully and as early as possible; they must participate in the overall process of identifying and addressing the most significant project risks. To this end, a testing goal and methods for achieving it are defined for each iteration, and at the end of each iteration it is determined to what extent this goal has been achieved, whether additional tests are needed, and whether the principles and tools of testing need to be changed. In turn, each detected defect must go through its own life cycle.

Fig. 1. Product life cycle according to RUP

Testing is usually carried out in cycles, each of which has a specific list of tasks and goals. A testing cycle may coincide with an iteration or correspond to a specific part of it. As a rule, a testing cycle is carried out for a specific system build.

The life cycle of a software product consists of a series of relatively short iterations (Fig. 2). An iteration is a complete development cycle leading to the release of a final product, or of a shortened version of it that expands from iteration to iteration to eventually become a complete system.

Each iteration usually includes tasks of work planning, analysis, design, implementation, testing, and evaluation of the results achieved. However, the relationship between these tasks can change significantly, and iterations are grouped into phases according to the predominant tasks. The first phase, Inception, focuses on analysis tasks. Iterations of the second phase, Elaboration, focus on the design and trial of key design decisions. The third phase, Construction, contains the largest share of development and testing tasks. And in the last phase, Transition, the tasks of testing and handing the system over to the customer predominate.

Fig. 2. Iterations of the software product life cycle

Each phase has its own specific goals in the product life cycle and is considered complete when these goals are achieved. All iterations, except perhaps those of the Inception phase, end with the creation of a functioning version of the system under development.

Test categories

Tests vary significantly in the problems they solve and the technology they use.

| Test category | Category description | Types of testing |
|---|---|---|
| Current testing | A set of tests performed to check the functionality of newly added system features. | Stress Testing; business cycle testing; stress testing |
| Regression testing | Verifies that additions to the system do not reduce its existing capabilities, i.e. testing is performed against requirements that were already met before the new features were added. | Stress Testing; business cycle testing; stress testing |

Testing subcategories

| Testing subcategory | Description of the type of testing | Subtypes of testing |
|---|---|---|
| Stress Testing | Used to test all application functions without exception; the sequence in which the functions are tested does not matter. | functional testing; interface testing; database testing |
| Business cycle testing | Used to test application functions in the sequence in which the user calls them, for example, simulating all the actions of an accountant for the first quarter. | unit testing; functional testing; interface testing; database testing |
| Stress testing | Used to test application performance. The purpose of this testing is to determine the bounds of the application's stable operation. During this testing, all available functions are called. | unit testing; functional testing; interface testing; database testing |

Types of testing

Unit (module) testing - this type involves testing individual application modules. For maximum effect, testing is carried out simultaneously with the development of the modules.

Functional testing - the purpose of this testing is to ensure that the test item functions properly. It checks the correctness of navigation through the object, as well as data input, processing, and output.

Database testing - checking that the database works correctly during normal application operation, under overload, and in multi-user mode.

Unit testing

In OOP, the usual way to organize unit testing is to test the methods of each class, then the classes of each package, and so on, gradually moving to testing the whole project; the earlier tests then serve as regression tests.
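A minimal illustrative sketch using Python's standard unittest module (the Calculator class is hypothetical):

```python
# Sketch: a unit test for a hypothetical Calculator class.
import unittest

class Calculator:
    def add(self, a, b):
        return a + b

class CalculatorTest(unittest.TestCase):
    def test_add_positive(self):
        self.assertEqual(Calculator().add(2, 3), 5)

    def test_add_negative(self):
        self.assertEqual(Calculator().add(-2, -3), -5)

if __name__ == "__main__":
    unittest.main()
```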

The output documentation of these tests includes the test procedures, input data, the code executing the test, and the output data.

Functional testing

Functional testing of the test item is planned and carried out based on the testing requirements specified during the requirements definition stage. The requirements include business rules, use-case diagrams, business functions, and, if available, activity diagrams. The purpose of functional tests is to verify that the developed graphical components meet the specified requirements.

This type of testing cannot be fully automated. It is therefore divided into:

  • Automated testing (used where the output information can be verified).

Purpose: to test data input, processing, and output.

  • Manual testing (in all other cases).

Purpose: to verify that user requirements are met correctly.

It is necessary to execute (play) each use case, using both correct values and deliberately erroneous ones, to confirm correct operation against the following criteria (a code sketch follows the list):

  • the product responds adequately to all correct input data (expected results are produced in response to correctly entered data);
  • the product responds adequately to incorrectly entered data (appropriate error messages appear).
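A possible sketch of these criteria in code (the business rule, the function name, and the values are assumed for illustration):

    def register_user(age: int) -> str:
        """Hypothetical business function: accepts ages from 18 to 120."""
        if not 18 <= age <= 120:
            raise ValueError("age out of range")
        return "registered"

    # Correct input: the expected result is produced
    assert register_user(30) == "registered"

    # Deliberately erroneous input: an adequate error reaction is expected
    try:
        register_user(-5)
    except ValueError as err:
        assert "age out of range" in str(err)
    else:
        raise AssertionError("invalid input was silently accepted")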

Database testing

The purpose of this testing is to ensure that the database access methods are reliable and execute correctly, without violating data integrity.

As many database calls as possible should be executed in sequence. The approach is to design the test so as to "load" the database with a sequence of both correct and deliberately erroneous values. The database's reaction to the input data is determined, and the time intervals for processing them are estimated.
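A minimal sketch of this approach using Python's standard sqlite3 module (the schema and the values are invented for the example): the test loads the database with a sequence of correct rows, estimates the processing time, and then checks that an obviously erroneous value is rejected without breaking data integrity.

    import sqlite3
    import time

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")

    # Correct values: measure how long a batch of inserts takes
    rows = [(i, 100.0) for i in range(1000)]
    start = time.perf_counter()
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()
    print(f"1000 inserts took {time.perf_counter() - start:.4f} s")

    # Obviously erroneous value: NULL violates the NOT NULL constraint;
    # the failed insert must not damage the data already stored
    try:
        conn.execute("INSERT INTO payments VALUES (1001, NULL)")
    except sqlite3.IntegrityError:
        pass
    assert conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0] == 1000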

CHAPTER 3. STATISTICAL PROCESSING OF TESTING RESULTS

Statistical processing of test results allows one, on the one hand, to determine the subjects' results objectively and, on the other hand, to assess the quality of the test itself and of its tasks, in particular to assess its reliability. The problem of reliability received much attention in classical test theory, and that theory has not lost its relevance today: despite the appearance of more modern theories, the classical theory retains its position.

3.1. BASIC PROVISIONS OF CLASSICAL TEST THEORY

3.2. TEST RESULTS MATRIX

3.3. GRAPHICAL REPRESENTATION OF TEST SCORE

3.4. MEASURES OF CENTRAL TENDENCY

3.5. NORMAL DISTRIBUTION

3.6. VARIATION OF TEST SCORES OF SUBJECTS

3.7. CORRELATION MATRIX

3.8. TEST RELIABILITY

3.9. TEST VALIDITY

LITERATURE

BASIC PROVISIONS OF CLASSICAL TEST THEORY

The creator of the Classical Theory of mental tests is the famous British psychologist, author of factor analysis, Charles Edward Spearman (1863-1945) 1. He was born on September 10, 1863, and served in the British Army for a quarter of his life. For this reason, he received his PhD degree only at the age of 41 2. Charles Spearman carried out his dissertation research at the Leipzig Laboratory of Experimental Psychology under the direction of Wilhelm Wundt. At that time, Charles Spearman was strongly influenced by the work of Francis Galton on testing human intelligence. Charles Spearman's students were R. Cattell and D. Wechsler. Among his followers are A. Anastasi, J. P. Guilford, P. Vernon, C. Burt, A. Jensen.

Lewis Guttman (1916-1987) made a major contribution to the development of classical test theory.

The classical test theory was first presented comprehensively and completely in the fundamental work of Harold Gulliksen (Gulliksen H., 1950) 4. Since then the theory has been somewhat modified; in particular, its mathematical apparatus has been improved. Classical test theory in a modern presentation is given in the book by Crocker L., Algina J. (1986) 5. Among domestic researchers, V. Avanesov (1989) 6 was the first to describe this theory. The work of Chelyshkova M.B. (2002) 7 provides information on the statistical justification of test quality.

Classical test theory is based on the following five basic principles.

1. The empirically obtained measurement result (X) is the sum of the true measurement result (T) and the measurement error (E) 8:

X = T + E (3.1.1)

The values ​​of T and E are usually unknown.

2. The true measurement result can be expressed as the mathematical expectation of the observed score:

T = E(X),

where E(·) here denotes the expectation operator rather than the error term.

3. The correlation of the true and error components over the set of subjects is zero: ρ(T, E) = 0.

4. The error components of any two tests do not correlate:

ρ(E1, E2) = 0.

5. The error components of one test do not correlate with the true components of any other test:

ρ(E1, T2) = 0.
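These assumptions are easy to see in action with a small simulation (a minimal sketch, not part of the source; all numbers are assumed for the example): generate true scores and independent errors, and the observed variance turns out to be the sum of the true and error variances, a consequence used below.

    import random
    import statistics

    random.seed(42)
    N = 100_000

    # Assumption 1: the observed score is a true score plus an error
    T = [random.gauss(50, 10) for _ in range(N)]  # true scores, s_T = 10
    E = [random.gauss(0, 5) for _ in range(N)]    # independent errors, s_E = 5
    X = [t + e for t, e in zip(T, E)]             # X = T + E

    # Assumption 3: true and error components are uncorrelated (Python 3.10+)
    print(statistics.correlation(T, E))           # close to 0

    # Consequence: the observed variance is the sum of the component variances
    print(statistics.pvariance(X))                            # ~ 125
    print(statistics.pvariance(T) + statistics.pvariance(E))  # ~ 125

    # Reliability: the share of true variance in the observed variance
    print(statistics.pvariance(T) / statistics.pvariance(X))  # ~ 0.8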

In addition, the basis of classical test theory is formed by two definitions - parallel and equivalent tests.

PARALLEL tests must meet requirements (1)-(5); in addition, the true components of one test (T1) must be equal to the true components of the other test (T2) in every sample of subjects taking both tests. It is assumed that T1 = T2 and, moreover, that the error variances are equal: s_E1² = s_E2².

Equivalent tests must meet all the requirements of parallel tests with one exception: the true components of one test need not be equal to the true components of the other test; they may differ by the same constant c.

The condition for the equivalence of two tests is written as follows:

T1 = T2 + c12,

where c12 is the constant difference between the true results of the first and second tests.

Based on the above provisions, a theory of test reliability has been constructed 9,10. From (3.1.1) and assumption 3 it follows that

s_X² = s_T² + s_E²,    (3.1.2)

that is, the variance of the obtained test scores is equal to the sum of the variances of the true and error components.

Let's rewrite this expression as follows:

s_T²/s_X² = 1 − s_E²/s_X².    (3.1.3)

The ratio of the true variance to the observed variance in this equality is precisely the reliability of the test (r). Thus, the reliability of the test can be written as:

r = s_T²/s_X² = 1 − s_E²/s_X².    (3.1.4)

Based on this formula, various expressions were subsequently proposed for finding the test reliability coefficient. Reliability is the most important characteristic of a test: if it is unknown, the test results cannot be interpreted. The reliability of a test characterizes its accuracy as a measuring instrument; high reliability means high repeatability of test results under the same conditions.

In classical test theory, the most important problem is determining the subject's true test score (T). The empirical test score (X) depends on many conditions: the difficulty of the tasks, the preparedness of the test takers, the number of tasks, the testing conditions, etc. In a group of strong, well-prepared subjects, the test results will usually be better than in a group of poorly prepared subjects. In this regard, the question of the measure of task difficulty for the general population of subjects remains open. The problem is that real empirical data are obtained from far from random samples of subjects. As a rule, these are study groups, sets of students who interact quite strongly with each other during learning and study under conditions that are often not repeated for other groups.

We find s_E from equation (3.1.4):

s_E = s_X·√(1 − r).    (3.1.5)

Here the dependence of the measurement accuracy on the standard deviation s_X and on the reliability r of the test is shown explicitly.
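A quick numerical illustration (the values are assumed): if the observed standard deviation of the test scores is s_X = 10 and the reliability is r = 0.91, then s_E = 10·√(1 − 0.91) = 3, i.e. an individual observed score is accurate only to within about ±3 raw-score units (one standard error of measurement).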

The applications, goals, and objectives of software testing are varied, so testing is evaluated and explained in different ways. Sometimes even the testers themselves find it difficult to explain what software testing is "as such". Confusion ensues.

To untangle this confusion, Alexey Barantsev (a practitioner, trainer, and consultant in software testing, formerly of the Institute for System Programming of the Russian Academy of Sciences) opens his testing trainings with an introductory video about the main concepts of testing.

It seems to me that in this talk the speaker managed to explain "what testing is" most adequately and in a balanced way, from the standpoint of a scientist and a programmer. It is strange that this text has not yet appeared on Habr.

I give here a condensed retelling of this talk. At the end of the text there are links to the full version, as well as to the video mentioned.

Testing Basics

Dear Colleagues,

First, let's try to understand what testing is NOT.

Testing is not development,

Even if testers know how to program, write tests (test automation = programming), and develop some auxiliary programs (for themselves).

However, testing is not a software development activity.

Testing is not analysis,

And not the activity of collecting and analyzing requirements.

Although in the course of testing you sometimes have to clarify the requirements, and sometimes to analyze them, this activity is not the main one; it is done simply out of necessity.

Testing is not management,

Despite the fact that in many organizations there is such a role as “test manager”. Of course, testers need to be managed. But testing in itself is not management.

Testing is not technical writing,

However, testers have to document their tests and their work.

Testing cannot be considered one of these activities simply because, when developing (or analyzing requirements, or writing documentation for their tests), testers do all this work for themselves, and not for someone else.

An activity is significant only when it is in demand, that is, testers must produce something “for export.” What do they do “for export”?

Defects, defect descriptions, or test reports? This is partly true.

But this is not the whole truth.

Main activity of testers

is to provide the participants in a software development project with negative feedback about the quality of the software product.

“Negative feedback” does not carry any negative connotation here, and does not mean that the testers are doing something bad or doing their job badly. It is just a technical term meaning a fairly simple thing.

But this thing is very significant, and probably the single most significant component of testers' work.

There is a science called systems theory. It defines the concept of “feedback”.

“Feedback” is data (or some part of the data) that returns from the output of a system to its input. Feedback can be positive or negative.

Both types of feedback are equally important.

In the development of software systems, positive feedback is, of course, information that we receive from end users: requests for new functionality, growth in sales (if we release a quality product).

Negative feedback can also come from end users in the form of some negative reviews. Or it can come from testers.

The earlier negative feedback is provided, the less energy is needed to act on that signal. That is why testing needs to start as early as possible, at the earliest stages of the project, and provide this feedback both at the design stage and, perhaps, even earlier, at the stage of gathering and analyzing requirements.

By the way, this is where the understanding grows that testers are not responsible for quality. They help those who are responsible for it.

Synonyms for the term "testing"

From the point of view that testing is the provision of negative feedback, the world-famous abbreviation QA (Quality Assurance) is definitely NOT a synonym for the term “testing”.

Providing negative feedback alone cannot be considered quality assurance, because assurance implies positive measures: it is understood that in this case we ensure quality and take timely steps so that the quality of software development improves.

But “quality control” (Quality Control) can be considered, in a broad sense, a synonym for the term “testing”, because quality control is the provision of feedback in its most varied forms, at various stages of a software project.

Sometimes testing is understood as just one particular form of quality control.

The confusion stems from the history of testing. At different times, the term “testing” denoted different activities, whose definitions can be divided into two large classes: external and internal.

External definitions

The definitions that Myers, Beizer, and Kaner gave at different times describe testing precisely from the point of view of its EXTERNAL significance. That is, from their point of view, testing is an activity that is intended FOR something, rather than consisting of something. All three of these definitions can be summarized as the provision of negative feedback.

Internal Definitions

These are definitions that are contained in a standard for terminology used in software engineering, such as a de facto standard called SWEBOK.

Such definitions constructively explain WHAT the testing activity consists of, but give no idea of WHY testing is needed and how all the results obtained from checking the correspondence between the program's actual and expected behavior will later be used.

So, testing is:

  • checking that the program complies with the requirements,
  • carried out by observing its operation
  • in special, artificially created situations, chosen in a certain way.
From here on, we will consider this to be the working definition of “testing”.

The general testing scheme is approximately as follows:

  1. As input, the tester receives the program and/or the requirements.
  2. He does something with them: observes the operation of the program in certain situations he has artificially created.
  3. As output, he receives information about matches and mismatches.
  4. This information is then used to improve the existing program, or to change the requirements for a program that is still being developed.

What is a test

  • It is a special, artificially created situation, chosen in a certain way,
  • together with a description of which observations of the program's operation to make
  • in order to check whether it meets some requirement.
There is no need to assume that a situation is something momentary. A test can be quite long: for example, in performance testing the artificially created situation may be a load applied to the system for quite a long time, and the observations to be made are a set of graphs or metrics that we measure while the test is running.
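For example (a deliberately simplified sketch; the operation under test and the throughput requirement are assumed), even a long performance test fits this definition: the artificial situation is a sustained stream of requests, and the observation is a measured metric compared against the expectation.

    import time

    def handle_request(x):
        """Hypothetical operation under test."""
        return x * x

    # Artificial situation: a sustained stream of 100,000 requests
    start = time.perf_counter()
    for i in range(100_000):
        handle_request(i)
    elapsed = time.perf_counter() - start

    # Observation: throughput, compared with an assumed requirement
    throughput = 100_000 / elapsed
    assert throughput > 10_000, f"too slow: {throughput:.0f} requests/s"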

The test designer's task is to select a limited set of tests from this huge, potentially infinite set.

Thus, we can conclude that during testing the tester does two things.

1. First, he controls the execution of the program, creating the artificial situations in which we are going to check the program's behavior.

2. Second, he observes the behavior of the program and compares what he sees with what is expected.

If a tester automates tests, then he does not observe the program's behavior himself: he delegates this task to a special tool or to a program he has written. That tool observes, compares the observed behavior with the expected behavior, and gives the tester only the final result: whether the observed behavior matches the expected behavior or not.

Any program is a mechanism for processing information: the input is information in one form, the output is information in some other form. A program can have many inputs and outputs; that is, a program can have several different interfaces, and these interfaces can be of different types:

  • User Interface (UI)
  • Application Programming Interface (API)
  • Network protocol
  • File system
  • Environment state
  • Events
The most common are user interfaces:
  • graphical,
  • text,
  • console,
  • and speech.
Using all these interfaces, the tester:
  • somehow creates artificial situations,
  • and checks how the program behaves in these situations.

This is testing.

Other classifications of testing types

The most commonly used division into three levels is
  1. unit testing,
  2. integration testing,
  3. system testing.
Unit testing usually means testing at a fairly low level, that is, testing individual operations, methods, and functions.

System testing means testing at the level of the user interface.

Other terms are sometimes used as well, such as “component testing”, but I prefer to single out these three, because the technological division into unit and system testing does not make much sense: the same tools and the same techniques can be used at different levels, so the division is conditional.

Practice shows that tools positioned by their manufacturers as unit testing tools can be used with equal success to test an entire application as a whole.

And tools that test an entire application at the user-interface level sometimes also want to look, for example, into the database, or to call some individual stored procedure in it.

That is, technically speaking, the division into system and unit testing is purely conditional: the same tools and the same techniques are used, and at each level we can talk about testing of a different type.

We combine:

That is, we can talk about unit testing of functionality.

We can talk about system testing of functionality.

We can talk about unit testing of, for example, efficiency.

We can talk about system testing of efficiency.

Either we consider the efficiency of a single algorithm, or we consider the efficiency of the system as a whole. The technological division into unit and system testing does not make much sense, because the same tools and the same techniques can be used at different levels.

Finally, during integration testing we check whether the modules within a system interact with each other correctly. In effect, we perform the same tests as during system testing, only we additionally pay attention to exactly how the modules interact with each other; we perform some additional checks. That is the only difference.

Let us try once again to grasp the difference between system and unit testing; since this division is made quite often, the difference should exist.

And this difference manifests itself when we perform not a technological classification, but a classification by the purpose of testing.

Classification by purpose is conveniently done using the “magic square”, originally invented by Brian Marick and later improved by Ari Tennen.

In this magic square, all types of testing are located in four quadrants, depending on what the tests pay more attention to.

Vertically: the higher a type of testing is located in the square, the more attention is paid to the external manifestations of the program's behavior; the lower it is, the more attention is paid to the program's internal technological structure.

Horizontally: the further to the left our tests are, the more attention we pay to programming them; the further to the right, the more attention we pay to manual testing and human investigation of the program.

In particular, terms such as acceptance testing (Acceptance Testing) and unit testing can easily be placed in this square in the sense in which they are most often used in the literature. Unit testing is low-level testing with an overwhelming share of programming: all tests are programmed, run completely automatically, and attention is paid primarily to the internal structure of the program, to its technological features.

In the upper right corner will be manual tests aimed at the external behavior of the program, in particular usability testing; and in the lower right corner, most likely, tests of various non-functional properties: performance, security, and so on.

So, according to the classification by purpose, unit testing occupies the lower left quadrant, and all the other quadrants represent system testing.

Thank you for your attention.