Measuring and rating scales. Standard deviation scale


Scaling test results

Stevens (1946) defined 4 levels of measurement scales, differing in the degree to which their ratings retain the properties of the set of real numbers. These are the scales:

Nominal (or nominative, naming scale)

Ordinal

Interval

Ratio scale.

Interpretation of test results

In tests with norm-referenced interpretation, the main task is to determine the relative standing of each test taker within the overall group of test takers. Obviously, a subject's standing depends on the group against which he is being assessed. The same result can be classified as quite high if the group is weak and quite low if the group is strong. That is why it is necessary, whenever possible, to use norms that reflect the test results of a large representative sample of subjects.

In tests with criterion-referenced interpretation, the task is to compare the educational achievements of each student with the amount of knowledge, skills and abilities planned for acquisition. In this case, a specific content area, rather than a particular sample of subjects, serves as the interpretative frame of reference. The main problem is establishing a passing score that separates those who have mastered the material being tested from those who have not.

Establishing test performance standards

To make interpretation independent of the results of the other test participants, special test performance norms are used: the primary score of an individual test taker is compared with these norms. Norms are a set of indicators established empirically from the results of a test taken by a clearly defined sample of subjects. Developing and obtaining these indicators constitutes the norming (or standardization) of the test. The most common norms are the mean and standard deviation of the set of individual scores. Relating a subject's primary score to the norms establishes the subject's place in the sample used to standardize the test.

Types of scales used to convert raw scores

The most famous primary score conversions are:

Percentile rank, reflecting the percentage of subjects in the normative group whose results are lower than or equal to a given value of the primary score;

Linear Z-score, defined as the ratio of the deviation of an individual test score from the group mean to the standard deviation for the group of subjects;

Scores that are a linear transformation of Z-scores (T-scale, standard IQ scores, etc.);

Stanine and sten scales, which are obtained by dividing the primary point scale into various intervals.

Percentile rank scale

Percentiles make it possible to establish the rank of the subject’s primary indicator in the normative group. The percentile rank corresponding to a given primary score shows the percentage of subjects in the normative sample whose results are not higher than this primary score.
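The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a standard routine: the function name `percentile_rank` and the ten-score normative sample are hypothetical.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the normative sample whose results are not higher than `score`."""
    if not norm_scores:
        raise ValueError("normative sample must not be empty")
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100.0 * at_or_below / len(norm_scores)


# Hypothetical normative sample of ten primary scores.
norm_sample = [12, 15, 15, 18, 20, 21, 21, 23, 27, 30]
print(percentile_rank(21, norm_sample))  # 7 of 10 scores are <= 21
```

Note that the result describes a position among subjects, not the share of items answered correctly.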

Percentiles should not be confused with percentages that represent the share of test items a test taker completed correctly. Unlike the latter, which is a primary indicator, a percentile is a derived indicator: it gives a proportion of the total number of subjects in the group.

Despite their ease of interpretation, percentile ranks have significant disadvantages. The percentile rank scale is non-linear: in different regions of the raw score scale, a 1-point increase may correspond to different increases on the percentile scale. As a result, percentiles do not merely fail to reflect the real differences in test results, they distort them.

Therefore, the use of percentiles is quite limited. Due to their convenience and simplicity, they are used mainly in normative tests for self-assessment of students’ knowledge, reporting the results to the students themselves and their parents.

Z-scale

The Z-scale converts individual results into a standard scale with a common mean and a common measure of dispersion. The Z-score of the i-th student is found using the formula:

$$Z_i = \frac{X_i - \bar{X}}{S_X},$$

where $X_i$ is the primary score of the i-th subject; $\bar{X}$ is the mean of the individual scores of the N subjects in the tested group (i = 1, 2, …, N); and $S_X$ is the standard deviation of the set of primary scores.

The Z-scale is a standard scale with a mean of zero and a standard deviation of one. It makes it possible to bring student scores obtained on different tests into a single form convenient for comparison.

The magnitude of a Z-score equals the distance between the primary score in question and the group mean, expressed in standard deviation units: it shows how many standard deviations the subject's primary score lies below or above the group mean.
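As a minimal sketch of this calculation, the Python snippet below converts a small group of primary scores into Z-scores. The data are invented, and the use of the population standard deviation (`statistics.pstdev`) for the tested group is an assumption; a sample standard deviation could be substituted.

```python
from statistics import mean, pstdev


def z_scores(raw_scores):
    """Return the Z-score of every primary score in the group."""
    group_mean = mean(raw_scores)
    group_sd = pstdev(raw_scores)  # assumption: SD taken over the whole tested group
    if group_sd == 0:
        raise ValueError("all scores are identical; Z-scores are undefined")
    return [(x - group_mean) / group_sd for x in raw_scores]


# Hypothetical primary scores: mean 5, standard deviation 2.
raw = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_scores(raw))
```

A score of 9 here lies two standard deviations above the group mean, and a score of 2 lies one and a half below it.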

Z-scores, with rare exceptions, take values in the range (-3, +3). While convenient for scientific analysis when developing new tests, the Z-scale is inconvenient for practical use in assessing the knowledge of a group of subjects. Z-scores can take fractional and negative values, which are awkward in calculations and difficult for test users to interpret. Rounding Z-scores to whole numbers is not always acceptable, because the main purpose of creating tests is to identify differences in the preparation of subjects. Negative Z-scores, which indicate results below the average for the group tested, cause a further inconvenience: students who receive them are likely to reject them outright. All this makes the Z-score unsuitable for reporting results to test takers and forces the use of special conversion methods to assign grades to students.

Z-score transformations

Z-score transformations aim to convert Z-scores into values that are easier to write and explain. The transformation must be linear in order to preserve the shape of the Z-score distribution. The general formula for such a transformation is

$$Z_1 = M + \sigma \cdot Z,$$

where $Z_1$ is the converted score, M is the new mean (the mean of the scores after transformation), and σ is the new standard deviation. Different transformations use different values of M and σ. Here are some of the best-known Z-score transformations.

T-scale (McCall, 1939, for reporting children's performance on a test of mental ability). The mean is set to M = 50 and the standard deviation to σ = 10, giving $Z_1 = 50 + 10 \cdot Z$.

CEEB scale (ETS, for communicating college admissions test results to applicants). The mean is set to M = 500 and the standard deviation to σ = 100, giving $Z_1 = 500 + 100 \cdot Z$.

IQ scale (Wechsler, 1939, for interpreting scores on adult intelligence scales). The mean is set to M = 100 and the standard deviation to σ = 15, giving $Z_1 = 100 + 15 \cdot Z$.
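The three transformations differ only in the chosen mean and standard deviation, which a short Python sketch makes explicit. The dictionary keys and the function name `transform_z` are illustrative; "CEEB" labels the admissions-test scale with mean 500 and standard deviation 100.

```python
# (new mean M, new standard deviation sigma) for each scale described above.
SCALES = {
    "T": (50, 10),
    "CEEB": (500, 100),
    "IQ": (100, 15),
}


def transform_z(z, scale):
    """Apply the linear transformation Z1 = M + sigma * Z for the named scale."""
    new_mean, new_sd = SCALES[scale]
    return new_mean + new_sd * z


for name in SCALES:
    # The same Z-score of 1.5 expressed on each scale.
    print(name, transform_z(1.5, name))
```

Because the transformation is linear, the ordering of subjects and the shape of the distribution are unchanged; only the units differ.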

Stanine and Sten scales

Sometimes results are reported on scales consisting of single integers, for example from 1 to 9 or from 1 to 10. This is convenient for reporting test results because such scales are obviously simple.

Dividing the normal distribution into 9 intervals yields the stanine scale, which has 9 standard units. On this scale, the mean is 5 and the standard deviation is approximately 2. When assessing subjects' performance on any test with any number of items, the worst 4% of results are assigned stanine 1 and the best 4% stanine 9. The next worst and next best 7% of results are assigned stanines 2 and 8, respectively. The next 12% on each side are assigned stanines 3 and 7, the next 17% stanines 4 and 6, and finally the middle 20% of results are assigned stanine 5.
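The percentage bands above can be turned into a simple lookup. The following Python sketch maps a percentile rank to a stanine; the function name is hypothetical, and assigning a result that falls exactly on a band boundary to the lower stanine is an assumption, since the text does not specify it.

```python
# Share of results (in percent) assigned to stanines 1..9.
STANINE_BAND_PERCENTS = [4, 7, 12, 17, 20, 17, 12, 7, 4]


def stanine_from_percentile(percentile_rank):
    """Map a percentile rank (0..100) to a stanine using cumulative bands."""
    if not 0 <= percentile_rank <= 100:
        raise ValueError("percentile rank must lie between 0 and 100")
    upper_bound = 0
    for stanine, width in enumerate(STANINE_BAND_PERCENTS, start=1):
        upper_bound += width  # cumulative bounds: 4, 11, 23, 40, 60, 77, 89, 96, 100
        if percentile_rank <= upper_bound:
            return stanine
    return 9  # not reached for valid input, because the bands sum to 100


print(stanine_from_percentile(50))  # a median result falls in the middle band
```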

In the sten scale, often called the Cattell scale, the entire range of results is divided into 10 parts, each 0.5 standard deviations wide. The mean of the sten scale is taken to be 5.5, and adjacent standard units are 0.5 standard deviations apart.
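A rough Python sketch of the ten-unit scale described above: with a mean of 5.5 and bands 0.5 standard deviations wide, a Z-score can be mapped to an integer from 1 to 10. The function name and the handling of results exactly on a band boundary (assigned to the upper band) are assumptions made for illustration.

```python
import math


def sten_from_z(z):
    """Map a Z-score to an integer unit from 1 to 10 (bands 0.5 SD wide, mean 5.5)."""
    # floor(2z + 6) puts -0.5 <= z < 0 into unit 5 and 0 <= z < 0.5 into unit 6.
    unit = math.floor(2 * z + 6)
    # Results beyond +/-2 SD are collected in the extreme units 1 and 10.
    return max(1, min(10, unit))


for z in (-3.0, -0.25, 0.25, 1.6, 2.3):
    print(z, sten_from_z(z))
```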

Sometimes an eleven-point scale is obtained from the stanine scale by identifying one percent of the strongest and weakest subjects and assigning them a maximum and minimum score, respectively.

Establishing a passing score

There are many known methods for establishing a passing score in criterion-based testing. All methods are divided into absolute and relative. Almost all methods involve experts in the procedure for determining the passing score. Let's look at some of the well-known methods.

Task-centered methods

Nedelsky method (1954) – for closed tasks.

Each expert analyzes every task and crosses out the answer options that a minimally competent subject would be able to reject as incorrect. For each task, the expert records the reciprocal of the number of remaining options. For example, if in a task with five options the expert crossed out two, he records 1/3 for this task. All these reciprocals are then summed. The resulting number is that expert's estimate of the expected score of a minimally competent subject. The estimates of all experts are then averaged.
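The arithmetic of the procedure can be sketched as follows. The function names and the two experts' judgments are hypothetical; each inner list holds, per task, the number of answer options the expert left uncrossed.

```python
from fractions import Fraction


def nedelsky_expert_score(remaining_options):
    """Sum of reciprocals of the numbers of options left for each task."""
    return sum(Fraction(1, n) for n in remaining_options)


def nedelsky_cut_score(experts):
    """Average the per-expert sums to obtain the passing score."""
    per_expert = [nedelsky_expert_score(options) for options in experts]
    return float(sum(per_expert) / len(per_expert))


# Hypothetical judgments of two experts on a four-task test.
experts = [
    [3, 2, 4, 1],  # expert A: 1/3 + 1/2 + 1/4 + 1
    [2, 2, 5, 1],  # expert B: 1/2 + 1/2 + 1/5 + 1
]
print(nedelsky_cut_score(experts))
```

Exact fractions are used only to avoid rounding noise while summing reciprocals; ordinary floats would serve equally well.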

Angoff method (1971). Experts are asked to imagine a group of minimally competent subjects and, for each task, to estimate the proportion of subjects in this group who would answer the task correctly. (This is the same as estimating the probability that a minimally competent subject will answer an item correctly.) These probabilities are summed for each expert, and the sums are averaged across all experts.
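A minimal Python sketch of this calculation, with invented ratings: `ratings[e][i]` is expert e's estimated probability that a minimally competent subject answers task i correctly. The function name is illustrative.

```python
def angoff_cut_score(ratings):
    """Sum each expert's probabilities over tasks, then average across experts."""
    if not ratings:
        raise ValueError("at least one expert is required")
    per_expert_sums = [sum(expert_ratings) for expert_ratings in ratings]
    return sum(per_expert_sums) / len(per_expert_sums)


# Hypothetical probabilities from two experts for a four-task test.
ratings = [
    [0.6, 0.8, 0.5, 0.9],
    [0.7, 0.7, 0.4, 1.0],
]
print(angoff_cut_score(ratings))  # expected raw score of a minimally competent subject
```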

Ebel method (1972). This method uses a two-dimensional grid to categorize each task. Experts are asked to classify all tasks by difficulty (three levels are offered: easy, medium, difficult) and by the relevance of their content (four levels are offered: essential, important, acceptable, questionable). Thus, all tasks are distributed among the cells of this grid. The experts must then judge how a minimally competent test taker will perform on the tasks in each cell, i.e. indicate the percentage of the tasks in the cell that he must answer correctly.
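One way to turn such a grid into a passing score is to weight each cell's percentage by the number of tasks in the cell. The Python sketch below does this for a single expert; the grid contents and the function name are hypothetical, and the source does not state how the cells are combined, so this weighting is an assumption.

```python
def ebel_cut_score(grid):
    """Expected number of correct answers summed over all grid cells.

    `grid` maps (difficulty, relevance) to (number of tasks, percent correct
    expected from a minimally competent test taker).
    """
    return sum(n_tasks * percent / 100.0 for n_tasks, percent in grid.values())


# Hypothetical grid filled in by one expert (only non-empty cells are listed).
grid = {
    ("easy", "essential"): (10, 90),
    ("medium", "important"): (8, 60),
    ("difficult", "acceptable"): (5, 40),
}
print(ebel_cut_score(grid))  # 9.0 + 4.8 + 2.0 correct answers out of 23 tasks
```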

Subject-centered methods (Nedelsky, 1954; Zieky, Livingston, 1977)

Contrast group method

The experts agree on what test performance corresponds to the level of minimum competence. They then divide all subjects into two groups, competent and incompetent (excluding those who, in their opinion, are on the border). Next, the score distributions of the two groups are plotted on a single graph. The point where the two curves intersect is taken as the passing score.

Boundary group method

In contrast to the previous method, experts are asked to identify subjects who, in their opinion, are on the border between two contrasting groups that differ in competence. The median of the distribution of scores of the selected group is taken as the passing score.
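In code, the boundary group method reduces to taking a median. A minimal Python sketch with invented scores of the subjects whom the experts placed on the border:

```python
from statistics import median


def boundary_group_cut_score(boundary_scores):
    """Passing score = median test score of the subjects judged to be on the border."""
    if not boundary_scores:
        raise ValueError("the boundary group must not be empty")
    return median(boundary_scores)


# Hypothetical test scores of six borderline subjects.
boundary_scores = [14, 15, 15, 16, 18, 19]
print(boundary_group_cut_score(boundary_scores))
```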

Critics of this approach point out that establishing a passing score based on the test takers' performance does not essentially correspond to the main purpose of criterion-referenced testing, because this approach is not related to the content of the test.

Standardization

Standardization is the unification of the test procedure and scoring, bringing them to uniform standards. Standardizing the methodology makes the results obtained from different subjects comparable and makes it possible to express test scores relative to the standardization sample.

1) Standardization as the elaboration and regulation of the procedure: unification of instructions, examination forms, methods of recording results, conditions for conducting the examination, and characteristics of the populations of subjects. Strict uniformity of the examination procedure is a prerequisite for ensuring the reliability of the test and for determining the test norms used to assess examination results.

2) Standardization as the transformation of the original rating scale into a new scale based not on the quantitative values of the indicator being studied but on its relative place in the distribution of results in the sample of subjects.

Stages of standardization

Stage 1. Creation of a uniform testing procedure.

It consists of specifying the following aspects of the diagnostic situation:

· Testing conditions (room, lighting, and other external factors).

· The content of the instruction and the features of its presentation (tone of voice, pauses, speed of speech, etc.).

· Availability of standard stimulus material (for example, Rorschach cards).

· Time restrictions for performing this test.

· Standard form for performing this test.

· Taking into account the influence of situational factors on the testing process and result.

· Taking into account the influence of the diagnostician's behavior on the testing process and result.

· Taking into account the influence of the subject's experience in testing.

Stage 2. Creating a uniform assessment of test performance: standard interpretation of the results obtained and preliminary standard processing. At this stage, the obtained score is compared with the norm for performing this test at a given age.

Stage 3. Determination of test performance standards. Standards are developed for different ages, professions, genders, etc.

z-standard score

The most common transformations of primary scores are centering and normalization by the standard deviation. Normalization involves moving to other units of measurement. The usual normalization function is the z-score (standard score), which expresses the deviation of an individual result X in units proportional to the standard deviation.

Standard scores, calculated through linear or nonlinear transformation of primary scores that follow a normal or near-normal distribution, have become the most widespread in psychodiagnostics. The calculation performs a z-transformation of the scores. To determine the z-standard score, find the difference between the individual primary result and the mean of the normative group, then divide this difference by the standard deviation σ of the normative sample:

$$z = \frac{X - M_x}{\sigma}$$

X – raw score (number of tasks completed)

Mx – mean number of completed tasks for the entire sample

σ – standard deviation (SD in the foreign literature)

The mathematician Carl Gauss proposed a function that describes the normal distribution. Its graph is a symmetrical, unimodal, bell-shaped curve (the bell curve).

Let Mx denote the arithmetic mean and σ (lowercase sigma) the standard deviation. With a normal distribution, practically all values lie within Mx ± 5σ.

Within Mx ± σ lie 68.27% of the values; the remaining 31.73% are split symmetrically, about 15.87% in each tail.

Within Mx ± 2σ lie 95.45% of the values.

Within Mx ± 3σ lie 99.73% of the values.
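These shares can be checked numerically. For a normal distribution the proportion of values within k standard deviations of the mean equals erf(k/√2); the short Python sketch below evaluates it with the standard library. Printed tables sometimes differ from these figures in the last digit because of rounding.

```python
from math import erf, sqrt


def coverage(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    inside = 100 * coverage(k)
    tail = (100 - inside) / 2  # share in each of the two symmetric tails
    print(f"within {k} SD: {inside:.2f}%, each tail: {tail:.2f}%")
```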

PERCENTILES

Percentile – the percentage of individuals in the standardization sample whose results are lower than a given primary score. The percentile scale can be regarded as a set of 100 rank gradations, starting from rank 1, which corresponds to the lowest result.

The 50th percentile (P50) corresponds to the median of the distribution of results.

Percentiles should not be confused with ordinary percentages. The latter represent the proportion of correct solutions out of the total number of test items in an individual result. Ranks P1 and P100 go to the lowest and highest results observed in the sample, but these ranks may be far from a zero score (no correct solutions) or a perfect score (all solutions correct). For example, with 120 tasks in total, the minimum result corresponding to rank P1 may be 6 correct solutions, while the maximum result corresponding to rank P100 may be 95 correctly solved tasks. This situation occurs, for example, when scoring speed tests.

The main disadvantage of percentile scales is the unevenness of their units of measurement. In a normal distribution, individual values are tightly grouped in the center of the distribution and become sparser toward the edges. Therefore, equal frequencies of cases correspond to shorter intervals along the x-axis near the center than at the edges of the distribution. Percentiles show the relative position of each subject in the normative sample, but not the magnitude of the differences between results. This creates some inconvenience in interpreting individual results. Thus, the difference in primary scores corresponding to the interval from P70 to P80 can amount to 10 points, while the difference in the number of correct solutions in the range from P50 to P60 is only 1 to 3 points.
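The unevenness can be illustrated for an ideal normal distribution. The Python sketch below expresses the width of a ten-percentile band in standard deviation units using `statistics.NormalDist`; the function name `z_gap` is illustrative, and real test data will deviate from these idealized figures.

```python
from statistics import NormalDist


def z_gap(p_low, p_high):
    """Distance, in standard deviations, between two percentiles of a normal distribution."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.inv_cdf(p_high / 100) - standard_normal.inv_cdf(p_low / 100)


for low, high in ((50, 60), (70, 80), (89, 99)):
    print(f"P{low} to P{high}: {z_gap(low, high):.2f} SD")
```

The same ten-percentile step spans a progressively wider interval of scores the further it lies from the median.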

At the same time, percentile scores also have a number of advantages. They are easily understandable for users of psychodiagnostic information, are universal in relation to various types of techniques and are easy to calculate.

Statistical norms

A. Statistical norms. Boundary values on the test score scale, formed on the basis of the frequency distribution of test scores in the standardization sample. As a rule, these cutoff values separate a fixed percentage of subjects from the sample: 10% (decile), 25% (quartile), 50% (median). With a normal distribution, the statistical norm is described using parameters (mean plus or minus sigma, the standard deviation). Statistical norms serve to make "comparative decisions" and do not provide information for making "normative decisions".

B. Age norms. Particular versions of psychodiagnostic norms collected for children of different ages.

C. Criterion norms. Diagnostic standards that specify the correspondence between test scores on the scale of the measured property and the level of the criterion indicator. In the case of criterion behavior, criterion norms indicate the probability that the criterion behavior will occur for a given test score.

D. School norms. Developed on the basis of school achievement tests or school aptitude tests.

E. Professional norms. Established on the basis of tests for various professional groups.

F. Local norms. Established for narrow categories of people distinguished by a common characteristic: age, gender, geographical area, socioeconomic status.

G. National norms. Developed for representatives of a given nation or of a country as a whole.

STANINES

An example of a nonlinear transformation into a standard scale is the stanine scale (from English "standard nine"), where scores take values from 1 to 9, M = 5, σ = 2.

The stanine scale is becoming increasingly widespread because it combines the advantages of standard scores with the simplicity of percentiles. Primary scores are easily converted into stanines. To do this, subjects are ranked in ascending order of results and divided into groups whose sizes are proportional to the frequencies of the corresponding scores in the normal distribution of test results.

STENS

Scores are converted to the sten scale (from English "standard ten") by a similar procedure, the only difference being that this scale is based on ten standard intervals.

The results of examining the professional abilities of the subjects are entered into special scales, which make it possible to use psychometric tools to reach a scientifically grounded conclusion about which candidate is preferable for a vacant position.

Measurement is the transformation of certain properties and qualities into known, easily interpreted and processed units called numbers. Measurement is the assignment of numbers to the properties and qualities of subjects and objects in accordance with certain rules. A scale is a form of recording a set of characteristics of the object being studied and ordering them into a certain numerical system.

I. Measuring scales are a form of recording and a method of ordering the set of features of the psychological phenomena or processes being studied into a certain numerical system. The use of scales stems from the need for qualitative and quantitative assessment (with a view to subsequent comparison) of certain characteristics and variables.

Traits and variables are measurable psychological phenomena. Such phenomena can be: time to solve a problem, the number of mistakes made, the level of anxiety, an indicator of intellectual lability, an indicator of sociometric status, etc.

Measurements in psychological research are not an end in themselves; they are a way of obtaining new, additional information needed to describe the psychological phenomena or processes being studied and to predict the directions and trends of their possible change.

The sequence of work of a psychologist who studies specific psychological phenomena or processes through statistical processing, systematization and analysis of empirical (experimental) data is as follows:

· Clearly identify the properties and qualities being studied (for example, give an accurate definition of the particular character trait or professionally important quality of a person).

· Choose reliably distinguishable gradations (signs) of these properties, i.e. set the units of measurement for the property.

· Assign numbers to the qualities under study or their properties (taken as a unit of measurement), which makes it possible to classify the measured objects, order them according to the specified properties, or rank them by the degree of expression of these properties. Various statistical quantities are used for this purpose: conditional scores, significance ranks of the studied quantities, factor "weights", etc.

· Measure the property or quality being studied using the selected units.

· Carry out statistical processing of the psychological indicators obtained.

The statistical material collected in the survey must be properly analyzed from methodological and psychological positions. To do this, it is necessary to establish the type of measuring scale and the permissible transformations of the statistical values included in it.

The classification of measurement scales is based on the criterion of metric determinacy proposed by the American psychologist S.S. Stevens. In accordance with this criterion, measurement scales are usually divided into non-metric (nominal scales, ordinal scales) and metric (interval scales, ratio scales).

Scales of qualitative characteristics.

1. In the naming scale (also called the nominative scale), recording qualitative information amounts to assigning the attribute to a particular class. An example of a nominative scale is a dichotomous scale consisting of only two cells, for example: an expert voted "for" or "against". A feature that varies along a dichotomous naming scale is called alternative. A more complex version of the nominative scale is a classification with three or more cells, for example: "choice of candidate A – candidate B – candidate C – candidate D". In this case, a statistical connection can be established between groups of characteristics (correlation analysis). However, there may be no relationship between the measured characteristics (Table 11).

Table 11

Example of a naming scale

Managers | Leadership style: democratic | Leadership style: liberal

To analyze the relationship between data measured on a naming scale, the following correlation coefficients are most often used: a) coefficients of 2 × 2 (four-cell) contingency (contingency coefficient Q; association coefficient φ); b) coefficients of m × n (multi-cell) contingency (Pearson's mutual contingency coefficient C; Chuprov's mutual contingency coefficient K).

When the distribution across classes has been established, it is possible to determine the absolute and relative frequencies of occurrence of the characteristics and to determine the mode (the median requires at least an ordinal scale).

2. In the order scale, it is permissible to divide the set of characteristics into elements connected by "more than / less than" relations (Table 12).

Table 12

Example of an order scale

Result | Opposite
Ability to manage oneself | Inability to manage oneself
Clear personal values | Blurred personal values
Clear personal goals | Vague personal goals
Continuing self-development | Stopped self-development
Good problem-solving skills | Lack of such skills
Creativity | Lack of creativity
Ability to influence others | Inability to influence others

Expert assessments are most often presented on an ordinal scale, since, for example, during an expert survey, it is easier for a specialist to answer questions of a qualitative, comparative nature (Ivanov is preferable to Petrov) than quantitative ones. When statistically processing empirical material, it is possible to determine the median of the distribution and calculate rank correlation coefficients.

The ordinal scale must have at least three classes, for example, "positive reaction – neutral reaction – negative reaction" or "suitable for a vacant position – suitable with reservations – not suitable", or $X_A = X_B$; $X_A < X_B$; $X_A > X_B$.

Quantitative trait scales are interval scales and ratio scales.

3. The interval scale orders, classifies and evaluates features according to the degree of expression of the property being measured, relative to a certain interval (standard), on the principle "more by a certain number of units, less by a certain number of units". Intervals can define the levels of development of the psychological parameter being measured. The zero reference point can be set arbitrarily (Table 13).

Table 13

Example of an interval scale

Subjects | IQ | Level of intelligence | Degree of compliance with profession requirements
 | | | Does not correspond
 | | Below average | Does not correspond
 | | | Corresponds
 | | Above average | Corresponds
 | | Outstanding | Corresponds

The standard deviation is used as the interval in this scale. Interval features can be: time to solve a problem, expressed by converting raw scores into standard deviation units; standard indicators such as IQ, T-scores, percentiles, etc.

Acceptable transformations: calculation of arithmetic means and standard deviations; correlation coefficients for two variables (Spearman's correlation coefficient r_s, the Goodman and Kruskal measure, the Kendall measure, Somers' d, the covariance COV, Pearson's linear correlation coefficient r_xy); and a correlation coefficient for several variables (the concordance coefficient W).

4. In the ratio scale, features are classified in proportion to the degree of expression of the property being measured, and numerical values are assigned to the measured features on the principle of similarity, proportionality, equality or inequality, etc. The ratio scale has a meaningful zero point, which indicates the complete absence of the measured property or quality, and the features stand in a numerical proportional relationship (for example, 2 is to 4 as 4 is to 8).

Note. The capabilities of the human psyche are so great that it is difficult to imagine absolute zero in any measurable psychological variable. Absolute stupidity and absolute honesty are concepts rather of everyday psychology. The same applies to the establishment of equal relations: only the metaphor of everyday speech allows for Ivanov to be 2 times (3, 5, 10) smarter than Petrov or vice versa.

Acceptable transformations: all arithmetic operations may be applied to frequency indicators; the unit of measurement in this ratio scale is 1 observation, 1 choice, 1 reaction, etc.

Sometimes in one survey it is necessary to present the results on different scales. We will see this in the next example (Table 14).

Table 14

Correlation of verbal thinking assessment results expressed in different scales

Subject numbers | Interval scores | Rank scores | Nominal scores
Scale type: | Interval | Ordinal | Nominal

According to the form of recording empirical data, measurement scales are divided into: verbal, numerical, graphic.

Verbal scales are a form of recording judgments about the presence (yes - no) or degree of expression (including in the form of polar definitions) of the characteristic being studied (for example, extrovert - introvert, etc.).

In numerical scales, the data measured in the survey are presented using numerical values, which is the most convenient for recording and statistical processing of empirical material.

Graphic scales allow you to clearly display the dynamics of the development of the measured characteristic on the abscissa and ordinate axes and see the trends in its change (Fig. 16).

Fig. 16. Histogram

A histogram is a graph in the form of a sequence of bars, each of which rests on one class interval, and whose height reflects the number of cases, or frequency, in that interval.

Graphical presentation of data can be in the form of a bar or pie chart or a histogram (Fig. 17).

Fig. 17. Bar and pie charts of the probability distribution of classified events

Scale scores are a way of assessing a test result by establishing its place on a special scale. In psychodiagnostics, various forms of scale scores are used, each of which relates an individual result to group data.

One of the most common scoring methods is the percentile. The percentile reflects the percentage of individuals in the range of rank gradations from 1 to 100, where the 50th percentile corresponds to the median (Me). The following formula is used to determine the percentile:

$$P = \frac{f_{cum} + 0.5 f}{N} \cdot 100,$$

where $f_{cum}$ is the cumulative frequency of scores that are lower than the observed score for which the percentile is calculated; f is the frequency of the score being converted; N is the total number of scores (Fig. 18).
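A minimal Python sketch of a percentile calculation from these three quantities. It assumes the common midpoint convention, in which half of the frequency of the observed score is added to the cumulative frequency below it; the function name and the data are hypothetical.

```python
def midpoint_percentile(score, all_scores):
    """Percentile of `score`: (f_cum + 0.5 * f) / N * 100 (midpoint convention, assumed)."""
    n_total = len(all_scores)
    if n_total == 0:
        raise ValueError("the set of scores must not be empty")
    f_cum = sum(1 for s in all_scores if s < score)  # scores below the observed one
    f = sum(1 for s in all_scores if s == score)     # frequency of the observed score
    return 100.0 * (f_cum + 0.5 * f) / n_total


# Hypothetical set of ten scores; 21 occurs twice and five scores lie below it.
scores = [12, 15, 15, 18, 20, 21, 21, 23, 27, 30]
print(midpoint_percentile(21, scores))
```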

Fig. 18. Distribution of results in percentiles

The disadvantage of percentile scales is the unevenness of the units of measurement. With a normal distribution, most test results are grouped in the center of the distribution and scatter as they move towards the edges.

To overcome this shortcoming, test scores are standardized, which makes it possible to compare the results of different subjects using indicators expressed relative to the sample.

Z-scores are the ratio of the difference between the value X and the mean to the standard deviation (Fig. 19).

Fig. 19. Distribution of results in Z-scores

Test results are converted into Z-scores using the formula:

$$Z = \frac{X_i - \bar{X}}{\sigma},$$

where $X_i$ is the individual result of the subject, $\bar{X}$ is the arithmetic mean, and σ is the standard deviation.

The disadvantage of Z-scores is the presence of negative and fractional values (Fig. 20).

T-scores

Fig. 20. Distribution of results in T-scores

T-scores are normally distributed scores with a mean of 50 and a standard deviation of 10. If the distribution of observed scores is normal, the conversion is made using the formula:

$$T = 50 + 10 \cdot \frac{X - M}{\sigma_x},$$

where X is the observed score; M is the mean of the observed scores; $\sigma_x$ is the standard deviation of the observed scores.

If the observed scores do not follow a normal distribution, they are first converted into percentiles, then into Z-scores using a table of the normal distribution, after which the formula T = 10z + 50 is applied (Table 15).
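The three-step conversion for non-normal data can be sketched in Python, with `statistics.NormalDist.inv_cdf` standing in for the table of the normal distribution. The percentile step uses the midpoint convention (an assumption), and the skewed sample is invented; the score being converted must occur in the sample so that the proportion stays strictly between 0 and 1.

```python
from statistics import NormalDist


def normalized_t_score(score, sample):
    """Percentile -> Z-score from the normal distribution -> T = 10z + 50."""
    below = sum(1 for s in sample if s < score)
    equal = sum(1 for s in sample if s == score)
    if equal == 0:
        raise ValueError("the score must occur in the sample")
    proportion = (below + 0.5 * equal) / len(sample)  # percentile as a proportion
    z = NormalDist().inv_cdf(proportion)              # replaces the table lookup
    return 10 * z + 50


# Hypothetical, strongly skewed sample of observed scores.
sample = [1, 1, 2, 2, 3, 4, 8, 15, 30]
for x in (1, 3, 30):
    print(x, round(normalized_t_score(x, sample), 1))
```

The median score maps to T = 50 regardless of how skewed the raw scores are, which is the purpose of the normalizing step.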

Table 15

Relationship between percentiles, Z-scores and T-scores

Percentile | T-score | Percentile | T-score

The results of the subjects can be expressed in stens (Fig. 21).

Fig. 21. Sten scale

The sten scale is used to standardize psychological indicators that have a small number of qualitatively distinguishable gradations.

Stens are units on a ten-point scale with a mean of 5.5 and a standard deviation of 2. To convert absolute scores into stens, the formula is used:

$$S_i = \sigma_c \cdot \frac{X_i - M_x}{\sigma_x} + M_c,$$

where $\sigma_c$ is the standard deviation of the sten scale, equal to 2; $\sigma_x$ is the standard deviation of the method's scores in the standardization sample; $X_i$ is the current value of the score; $M_x$ is the mean of the method's scores in the standardization sample; $M_c$ is the mean of the sten scale, equal to 5.5.
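A worked instance of this conversion in Python, with the scale constants 5.5 and 2 taken from the text and an invented standardization sample (mean 30, standard deviation 6). The function name is illustrative, and the result is left unrounded because the text does not specify a rounding rule.

```python
STEN_MEAN = 5.5  # mean of the ten-point scale
STEN_SD = 2.0    # standard deviation of the ten-point scale


def raw_to_sten(x, sample_mean, sample_sd):
    """Rescale a raw score: scale SD * (x - sample mean) / sample SD + scale mean."""
    if sample_sd <= 0:
        raise ValueError("sample standard deviation must be positive")
    return STEN_SD * (x - sample_mean) / sample_sd + STEN_MEAN


# A raw score one standard deviation above the hypothetical sample mean.
print(raw_to_sten(36, sample_mean=30, sample_sd=6))
```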

The stanine scale is a generally accepted transformation of scores in which scores take values from 1 to 9, the mean is 5.0, and the standard deviation is σ = 2.0 (Fig. 22).

Fig. 22. Stanine scale

The subjects are ranked in ascending order of results, and groups are formed from them with a number of individuals corresponding to certain frequencies of assessments in the normal distribution of test results.

Primary indicators are converted into stanines by ordering their numerical values in accordance with the normal-curve percentage distribution of primary scores given in Table 16.

Table 16. Conversion to stanines

Percentage of respondents in the standardization sample

The lowest and highest results are assigned stanines 1 and 9, respectively.
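The ranking procedure can be sketched as below. The percentages 4, 7, 12, 17, 20, 17, 12, 7, 4 are the commonly published stanine frequencies and are used here as an assumption; the function name and the mid-rank handling of ties are likewise illustrative choices.

```python
from bisect import bisect_left

# Commonly published stanine percentages for stanines 1..9 (assumption)
STANINE_PERCENT = [4, 7, 12, 17, 20, 17, 12, 7, 4]

def stanines(raw_scores):
    """Assign a stanine (1..9) to each score from its mid-rank percentile."""
    n = len(raw_scores)
    cumulative = []
    total = 0
    for p in STANINE_PERCENT:
        total += p
        cumulative.append(total)  # 4, 11, 23, 40, 60, 77, 89, 96, 100
    result = []
    for x in raw_scores:
        below = sum(1 for y in raw_scores if y < x)
        equal = sum(1 for y in raw_scores if y == x)
        pct = 100 * (below + 0.5 * equal) / n  # strictly between 0 and 100
        # First cumulative bound that is >= pct gives the stanine index.
        result.append(bisect_left(cumulative, pct) + 1)
    return result

sample = list(range(1, 101))  # hypothetical sample of 100 distinct scores
assigned = stanines(sample)
print([assigned.count(k) for k in range(1, 10)])
```

With 100 distinct scores the group sizes reproduce the percentages themselves, so the four lowest results fall into stanine 1 and the four highest into stanine 9.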

To compare results of measuring the same psychological indicator (trait) after they have been brought to a unified measurement scale (for example, the sten scale), O.P. Eliseev proposed a formula for recalculating the results and displaying them on a single 20-80 scale.

Test results are recalculated onto the 20-80 scale for each studied parameter separately using the following formula:

$$\text{Score} = \frac{\text{Raw points} \times 60}{\text{Maximum}} + 20$$

where Raw points is the number of points received for each subtest separately and for the test as a whole; 60 is the visible range of the 20-80 scale; Maximum is the maximum possible number of points that the test taker can score (for each subtest and for the test as a whole); and 20 is the invisible range of the 20-80 scale (Fig. 23).
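A small sketch of the recalculation, assuming the formula has the form raw points × 60 / maximum + 20, which is what the description of the constants 60 and 20 suggests. The function name and the sample values are hypothetical.

```python
def scale_20_80(raw_points, maximum):
    """Map raw points in the range 0..maximum onto the 20-80 scale."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    if not 0 <= raw_points <= maximum:
        raise ValueError("raw_points must lie between 0 and maximum")
    return raw_points * 60 / maximum + 20

# Hypothetical subtest with a maximum of 40 points
for raw in (0, 20, 40):
    print(raw, scale_20_80(raw, 40))
```

Zero raw points map to 20 and the maximum maps to 80, so results from subtests of different length land on the same range.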

Fig. 23. Histogram of the results of the “SHTUR” test

These are the basic psychometric statistical processing procedures that allow us to obtain additional information about the characteristics and trends of the survey results.

The information obtained when testing a student is his primary (“raw”) scores. They are clear and simple, but depend heavily on, for example, the difficulty of the tasks. A more objective scale for assessing students' preparedness is needed, one that confirms the level of training on different tests with a predetermined level of task difficulty.

You should also get rid of the non-linearity of the primary scores in relation to the level of preparedness.

Example. The school grading scale only allows us to conclude that student Ivanov studies better than student Petrov. How large are the differences in their achievements, efforts and so on? This scale does not answer such questions. Likewise, raw scores only rank test takers.

In such ordinal scales, the main statistics are median, quantiles and rank correlation.
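The statistics named here can be computed with the Python standard library plus a short rank routine. This is a sketch: the grade lists are hypothetical, and Spearman's coefficient is chosen as one common rank correlation, which the text does not name explicitly.

```python
from statistics import median, quantiles

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant data")
    return cov / (vx * vy) ** 0.5

math_grades = [2, 3, 3, 4, 4, 4, 5, 5]     # hypothetical school grades
physics_grades = [3, 3, 2, 4, 5, 4, 5, 4]  # hypothetical school grades
print(median(math_grades))
print(quantiles(math_grades, n=4))  # quartile cut points
print(spearman(math_grades, physics_grades))
```

All three statistics depend only on the order of the grades, not on the distances between them, which is why they are appropriate for ordinal data.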

Positioning subjects on the numerical axis according to the test results is carried out in different ways. Therefore, different types of rating scales are used, such as the following.

Final rating scale – a scale determined by the minimum and maximum scores (points); it is a linear transformation of the segment from the minimum to the maximum score, for example a 100-point scale.

Standard scale – a scale introduced on the assumption that scores follow the normal distribution law; for example, translation into a normative scale assumes that the knowledge of subjects in a random sample is normally distributed, so equal segments under the normal distribution curve correspond to equal numbers of correct answers.

Ordinal, qualitative scale – a scale that introduces order relations on the set of scaled objects or systems and admits all transformations that do not violate this order; for example, the grading scale in secondary school is 2, 3, 4, 5, and in higher education it is “unsatisfactory”, “satisfactory”, “good”, “excellent”.

Nominal scale (naming scale) – a scale used by experts when classifying empirical objects of measurement. It is used when a pedagogical measurement groups students without establishing the order of the groups; for example, dividing students into those who passed and those who did not pass the test.

Example. If a test taker receives 1 (0) for a correct (incorrect) answer to a task, then the test results are presented on a nominal scale.

Interval scale – a scale in which only linear transformations are allowed and in which it is often impossible to fix a natural beginning, end, or unit of measurement (gradation); for example, the Fahrenheit and Celsius temperature scales are related by C = 5/9 (F – 32), where C is the temperature in degrees Celsius and F is the temperature in degrees Fahrenheit.
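A short numerical check illustrates what a linear transformation preserves on an interval scale. The temperatures are hypothetical; the point of the sketch is that ratios of differences survive the conversion, while ratios of the values themselves do not.

```python
def fahrenheit_to_celsius(f):
    """C = 5/9 * (F - 32), written so that the sample values stay exact."""
    return (f - 32) * 5 / 9

f = [50.0, 68.0, 104.0]                      # hypothetical readings in Fahrenheit
c = [fahrenheit_to_celsius(v) for v in f]    # the same readings in Celsius

# Ratio of differences is the same on both scales
print((f[2] - f[1]) / (f[1] - f[0]), (c[2] - c[1]) / (c[1] - c[0]))

# Ratio of values is not: "twice as hot" has no meaning on an interval scale
print(f[2] / f[0], c[2] / c[0])
```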

The interval scale is a quantitative scale for ordering data (objects) according to the relations of equivalence, order and additivity. It defines a metric (origin, unit of measurement and the concept of distance between data and objects), which solves the problem of comparing test results.

Qualitative scales have low measurement accuracy, while quantitative ones have higher objectivity.

The structure of measurement types and levels is shown in Fig. 6.1.


Fig. 6.1.

The logit scale, often used in testology, is usually translated into a test score scale.

Example. If a Unified State Exam (USE) participant has not completed a single task and received 0 primary points, he receives zero test points; if he has completed all tasks and received the highest possible primary score, he receives 100 test points. The test scores of the remaining USE participants are calculated using a linear transformation that maps the segment of the logit scale bounded by the logit score corresponding to one primary point and the logit score corresponding to a primary score one less than the maximum onto the segment of the test score scale from 6 to 94 inclusive. For example, the formula for converting the logit scale to the test score scale may look like:

$$T = \left[ 6 + 88 \cdot \frac{x - x_{\min}}{x_{\max} - x_{\min}} \right]$$

where $T$ is the test score, $x$ is the estimate of the participant's level of preparedness in logits, $x_{\min}$ is the logit score corresponding to one primary point, $x_{\max}$ is the logit score corresponding to a primary score one less than the maximum possible, and $[x]$ is the integer part of $x$.
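The rule described in this example can be sketched in code. This is an assumption-laden illustration, not the official conversion: it takes the linear map that sends x_min to 6 and x_max to 94 and then applies the integer part, without any rounding offset; the function names and the logit values are hypothetical.

```python
import math

def logit_to_test_score(x, x_min, x_max):
    """Map a logit estimate from [x_min, x_max] onto 6..94, integer part."""
    if x_max <= x_min:
        raise ValueError("x_max must be greater than x_min")
    scaled = 6 + 88 * (x - x_min) / (x_max - x_min)
    return math.floor(scaled)

def use_test_score(primary, primary_max, x, x_min, x_max):
    """Apply the special cases for 0 and maximum primary points first."""
    if primary == 0:
        return 0
    if primary == primary_max:
        return 100
    return logit_to_test_score(x, x_min, x_max)

# Hypothetical logit bounds: -3.0 for one primary point, 3.0 for maximum minus one
print(use_test_score(20, 40, 0.0, -3.0, 3.0))
```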

In normatively oriented tests, the task is to determine the rating of test takers in a group. This place, naturally, depends on the “background” - the group. Standards are used that reflect the test results for a representative sample of subjects.

Example. Typically, for a high-quality test of this kind, about 70% of the results lie in the center of the distribution ("under the bell" of the distribution curve) and have a small measurement error, while approximately 5% (the weakest and strongest results) lie in the flat parts of the distribution curve and can have a very large measurement error. In professional testing, these tails or parts of them are discarded during processing.

In criterion-based tests, the task is to compare the educational achievements of each test subject with the amount of knowledge (abilities, skills) planned for acquisition. This depends more on the specific content of the State Standards (program) being tested.

To eliminate the dependence of the interpretation of the test result on the results in the group of test participants, empirically established norms for test performance are used, with which the primary scores of a particular test taker are compared. This is the process of standardizing a test, for example by the mean and standard deviation of individual scores.

Commonly used raw score conversions:

  • percentile, reflecting the percentage of subjects from the normative group whose results are not higher than a given value of the primary score;
  • Z-score, linear assessment - the ratio of the individual deviation of test scores to the standard deviation of the scores of the entire group of subjects, as well as linear transformations of the Z-score (T-scale, etc.);
  • stanine and sten scales (Cattell scale), obtained by dividing the primary score scale into a number of intervals.

Percentiles establish the rank of a subject’s indicator in the normative group, showing the percentage of subjects in the normative sample who have results no higher than these primary scores. The percentile scale is non-linear (the response to a one-point change in the raw score scale is non-linear), so it may even distort the real situation.
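The definition used here (percentage of the normative group with results not higher than the given score) translates directly into code. The function name and the normative scores are hypothetical.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the normative group whose results are not higher than score."""
    if not norm_scores:
        raise ValueError("normative sample is empty")
    not_higher = sum(1 for s in norm_scores if s <= score)
    return 100 * not_higher / len(norm_scores)

norm_group = [10, 12, 12, 15, 18, 20, 21, 25, 27, 30]  # hypothetical normative sample
for raw in (12, 18, 19, 30):
    print(raw, percentile_rank(raw, norm_group))
```

In this sample a one-point step from 18 to 19 leaves the percentile unchanged, while other one-point steps move it by 10 or more, which illustrates the non-linearity mentioned above.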

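The so-called Z-scale translates individual results into a standard scale, which is characterized by two main general parameters: the average score and the variance. The Z-score of the i-th test taker is found using the formula:

$$z_i = \frac{x_i - \bar{x}}{\sigma_x}$$

where $x_i$ is the subject's primary score, $\bar{x}$ is the mean score of the group, and $\sigma_x$ is the standard deviation of the group's scores.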

Psychodiagnostics: lecture notes Alexey Sergeevich Luchinin

2. Scale ratings


Scale ratings – a method of assessing a test result by establishing its place on a special scale. The scale contains data on intragroup norms for performing the given technique in the standardization sample. Thus, individual results of completing tasks (primary scores of subjects) are compared with data from a comparable normative group (for example, the result achieved by a student is compared with the indicators of children of the same age or year of study; the result of a study of the general abilities of an adult is compared with statistically processed indicators of a representative sample of individuals within specified age limits).

Scale scores in this sense have a clearly defined quantitative content and can be used in statistical analysis. One of the most common ways of assessing a test result in psychological diagnostics by relating it to group data is the calculation of percentiles.

A percentile is the percentage of individuals in the standardization sample whose results are lower than a given primary indicator. The percentile scale can be regarded as a set of rank gradations (see rank correlation) with 100 ranks, starting from the 1st rank, which corresponds to the lowest result. The 50th percentile (P50) corresponds to the median (see measures of central tendency) of the distribution of results; P > 50 and P < 50 represent ranks above and below the median level, respectively.

Percentiles should not be confused with ordinary percentages. The latter represent the proportion of correct solutions out of the total number of test items in an individual result (see primary scores). Ranks P1 and P100 are assigned, respectively, to the lowest and highest results observed in the sample; however, these ranks may correspond to results far from zero (not a single correct solution) or from the absolute maximum (all solutions correct). For example, with a total of 120 tasks, the minimum result corresponding to rank P1 may be 6 correct solutions, while the maximum result corresponding to rank P100 may be 95 correctly solved tasks. This situation occurs, for example, when evaluating speed tests.

The main disadvantage of percentile scales is the unevenness of their units of measurement. In a normal distribution, individual values are tightly grouped in the center of the distribution and become sparser toward the edges. Therefore, equal frequencies of cases near the center correspond to shorter intervals along the x-axis than at the edges of the distribution. Percentiles show the relative position of each subject in the normative sample, but not the magnitude of the differences between results. This creates some inconvenience in interpreting individual results. Thus, the difference in primary indicators corresponding to the interval P70–P80 can be 10 points, while the difference in the number of correct solutions in the interval P50–P60 can be only 1–3 points.
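The unevenness can be made concrete for an ideal normal distribution. The sketch below measures, in standard deviation units, how wide a band of percentile ranks is at different places on the curve; the function name and the chosen bands are illustrative, and real test data will deviate from these idealized widths.

```python
from statistics import NormalDist

def z_width(p_low, p_high):
    """Width, in SD units, of the interval between two percentile ranks."""
    nd = NormalDist()  # standard normal distribution
    return nd.inv_cdf(p_high / 100) - nd.inv_cdf(p_low / 100)

for low, high in [(50, 60), (70, 80), (89, 99)]:
    print(f"P{low}-P{high}: {z_width(low, high):.2f} SD")
```

Each band covers the same ten percent of cases, yet the band near the tail spans several times more of the score axis than the band next to the median.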

At the same time, percentile scores also have a number of advantages. They are easily understood by users of psychodiagnostic information, are universal in relation to various types of techniques and are easy to calculate.

Percentile scores are not typical scale scores. Standard indicators are more widely used in psychodiagnostics; they are calculated by linear or nonlinear transformation of primary indicators distributed according to a normal or near-normal law. This calculation involves a z-transformation of scores (see standardization, normal distribution). To determine the z-standard indicator, the difference between the individual primary result and the mean of the normative group is found and then divided by the σ of the normative sample. The z-scale obtained in this way has a midpoint M = 0; negative values indicate results below the mean and decrease with distance from zero, while positive values indicate results above the mean. The unit of measurement in the z-scale is equal to 1σ of the standard (unit) normal distribution.

To transform the distribution of primary normative results obtained during standardization into a standard z-scale, it is necessary to examine the nature of the empirical distribution and the degree of its agreement with the normal one. Since in most cases the values in the distribution fit within M ± 3σ, the units of the simple z-scale are too large. For ease of estimation, a further transformation based on z = (x – x̄) / σ is used. An example of such a scale is the scoring of the SAT (CEEB) test battery for assessing learning ability (see achievement tests). This z-scale is recalculated so that the midpoint is 500 and σ = 100. Another similar example is the Wechsler scale for individual subtests (see Wechsler intelligence scale), where M = 10, σ = 3.
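A worked instance of this rescaling may help. The symbols $Z'$, $M'$ and $\sigma'$ for the target scale are introduced here only for illustration, and the raw score, mean and standard deviation are hypothetical numbers.

```latex
% general form: standardize, then move to the target scale
z = \frac{x - \bar{x}}{\sigma}, \qquad Z' = M' + \sigma' z

% hypothetical data: x = 30, \bar{x} = 24, \sigma = 4
z = \frac{30 - 24}{4} = 1.5

% CEEB-type scale (M' = 500, \sigma' = 100)
Z' = 500 + 100 \cdot 1.5 = 650

% Wechsler subtest scale (M' = 10, \sigma' = 3)
Z' = 10 + 3 \cdot 1.5 = 14.5
```

The same z-value thus yields whole, positive numbers on the rescaled scales, which is the convenience the transformation is meant to provide.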

Along with determining the place of an individual result in the standard distribution of group data, the introduction of scale scores also serves another important goal: ensuring the comparability of quantitative results of different tests expressed in standard scales, the possibility of their joint interpretation, and the reduction of scores to a single system.

If both distributions of scores in the compared methods are close to normal, the issue of comparability is resolved quite simply (in any normal distribution, the intervals M ± nσ correspond to the same frequency of cases). To ensure comparability of results belonging to distributions of a different form, nonlinear transformations are applied, which give the distribution the shape of a given theoretical curve. The normal distribution is usually used as such a curve. As with the simple z-transform, normalized standard scores can be given any desired form. For example, multiplying such a normalized standard indicator by 10 and adding the constant 50 yields the T-score (see standardization, Minnesota Multiphasic Personality Inventory).

An example of a nonlinear conversion to a standard scale is the stanine scale (from the English standard nine), where scores take values from 1 to 9, M = 5, σ = 2.

The stanine scale is becoming increasingly widespread, since it combines the advantages of standard scale indicators with the simplicity of percentiles. Primary indicators are easily converted into stanines. To do this, subjects are ranked in ascending order of results and divided into groups whose sizes are proportional to certain frequencies of scores in the normal distribution of test results (Table 14).

Table 14. Translation of primary test results into the stanine scale

When scores are transformed into the sten scale (from the English standard ten), a similar procedure is carried out, the only difference being that this scale is based on ten standard intervals. Suppose there are 200 people in the standardization sample; then the 8 subjects (4%) with the lowest scores and the 8 with the highest scores will be assigned to stanines 1 and 9, respectively. The procedure continues until all scale intervals are filled. The test scores corresponding to the percentage gradations are thus ordered into a scale corresponding to the standard frequency distribution of results.
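The group frequencies for both scales can be derived from the scale parameters rather than looked up. The sketch below assumes that each scale value covers a unit-wide band centered on an integer (with open-ended bands at both extremes) and that scores are normally distributed with the stated mean and standard deviation; the function name is hypothetical.

```python
from statistics import NormalDist

def band_percentages(n_bands, mean, sd):
    """Percent of a normal distribution falling in each band 1..n_bands."""
    nd = NormalDist(mean, sd)
    # Cut points between neighbouring bands lie at 1.5, 2.5, ..., n_bands - 0.5
    cuts = [k + 0.5 for k in range(1, n_bands)]
    cdf = [0.0] + [nd.cdf(c) for c in cuts] + [1.0]
    return [100 * (upper - lower) for lower, upper in zip(cdf, cdf[1:])]

stanine_pct = band_percentages(9, 5.0, 2.0)   # stanines: mean 5, SD 2
sten_pct = band_percentages(10, 5.5, 2.0)     # stens: mean 5.5, SD 2

print([round(p) for p in stanine_pct])
print([round(p, 1) for p in sten_pct])
print(round(200 * stanine_pct[0] / 100))      # size of stanine 1 in a sample of 200
```

Under these assumptions the extreme stanines each hold about 4% of the sample, which agrees with the 8 subjects out of 200 in the example above.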

One of the most common forms of scale ratings in intelligence tests is the standard IQ score (M = 100, σ = 16). These parameters were chosen as the reference for the standard rating scale in psychodiagnostics. There are quite a few scales that rely on standardization, and their scores are easily converted into one another. Scaling is, in principle, acceptable and desirable for a wide range of techniques used for diagnostic and research purposes, including techniques whose results are expressed in qualitative indicators. In this case, standardization can use the translation of nominal scales into rank scales (see measurement scales) or the development of a differentiated system of quantitative primary scores.
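Converting a score from one standard scale to another amounts to passing through the z-value. The sketch below uses the means and standard deviations quoted in this text; the dictionary and function names are hypothetical, and the conversion is meaningful only when both scales refer to the same normative group.

```python
# (mean, standard deviation) of each scale, as quoted in the text
SCALES = {
    "z": (0.0, 1.0),
    "T": (50.0, 10.0),
    "IQ": (100.0, 16.0),
    "sten": (5.5, 2.0),
    "stanine": (5.0, 2.0),
}

def convert(score, source, target):
    """Re-express a score from one standard scale on another via its z-value."""
    m_source, sd_source = SCALES[source]
    m_target, sd_target = SCALES[target]
    z = (score - m_source) / sd_source
    return m_target + sd_target * z

print(convert(60, "T", "IQ"))     # one SD above the mean on both scales
print(convert(116, "IQ", "sten"))
```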

It should be noted that, despite their simplicity and clarity, scale indicators are statistical characteristics that only allow one to indicate the place of a given result in a sample of many measurements of a similar nature. A scale score, even for a traditional psychometric instrument, is only one form of expression of test scores used in interpreting survey results. In this case, a quantitative analysis should always be carried out in conjunction with a multilateral qualitative study of the reasons for the occurrence of a given test result, taking into account both a complex of information about the personality of the subject and data on the current conditions of the examination, the reliability and validity of the methodology. Exaggerated ideas about the possibility of valid conclusions based only on quantitative estimates led to many erroneous ideas in the theory and practice of psychological diagnostics.
