What is an interval data series? Construction of interval variation series for continuous quantitative data

Mathematical statistics is a branch of mathematics devoted to mathematical methods for processing, systematizing, and using statistical data to draw scientific and practical conclusions.

3.1. BASIC CONCEPTS OF MATHEMATICAL STATISTICS

In medical and biological problems, it is often necessary to study the distribution of a particular characteristic over a very large number of individuals. The characteristic takes different values for different individuals, so it is a random variable. For example, any therapeutic drug has different effectiveness when applied to different patients. However, to get an idea of the effectiveness of the drug, there is no need to give it to every patient. It is enough to trace the results of using the drug in a relatively small group of patients and, based on the data obtained, identify the essential features (efficacy, contraindications) of the treatment process.

A population (general population) is a set of homogeneous elements characterized by some attribute to be studied. This attribute is a continuous random variable with distribution density f(x).

For example, if we are interested in the prevalence of a disease in a certain region, then the general population is the entire population of the region. If we want to find out the susceptibility of men and women to this disease separately, then we should consider two general populations.

To study the properties of a general population, a certain part of its elements is selected.

A sample is the part of the general population selected for examination (treatment).

When this does not cause confusion, the term sample is used both for the set of objects selected for the survey and for the set of values of the studied characteristic obtained during the examination. These values can be represented in several ways.

A simple statistical series is the list of values of the characteristic being studied, recorded in the order in which they were obtained.

An example of a simple statistical series, obtained by measuring the surface wave velocity (m/s) in the skin of the forehead in 20 patients, is given in Table 3.1.

Table 3.1. Simple statistical series

A simple statistical series is the main and most complete way of recording survey results. It can contain hundreds of elements, which makes it very difficult to take in at a glance. Therefore, large samples are usually divided into groups. To do this, the range of the characteristic is divided into several (N) intervals of equal width, and the relative frequencies ($n_i/n$) with which the characteristic falls into these intervals are calculated. The width of each interval is:

$d = (x_{\max} - x_{\min}) / N$

The interval boundaries have the following values:

$x_i = x_{\min} + d \cdot i, \quad i = 0, 1, \ldots, N$

If a sample element lies on the boundary between two adjacent intervals, it is assigned to the interval for which it is the left boundary. Data grouped in this way are called an interval statistical series.

An interval statistical series is a table that shows intervals of values of the characteristic and the relative frequencies with which the characteristic occurs within these intervals.

In our case we can form, for example, the following interval statistical series (N = 5, d = 4), shown in Table 3.2.

Table 3.2. Interval statistical series

Here, the interval 28-32 includes the two values equal to 28 (Table 3.1), and the interval 32-36 includes the values 32, 33, 34 and 35.
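The grouping procedure can be sketched in Python. This is only an illustration under stated assumptions: the function name `interval_series` and the sample values are hypothetical (they are not the data of Table 3.1), a boundary value is placed in the interval that starts at it, as in the example with the value 32, and the maximum is kept in the last interval.

```python
def interval_series(values, n_intervals):
    """Group values into equal-width intervals.

    Returns a list of (left, right, relative_frequency) rows.
    """
    x_min, x_max = min(values), max(values)
    width = (x_max - x_min) / n_intervals
    counts = [0] * n_intervals
    for x in values:
        # A value lying on a boundary goes to the interval that starts at it;
        # the largest value is kept in the last interval (assumption).
        k = min(int((x - x_min) / width), n_intervals - 1)
        counts[k] += 1
    n = len(values)
    return [(x_min + k * width, x_min + (k + 1) * width, counts[k] / n)
            for k in range(n_intervals)]


# Hypothetical sample of 20 measurements (NOT the data of Table 3.1).
sample = [28, 28, 30, 32, 33, 34, 35, 36, 37, 38,
          38, 39, 40, 41, 42, 43, 44, 45, 46, 48]

series = interval_series(sample, n_intervals=5)
for left, right, rel_freq in series:
    print(f"{left:g}-{right:g}: {rel_freq:.2f}")
```

The relative frequencies returned here are exactly the heights of the rectangles that a histogram of the same series would use.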

An interval statistical series can be depicted graphically. To do this, the intervals of values of the characteristic are plotted along the abscissa axis, and on each of them, as on a base, a rectangle is built with a height equal to the relative frequency. The resulting bar chart is called a histogram.

Fig. 3.1. Histogram

In the histogram, the statistical patterns of the distribution of the characteristic are clearly visible.

With a large sample size (several thousand) and narrow columns, the shape of the histogram is close to the graph of the distribution density of the characteristic.

The number of histogram columns can be selected using the following formula:

Constructing a histogram manually is a long process. Therefore, computer programs have been developed to automatically construct them.

3.2. NUMERIC CHARACTERISTICS OF STATISTICAL SERIES

Many statistical procedures use sample estimates of the population expectation and variance (or standard deviation).

The sample mean ($\bar{X}$) is the arithmetic mean of all elements of a simple statistical series:

$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$

For our example, $\bar{X} = 37.05$ (m/s).

The sample mean is the best estimate of the general mean M.

The sample variance $s^2$ is equal to the sum of squared deviations of the elements from the sample mean, divided by n - 1:

$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2 \quad (3.3)$

In our example, $s^2 = 25.2$ (m/s)².

Note that when calculating the sample variance, the denominator of the formula is not the sample size n but n - 1. This is because the deviations in formula (3.3) are measured from the sample mean, which is only an estimate, rather than from the unknown mathematical expectation.

The sample variance is the best estimate of the general variance ($\sigma^2$).

The sample standard deviation (s) is the square root of the sample variance:

$s = \sqrt{s^2}$

For our example, s = 5.02 (m/s).

The sample standard deviation is the best estimate of the general standard deviation (σ).

With an unlimited increase in sample size, all sample characteristics tend to the corresponding characteristics of the general population.

Sample characteristics are usually calculated by computer. In Excel, these calculations are performed by the statistical functions AVERAGE, VAR, and STDEV.
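The same quantities can also be computed in a few lines of Python. The sketch below is illustrative: the helper name `sample_characteristics` and the data are hypothetical, and the result is cross-checked against the standard library `statistics` module, which also divides by n - 1.

```python
import math
import statistics


def sample_characteristics(values):
    """Return (sample mean, sample variance with n - 1, sample standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    # Sum of squared deviations from the sample mean, divided by n - 1.
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return mean, variance, math.sqrt(variance)


# Hypothetical data, used only to illustrate the formulas.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean, var, std = sample_characteristics(data)

# Cross-check against the standard library implementations.
assert math.isclose(var, statistics.variance(data))
assert math.isclose(std, statistics.stdev(data))
print(mean, var, std)
```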

3.3. INTERVAL ESTIMATION

All sample characteristics are random variables. This means that for another sample of the same size, the values of the sample characteristics will be different. Thus, sample characteristics are only estimates of the corresponding characteristics of the population.

The shortcomings of sample estimation are compensated for by interval estimation, which gives a numeric interval that contains the true value of the estimated parameter with a given probability $P_d$.

Let $U_r$ be some parameter of the general population (general mean, general variance, etc.).

An interval estimate of the parameter $U_r$ is an interval $(U_1, U_2)$ satisfying the condition:

$P(U_1 < U_r < U_2) = P_d. \quad (3.5)$

The probability $P_d$ is called the confidence probability.

The confidence probability $P_d$ is the probability that the true value of the estimated quantity lies inside the specified interval.

The interval $(U_1, U_2)$ is then called the confidence interval for the parameter being estimated.

Often, instead of the confidence probability, the associated value $\alpha = 1 - P_d$ is used, which is called the significance level.

The significance level is the probability that the true value of the estimated parameter lies outside the confidence interval.

Sometimes α and $P_d$ are expressed as percentages, for example, 5% instead of 0.05 and 95% instead of 0.95.

In interval estimation, one first selects an appropriate confidence probability (usually 0.95 or 0.99) and then finds the corresponding range of values for the parameter being estimated.

Let us note some general properties of interval estimates.

1. The lower the significance level (the greater $P_d$), the wider the interval estimate. For example, if at a significance level of 0.05 the interval estimate of the general mean is 34.7 < M < 39.4, then at the 0.01 level it is considerably wider: 33.85 < M < 40.25.

2. The larger the sample size n, the narrower the interval estimate at the chosen significance level. Suppose, for example, that the 5% estimate of the general mean (α = 0.05), obtained from a sample of 20 elements, is 34.7 < M < 39.4.

By increasing the sample size to 80, we get a more accurate estimate at the same significance level: 35.5 < M < 38.6.

In general, constructing reliable confidence estimates requires knowing the law by which the estimated random characteristic is distributed in the population. Let us look at how an interval estimate of the general mean is constructed for a characteristic that is normally distributed in the population.

3.4. INTERVAL ESTIMATION OF THE GENERAL AVERAGE FOR THE NORMAL DISTRIBUTION LAW

The construction of an interval estimate of the general mean M for a normally distributed population is based on the following property. For a sample of size n, the ratio

$\frac{\bar{X} - M}{s / \sqrt{n}}$

obeys the Student distribution with the number of degrees of freedom ν = n - 1.

Here $\bar{X}$ is the sample mean and s is the sample standard deviation.

Using Student distribution tables or their computer equivalent, one can find a boundary value t such that, with a given confidence probability, the following inequality holds:

$\left| \frac{\bar{X} - M}{s / \sqrt{n}} \right| < t$

This inequality corresponds to the following inequality for M:

$\bar{X} - \varepsilon < M < \bar{X} + \varepsilon$

where $\varepsilon = t \cdot s / \sqrt{n}$ is the half-width of the confidence interval.

Thus, the construction of a confidence interval for M is carried out in the following sequence.

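1. Select a confidence probability $P_d$ (usually 0.95 or 0.99) and, using the Student distribution table, find the corresponding parameter t.

2. Calculate the half-width of the confidence interval ε:

$\varepsilon = t \cdot \frac{s}{\sqrt{n}} \quad (3.6)$

3. Obtain the interval estimate of the general mean with the selected confidence probability:

$\bar{X} - \varepsilon < M < \bar{X} + \varepsilon$

In brief, this is written as $M = \bar{X} \pm \varepsilon$, with the value of $P_d$ stated alongside.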

Computer procedures have been developed to find interval estimates.

Let us explain how to use the Student distribution table. The table has two inputs: the left column gives the number of degrees of freedom ν = n - 1, and the top row gives the significance level α. The Student coefficient t is found at the intersection of the corresponding row and column.

Let's apply this method to our sample. A fragment of the Student distribution table is presented below.

Table 3.3. Fragment of the Student distribution table

A simple statistical series for a sample of 20 people (n = 20, ν = 19) is presented in Table 3.1. For this series, calculations using formulas (3.1-3.3) give $\bar{X} = 37.05$ and s = 5.02.

Let us choose α = 0.05 ($P_d = 0.95$). At the intersection of row "19" and column "0.05" we find t = 2.09.

Let us calculate the accuracy of the estimate using formula (3.6): $\varepsilon = 2.09 \cdot 5.02 / \sqrt{20} = 2.34$.

Let us construct the interval estimate: with a probability of 95%, the unknown general mean satisfies the inequality

37.05 - 2.34 < M < 37.05 + 2.34, or M = 37.05 ± 2.34 (m/s), $P_d = 0.95$.
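The three steps of this section can be reproduced with a short Python sketch. The function name `confidence_interval` is hypothetical; the inputs are the values quoted in the text (sample mean 37.05, s = 5.02, n = 20) and the Student coefficient t = 2.09 is taken from the table rather than computed.

```python
import math


def confidence_interval(mean, std, n, t):
    """Return (lower, upper, half_width) of the interval estimate of the general mean."""
    # Half-width of the confidence interval, formula (3.6).
    eps = t * std / math.sqrt(n)
    return mean - eps, mean + eps, eps


# Values from the worked example; t is the tabulated Student coefficient
# for 19 degrees of freedom and significance level 0.05.
low, high, eps = confidence_interval(mean=37.05, std=5.02, n=20, t=2.09)
print(f"eps = {eps:.3f}, interval: {low:.2f} < M < {high:.2f}")
```

The half-width comes out at roughly 2.35, which agrees with the value 2.34 used in the text up to rounding, and the resulting bounds round to 34.7 and 39.4.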

3.5. METHODS FOR TESTING STATISTICAL HYPOTHESES

Statistical hypotheses

Before formulating what a statistical hypothesis is, consider the following example.

To compare two methods of treating a certain disease, two groups of 20 patients each were selected and treated using these methods. For each patient, the number of procedures after which a positive effect was achieved was recorded. From these data, the sample means ($\bar{X}$), sample variances ($s^2$), and sample standard deviations (s) were found for each group.

The results are presented in Table 3.4.

Table 3.4

The number of procedures required to obtain a positive effect is a random variable, all information about which is currently contained in the given sample.

Table 3.4 shows that the sample mean in the first group is smaller than in the second. Does this mean that the same relationship holds for the general means: $M_1 < M_2$? Are the statistical data sufficient for such a conclusion? Statistical testing of hypotheses answers these questions.

A statistical hypothesis is an assumption about the properties of populations.

We will consider hypotheses about the properties of two general populations.

If the populations have a known, identical distribution of the quantity being estimated, and the assumptions concern the values of some parameter of this distribution, the hypotheses are called parametric. For example, the samples are drawn from populations with a normal distribution and equal variances, and we need to find out whether the general means of these populations are the same.

If nothing is known about the distribution laws of the general populations, hypotheses about their properties are called nonparametric. For example: are the distribution laws of the general populations from which the samples are drawn the same?

Null and alternative hypotheses.

The task of testing hypotheses. Significance level

Let's get acquainted with the terminology used when testing hypotheses.

$H_0$, the null hypothesis (the skeptic's hypothesis), is the hypothesis that there are no differences between the compared samples. The skeptic believes that the differences between the sample estimates obtained from the research results are random.

$H_1$, the alternative hypothesis (the optimist's hypothesis), is the hypothesis that there are differences between the compared samples. The optimist believes that the differences between the sample estimates are caused by objective reasons and correspond to differences in the general populations.

Testing statistical hypotheses is feasible only when it is possible to construct some quantity (criterion) whose distribution law is known if $H_0$ is true. For this quantity we can then specify a confidence interval into which its value falls with a given probability $P_d$. In this text, this interval is called the critical region. (Many textbooks use the term critical region for the complementary set of values, where $H_0$ is rejected.) If the criterion value falls into the critical region, hypothesis $H_0$ is accepted. Otherwise, hypothesis $H_1$ is accepted.

In medical research, $P_d = 0.95$ or $P_d = 0.99$ is used. These values correspond to the significance levels α = 0.05 and α = 0.01.

When testing statistical hypotheses, the significance level (α) is the probability of rejecting the null hypothesis when it is true.

Note that, at its core, the hypothesis testing procedure is aimed at detecting differences, not at confirming their absence. When the criterion value goes beyond the critical region, we can say to the "skeptic" with a clear conscience: well, what more do you want?! If there were no differences, then with a probability of 95% (or 99%) the calculated value would be within the specified limits. But it is not!

If, however, the value of the criterion falls into the critical region, this still gives no grounds to conclude that hypothesis $H_0$ is correct. It most likely points to one of two possible reasons.

1. Sample sizes are not large enough to detect differences. It is likely that continued experimentation will bring success.

2. There are differences. But they are so small that they have no practical significance. In this case, continuing the experiments does not make sense.

Let's move on to consider some statistical hypotheses used in medical research.

3.6. TESTING HYPOTHESES ABOUT EQUALITY OF VARIANCES, FISHER'S F-CRITERION

In some clinical studies, a positive effect is evidenced not so much by the magnitude of the parameter being studied as by its stabilization, that is, a reduction in its fluctuations. In this case, the question arises of comparing two general variances based on the results of a sample survey. This problem can be solved using Fisher's test.

Formulation of the problem

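Two samples ($X_1$) and ($X_2$) were obtained from general populations with normal distributions. The sample sizes are $n_1$ and $n_2$, and the sample variances are $s_1^2$ and $s_2^2$. It is required to compare the general variances.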

Testable hypotheses:

$H_0$: the general variances are the same;

$H_1$: the general variances are different.

It has been shown that if the samples are drawn from normally distributed populations and hypothesis $H_0$ is true, the ratio of the sample variances follows the Fisher distribution. Therefore, the quantity F, calculated by the following formula, is taken as the criterion for testing $H_0$:

$F = s_1^2 / s_2^2$

where $s_1^2$ and $s_2^2$ are the sample variances.

This ratio obeys the Fisher distribution with $\nu_1 = n_1 - 1$ degrees of freedom for the numerator and $\nu_2 = n_2 - 1$ degrees of freedom for the denominator. The boundaries of the critical region are found using Fisher distribution tables or the Excel function FINV (FРАСПОБР in the Russian-language version).

For the example presented in Table 3.4, we get: $\nu_1 = \nu_2 = 20 - 1 = 19$; F = 2.16/4.05 = 0.53. At α = 0.05, the left and right boundaries of the critical region are 0.40 and 2.53, respectively.

The criterion value falls into the critical region, so hypothesis $H_0$ is accepted: the general variances are the same.
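As a minimal sketch, the decision above can be written in Python. The helper names `f_criterion` and `within_bounds` are hypothetical; the variances 2.16 and 4.05 and the boundaries 0.40 and 2.53 are the values quoted in the text, taken as given rather than computed from the Fisher distribution.

```python
def f_criterion(var1, var2):
    """Ratio of two sample variances, used as the criterion value F."""
    return var1 / var2


def within_bounds(value, left, right):
    """True if the criterion lies between the tabulated boundaries (H0 is accepted)."""
    return left <= value <= right


# Sample variances and tabulated boundaries from the worked example
# (nu1 = nu2 = 19, alpha = 0.05).
F = f_criterion(2.16, 4.05)
h0_accepted = within_bounds(F, left=0.40, right=2.53)
print(f"F = {F:.2f}, H0 accepted: {h0_accepted}")
```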

3.7. TESTING HYPOTHESES REGARDING EQUALITY OF MEANS, STUDENT t-CRITERION

The task of comparing the means of two general populations arises when it is the magnitude of the characteristic being studied that has practical significance, for example, when comparing the duration of treatment with two different methods or the number of complications arising from their use. In this case, Student's t-test can be used.

Formulation of the problem

Two samples ($X_1$) and ($X_2$) were obtained from general populations with normal distributions and identical variances. The sample sizes are $n_1$ and $n_2$, the sample means are $\bar{X}_1$ and $\bar{X}_2$, and the sample variances are $s_1^2$ and $s_2^2$, respectively. It is required to compare the general means.

Testable hypotheses:

$H_0$: the general means are the same;

$H_1$: the general means are different.

It has been shown that if hypothesis $H_0$ is true, the quantity t calculated by the formula

$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\nu_1 s_1^2 + \nu_2 s_2^2}{\nu_1 + \nu_2} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} \quad (3.10)$

is distributed according to Student's law with the number of degrees of freedom $\nu = \nu_1 + \nu_2 = n_1 + n_2 - 2$.

Here $\nu_1 = n_1 - 1$ is the number of degrees of freedom for the first sample and $\nu_2 = n_2 - 1$ is the number of degrees of freedom for the second sample.

The boundaries of the critical region are found using t-distribution tables or the Excel function TINV. The Student distribution is symmetric about zero, so the left and right boundaries of the critical region are equal in magnitude and opposite in sign: $-t_{cr}$ and $+t_{cr}$.

For the example presented in Table 3.4, we get:

$\nu_1 = \nu_2 = 20 - 1 = 19$; ν = 38, t = -2.51. At α = 0.05, $t_{cr} = 2.02$.

The criterion value lies beyond the left boundary of the critical region, so we accept hypothesis $H_1$: the general means are different. Moreover, the general mean of the first population is smaller.
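A hedged Python sketch of the calculation follows. It assumes the usual pooled-variance form of the two-sample t statistic for equal general variances. The function name `pooled_t` is hypothetical, and so are the two sample means: Table 3.4 is not reproduced here, so the means 10.0 and 11.4 were chosen only so that the result is close to the quoted t = -2.51. The variances 2.16 and 4.05 and the boundary 2.02 come from the text.

```python
import math


def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    nu1, nu2 = n1 - 1, n2 - 1
    # Weighted average of the two sample variances.
    pooled_var = (nu1 * var1 + nu2 * var2) / (nu1 + nu2)
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, nu1 + nu2


# Means are HYPOTHETICAL placeholders; variances are those quoted in the text.
t, nu = pooled_t(mean1=10.0, var1=2.16, n1=20, mean2=11.4, var2=4.05, n2=20)

t_boundary = 2.02  # tabulated value for nu = 38, alpha = 0.05 (from the text)
h1_accepted = abs(t) > t_boundary
print(f"t = {t:.2f}, nu = {nu}, H1 accepted: {h1_accepted}")
```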

Applicability of Student's t-test

Student's t-test is applicable only to samples from normal populations with identical general variances. If at least one of these conditions is violated, the applicability of the test is questionable. The requirement of normality of the general population is usually ignored, with reference to the central limit theorem. Indeed, the difference between the sample means in the numerator of (3.10) can be considered normally distributed for ν > 30. But the equality of variances cannot be verified, and references to the fact that the Fisher test did not detect differences cannot be taken into account. Nevertheless, the t-test is widely used to detect differences in population means, although without sufficient justification.

Below we discuss a nonparametric test that is successfully used for the same purposes and requires neither normality nor equality of variances.

3.8. NONPARAMETRIC COMPARISON OF TWO SAMPLES: MANN-WHITNEY CRITERION

Nonparametric tests are designed to detect differences in the distribution laws of two populations. Tests that are sensitive to differences in the general means are called shift tests. Tests that are sensitive to differences in the general variances are called scale tests. The Mann-Whitney test is a shift test and is used to detect differences in the means of two populations whose samples are presented on a rank scale. The measured values are arranged on this scale in ascending order and then numbered with the integers 1, 2, ... These numbers are called ranks. Equal values are assigned equal ranks. What matters is not the value of the characteristic itself but only the ordinal place it occupies among the other values.

In Table 3.5, the first group from Table 3.4 is presented in expanded form (row 1) and ranked (row 2), and then the ranks of identical values are replaced by their arithmetic means. For example, the two items equal to 4 in the first row were given ranks 2 and 3, which were then both replaced with 2.5.

Table 3.5

Formulation of the problem

Independent samples ($X_1$) and ($X_2$) were drawn from general populations with unknown distribution laws. The sample sizes are $n_1$ and $n_2$, respectively. The values of the sample elements are presented on a rank scale. It is required to check whether these general populations differ from each other.

Testable hypotheses:

$H_0$: the samples belong to the same general population; $H_1$: the samples belong to different general populations.

To test such hypotheses, the Mann-Whitney U test is used.

First, a combined sample (X) is compiled from the two samples, and its elements are ranked. Then the sum of the ranks corresponding to the elements of the first sample is found. This sum is the criterion for testing the hypotheses.

U= Sum of ranks of the first sample. (3.11)

For independent samples whose sizes are greater than 20, the quantity U obeys the normal distribution, with mathematical expectation and standard deviation equal to:

$\mu = \frac{n_1 (n_1 + n_2 + 1)}{2}, \quad \sigma = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$

Therefore, the boundaries of the critical region are found from normal distribution tables.

For the example presented in Table 3.4, we get: $\nu_1 = \nu_2 = 20 - 1 = 19$, U = 339, μ = 410, σ = 37. For α = 0.05, the left and right boundaries of the critical region are 338 and 482.

The value of the criterion goes beyond the left border of the critical region, therefore hypothesis $H_1$ is accepted: the general populations have different distribution laws. Moreover, the general mean of the first population is smaller.
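The ranking step and the normal approximation can be sketched in Python as follows. The function names are hypothetical, and the short lists are illustrative data, not the samples of Table 3.4. Tied values receive the arithmetic mean of their ranks, as described for Table 3.5, and the boundaries use the expectation and standard deviation of the rank sum of the first sample.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Ranks 1..n in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum(sample1, sample2):
    """Sum of the ranks of the first sample within the combined sample."""
    ranks = average_ranks(list(sample1) + list(sample2))
    return sum(ranks[:len(sample1)])


def normal_bounds(n1, n2, alpha):
    """Expectation, standard deviation and two-sided boundaries for the rank sum."""
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return mu, sigma, mu - z * sigma, mu + z * sigma


tied = average_ranks([3, 4, 4, 5])          # the two 4s share rank 2.5
small_u = rank_sum([1, 3], [2, 4])          # illustrative data only
mu, sigma, left, right = normal_bounds(20, 20, alpha=0.05)
print(tied, small_u, round(mu), round(sigma), round(left), round(right))
```

For two samples of 20 elements, this reproduces the values μ = 410 and σ = 37 and the boundaries 338 and 482 quoted above, after rounding.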

When constructing an interval distribution series, three questions must be resolved:

  • 1. How many intervals should be taken?
  • 2. What is the length of the intervals?
  • 3. What is the procedure for including population units within the boundaries of intervals?

1. The number of intervals can be determined by Sturges' formula:

$n = 1 + 3.322 \lg N \quad (2.5)$

where N is the number of units in the population.

2. The interval length, or interval step, is usually determined by the formula

$h = R / n$

where R is the range of variation, i.e. the difference between the largest and smallest values of the characteristic.

3. The order of inclusion of population units within the boundaries of the intervals may differ, but when constructing an interval distribution series it must be strictly defined.

For example: [ ), where population units equal to the lower boundary are included in the interval, while units equal to the upper boundary are not included and are transferred to the next interval. The exception to this rule is the last interval, whose upper boundary includes the last number of the ranked series.

Intervals can be:

  • closed - with both boundary values of the attribute specified;
  • open - with only one boundary value of the attribute specified (up to a certain number or over a certain number).

To help assimilate the theoretical material, we introduce the initial information for solving a cross-cutting task.

There are conditional data on the average number of sales managers, the quantity of similar goods sold by them, the individual market price for this product, as well as the sales volume of 30 companies in one of the regions of the Russian Federation in the first quarter of the reporting year (Table 2.1).

Table 2.1

Initial information for a cross-cutting task

Columns: number of managers (people); quantity of goods sold (pcs.); price (thousand rubles); sales volume (million rubles).

Based on the initial information, as well as additional information, we will set up individual tasks. Then we will present the methodology for solving them and the solutions themselves.

Cross-cutting task. Task 2.1

Using the initial data from Table 2.1, construct a discrete distribution series of firms by quantity of goods sold (Table 2.2).

Solution:

Table 2.2

Discrete series of distribution of firms by quantity of goods sold in one of the regions of the Russian Federation in the first quarter of the reporting year

Cross-cutting task. Task 2.2

Construct a ranked series of the 30 firms by the average number of managers.

Solution:

15; 17; 18; 20; 20; 20; 22; 22; 24; 25; 25; 25; 27; 27; 27; 28; 29; 30; 32; 32; 33; 33; 33; 34; 35; 35; 38; 39; 39; 45.

Cross-cutting task. Task 2.3

Using the initial data from Table 2.1, it is required to:

  • 1. Construct an interval series of distribution of firms by number of managers.
  • 2. Calculate the frequencies of the distribution series of firms.
  • 3. Draw conclusions.

Solution:

Let us calculate the number of intervals using Sturges' formula (2.5):

$n = 1 + 3.322 \lg 30 \approx 5.91 \approx 6$

Thus, we take 6 intervals (groups).

The interval length, or interval step, is calculated using the formula

$h = R / n = (45 - 15) / 6 = 5$

Note. The order of inclusion of population units in the interval boundaries is as follows: [ ), where population units equal to the lower boundary are included in the interval, while units equal to the upper boundary are not included and are transferred to the next interval. The exception to this rule is the last interval, [ ], whose upper boundary includes the last number of the ranked series.

We build an interval series (Table 2.3).

Interval distribution series of firms by the average number of managers in one of the regions of the Russian Federation in the first quarter of the reporting year

Conclusion. The largest group of firms is the group with an average number of managers of 25-30 people, which includes 8 firms (27%); the smallest group, with an average number of managers of 40-45 people, includes only one firm (3%).
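The solution of Task 2.3 can be reproduced with a short Python sketch. The input is the ranked series from Task 2.2; the function names are hypothetical, and rounding the Sturges value to the nearest integer is an assumption that matches the choice of 6 intervals made above.

```python
import math

# Ranked series of the average number of managers (Task 2.2).
managers = [15, 17, 18, 20, 20, 20, 22, 22, 24, 25, 25, 25, 27, 27, 27,
            28, 29, 30, 32, 32, 33, 33, 33, 34, 35, 35, 38, 39, 39, 45]


def sturges(n_units):
    """Number of intervals by Sturges' formula, rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n_units))


def build_interval_series(values):
    """Return (interval bounds, frequencies) for intervals [a, b), last one [a, b]."""
    n = sturges(len(values))
    low, high = min(values), max(values)
    step = (high - low) / n
    counts = [0] * n
    for x in values:
        # Lower boundary included, upper excluded; the maximum stays in the last interval.
        k = min(int((x - low) // step), n - 1)
        counts[k] += 1
    bounds = [(low + k * step, low + (k + 1) * step) for k in range(n)]
    return bounds, counts


bounds, counts = build_interval_series(managers)
for (a, b), f in zip(bounds, counts):
    print(f"{a:g}-{b:g}: {f} firms ({100 * f / len(managers):.0f}%)")
```

The frequencies obtained for the 25-30 and 40-45 groups are 8 and 1, in agreement with the conclusion above.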

Using the initial data from Table 2.1, as well as the interval distribution series of firms by number of managers (Table 2.3), build an analytical grouping of the relationship between the number of managers and the sales volume of firms and, based on it, draw a conclusion about the presence (or absence) of a relationship between these characteristics.

Solution:

An analytical grouping is based on the factor characteristic. In our problem, the factor characteristic (x) is the number of managers, and the resultant characteristic (y) is the sales volume (Table 2.4).

Let us now build the analytical grouping (Table 2.5).

Conclusion. Based on the data of the constructed analytical grouping, we can say that with an increase in the number of sales managers, the average sales volume of the company in the group also increases, which indicates the presence of a direct connection between these characteristics.
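The idea of an analytical grouping can be illustrated with a small Python sketch: firms are assigned to intervals of the factor characteristic x, and the mean of the resultant characteristic y is computed in each group. The function name `analytical_grouping` and all firm data below are hypothetical, since the values of Table 2.1 are not reproduced here; only the interval scheme (lower bound 15, step 5, 6 groups) follows the text.

```python
from collections import defaultdict


def analytical_grouping(pairs, low, step, n_groups):
    """Mean of y in each interval of x; intervals are [a, b), the last one [a, b]."""
    groups = defaultdict(list)
    for x, y in pairs:
        k = min(int((x - low) // step), n_groups - 1)
        groups[k].append(y)
    return {(low + k * step, low + (k + 1) * step): sum(ys) / len(ys)
            for k, ys in sorted(groups.items())}


# HYPOTHETICAL firms: (number of managers, sales volume in million rubles).
firms = [(15, 9.0), (18, 9.4), (22, 9.8), (24, 10.0),
         (27, 10.1), (33, 10.5), (45, 11.0)]

result = analytical_grouping(firms, low=15, step=5, n_groups=6)
for (a, b), mean_y in result.items():
    print(f"{a}-{b} managers: mean sales {mean_y:.2f}")
```

If the group means rise together with the interval of x, as they do for this made-up data, this points to a direct relationship between the characteristics.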

Table 2.4

Auxiliary table for constructing an analytical grouping

Columns: number of managers, people (x); company number; sales volume, million rubles (y). Group averages of sales volume that remain legible: 9.97; 10.22; 10.61; average over all 30 firms: 10.31.

Table 2.5

Dependence of sales volumes on the number of company managers in one of the regions of the Russian Federation in the first quarter of the reporting year

CONTROL QUESTIONS
  • 1. What is the essence of statistical observation?
  • 2. Name the stages of statistical observation.
  • 3. What are the organizational forms of statistical observation?
  • 4. Name the types of statistical observation.
  • 5. What is a statistical summary?
  • 6. Name the types of statistical reports.
  • 7. What is statistical grouping?
  • 8. Name the types of statistical groupings.
  • 9. What is a distribution series?
  • 10. Name the structural elements of the distribution row.
  • 11. What is the procedure for constructing a distribution series?

Once statistical observation data characterizing a particular phenomenon are available, they must first of all be organized, i.e. given a systematic character.

The English statistician W. J. Reichmann said figuratively of disordered collections that encountering a mass of ungeneralized data is like being thrown into a thicket without a compass. What does the systematization of statistical data in the form of distribution series consist of?

Statistical distribution series are ordered statistical aggregates (Table 17). The simplest type of statistical distribution series is a ranked series, i.e. a series of values of the varying characteristic arranged in ascending or descending order. Such a series does not allow one to judge the patterns inherent in the data: around which value most of the observations are grouped, what deviations from this value there are, and what the general picture of the distribution is. For this purpose the data are grouped, showing how often individual observations occur in their total number (Scheme 1).

Table 17

General view of statistical distribution series

Scheme 1. Statistical distribution series

The distribution of population units according to characteristics that do not have a quantitative expression is called an attributive series (for example, the distribution of enterprises by their production area).

Distribution series of population units according to characteristics that have a quantitative expression are called variation series. In such series, the values of the characteristic (variants) are arranged in ascending or descending order.

In a variation distribution series, two elements are distinguished: the variant and the frequency. A variant is an individual value of the grouping characteristic; the frequency is a number that shows how many times each variant occurs.

In mathematical statistics, one more element of the variation series is calculated: the relative frequency. It is defined as the ratio of the frequency of a given interval to the total sum of frequencies; the relative frequency is expressed in fractions of a unit, in percent (%), or in per mille (‰).

Thus, a variation distribution series is a series in which the variants are arranged in ascending or descending order and their frequencies or relative frequencies are indicated. Variation series are either discrete or interval (continuous).

Discrete variation series are distribution series in which a variant, as the value of a quantitative characteristic, can take only certain definite values. Variants differ from each other by one or more units.

Thus, the number of parts produced per shift by a specific worker can be expressed only by one specific number (6, 10, 12, etc.). An example of a discrete variation series is the distribution of workers by the number of parts produced (Table 18).

Table 18

Discrete distribution series

Interval (continuous) variation series are distribution series in which the values of the variants are given as intervals, i.e. the values of the characteristic can differ from each other by an arbitrarily small amount. When constructing a variation series for a continuously varying characteristic, it is impossible to list each value of the variant, so the population is distributed over intervals. The intervals can be equal or unequal. For each of them, frequencies or relative frequencies are indicated (Table 19).

In interval distribution series with unequal intervals, mathematical characteristics such as the distribution density and the relative distribution density on a given interval are calculated. The first is the ratio of the frequency to the width of the interval; the second is the ratio of the relative frequency to the width of the same interval. For the example above, the distribution density in the first interval is 3 : 5 = 0.6, and the relative density in this interval is 7.5 : 5 = 1.5%.

Table 19

Interval distribution series
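Both densities reduce to a single division, as the following Python sketch shows. The function name `densities` is hypothetical. The frequency 3 and the interval width 5 come from the example; the total of 40 observations is an assumption inferred from the fact that a frequency of 3 corresponds to 7.5%.

```python
def densities(frequency, total, width):
    """Return (distribution density, relative density in percent per unit of width)."""
    density = frequency / width
    relative_frequency = 100 * frequency / total  # in percent
    return density, relative_frequency / width


# First interval of the example; total = 40 is an inferred assumption.
density, relative_density = densities(frequency=3, total=40, width=5)
print(density, relative_density)
```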

A discrete variation series is constructed for discrete characteristics.

To construct a discrete variation series, the following steps are performed:

1) arrange the units of observation in increasing order of the value of the studied characteristic,

2) determine all possible values of the attribute $x_i$ and arrange them in ascending order,

3) count how many times each value of the attribute occurs, i.e. its frequency $f_i$,

4) write the data obtained into a table of two rows (columns), $x_i$ and $f_i$.

The sum of all frequencies of a series is equal to the number of elements in the population being studied.

Example 1 .

List of grades received by students in exams: 3; 4; 3; 5; 4; 2; 2; 4; 4; 3; 5; 2; 4; 5; 4; 3; 4; 3; 3; 4; 4; 2; 2; 5; 5; 4; 5; 2; 3; 4; 4; 3; 4; 5; 2; 5; 5; 4; 3; 3; 4; 2; 4; 4; 5; 4; 3; 5; 3; 5; 4; 4; 5; 4; 4; 5; 4; 5; 5; 5.

Here the grade X is a discrete random variable, and the resulting list of grades is the statistical (observed) data.

1) arrange the observation units in ascending order of the value of the studied characteristic:

2; 2; 2; 2; 2; 2; 2; 2; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5.

2) determine all possible values of the attribute x_i and arrange them in ascending order:

In this example, all grades can be divided into four groups with the following values: 2; 3; 4; 5.

The value of the random variable corresponding to a particular group of observed data is called a value of the attribute, or variant, and is denoted x_i.

3) count how many times each value occurs. The number that shows how many times the corresponding value of the characteristic occurs in a series of observations is called the frequency of the attribute value and is denoted f_i.

For our example:

grade 2 occurs 8 times,

grade 3 occurs 12 times,

grade 4 occurs 23 times,

grade 5 occurs 17 times.

There are 60 grades in total.

4) write the resulting data into a table of two rows (columns): x_i and f_i.

Based on these data, a discrete variation series can be constructed.

Discrete variation series - a table in which the observed values of the characteristic being studied are listed as individual values in ascending order together with their frequencies.
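The steps of Example 1 can be reproduced in Python. This is a minimal sketch that uses the standard-library `collections.Counter`; the variable names are illustrative, and the data are the 60 grades listed in the example.

```python
from collections import Counter

# The 60 exam grades from Example 1, in the order they were recorded.
grades = [
    3, 4, 3, 5, 4, 2, 2, 4, 4, 3,
    5, 2, 4, 5, 4, 3, 4, 3, 3, 4,
    4, 2, 2, 5, 5, 4, 5, 2, 3, 4,
    4, 3, 4, 5, 2, 5, 5, 4, 3, 3,
    4, 2, 4, 4, 5, 4, 3, 5, 3, 5,
    4, 4, 5, 4, 4, 5, 4, 5, 5, 5,
]

# Steps 1-3: count the frequency f_i of every distinct value x_i
# and sort the values in ascending order.
frequencies = Counter(grades)
series = sorted(frequencies.items())

# Step 4: print the series as a two-row table.
total = sum(frequencies.values())
print("x_i:", [x for x, _ in series])
print("f_i:", [f for _, f in series])
print("n  =", total)
```

The sum of the frequencies equals the number of observations, which is a convenient check that no value was lost during grouping.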

  1. Construction of an interval variation series

In addition to the discrete variation series, another common way of grouping data is the interval variation series.

An interval series is constructed if:

    the characteristic varies continuously;

    there are many discrete values (more than 10);

    the frequencies of the discrete values are very small (no more than 1-3 with a relatively large number of observation units);

    many discrete values of the characteristic have the same frequency.

An interval variation series is a way of grouping data in the form of a table that has two columns (the values ​​of the characteristic in the form of an interval of values ​​and the frequency of each interval).

Unlike a discrete series, the values ​​of the characteristic of an interval series are represented not by individual values, but by an interval of values ​​(“from - to”).

The number that shows how many observation units fall into each selected interval is called the frequency of the attribute value and is denoted f_i. The sum of all frequencies of a series is equal to the number of elements (units of observation) in the population being studied.

If a unit has a characteristic value equal to the upper limit of the interval, then it should be assigned to the next interval.

For example, a child with a height of 100 cm will fall into the 2nd interval, and not into the first; and a child with a height of 130 cm will fall into the last interval, and not into the third.
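The boundary rule can be sketched in Python with the standard-library `bisect` module. The boundaries 100, 110 and 130 are assumptions chosen to match the example (only 100 and 130 are named in the text), and the function name `interval_number` is illustrative.

```python
from bisect import bisect_right

# Boundaries between neighbouring intervals, in cm.
# Assumption: 110 is chosen for illustration; the text names only 100 and 130.
boundaries = [100, 110, 130]


def interval_number(height):
    """Return the 1-based number of the interval that a height falls into.

    A value equal to a boundary is assigned to the next interval,
    because bisect_right places it to the right of that boundary.
    """
    return bisect_right(boundaries, height) + 1


print(interval_number(99.5))  # below the first boundary: interval 1
print(interval_number(100))   # equal to a boundary: interval 2, not 1
print(interval_number(130))   # equal to the last boundary: interval 4 (the last)
```

Using `bisect_right` rather than `bisect_left` is what implements the convention that the upper limit of an interval belongs to the following interval.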

Based on these data, an interval variation series can be constructed.

Each interval has a lower bound (xn), an upper bound (xw) and an interval width (i).

The interval boundary is the value of the attribute that lies on the border of two intervals.

[Table: children's height (cm) | number of children; the last interval is "more than 130"]

If an interval has both an upper and a lower boundary, it is called a closed interval. If an interval has only a lower or only an upper boundary, it is called an open interval. Only the very first or the very last interval can be open. In the above example, the last interval is open.

Interval width (i) - the difference between the upper and lower limits of the interval:

i = xw - xn

The width of the open interval is assumed to be the same as the width of the adjacent closed interval.

[Table: children's height (cm) | number of children | interval width (i). For the open interval "more than 130", the upper limit taken for calculations is 130 + 20 = 150 and the width is 20, because the width of the adjacent closed interval is 20.]
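The rule for the width of an open interval can be expressed as a short Python sketch. The representation of an interval as a `(lower, upper)` pair with `None` for a missing boundary is an assumption made for illustration, as are the function name and the first two intervals; only the interval starting at 130 and the adjacent width of 20 come from the example.

```python
def interval_widths(intervals):
    """Return the width of every interval given as a (lower, upper) pair.

    None marks a missing boundary (an open interval). Only the first or the
    last interval may be open; its width is taken from the adjacent closed one.
    Assumes at least two intervals and at least one closed neighbour.
    """
    widths = [
        None if lower is None or upper is None else upper - lower
        for lower, upper in intervals
    ]
    if widths[0] is None:
        widths[0] = widths[1]
    if widths[-1] is None:
        widths[-1] = widths[-2]
    return widths


# Hypothetical intervals; the last one ("more than 130") is open.
intervals = [(100, 110), (110, 130), (130, None)]
widths = interval_widths(intervals)

# Conditional upper limit of the open interval, used only for calculations.
conditional_upper = intervals[-1][0] + widths[-1]
print(widths, conditional_upper)
```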

All interval series are divided into interval series with equal intervals and interval series with unequal intervals. In interval series with equal intervals, the width of all intervals is the same. In interval series with unequal intervals, the widths of the intervals differ.

The example under consideration is an interval series with unequal intervals.

Laboratory work No. 1

in mathematical statistics

Topic: Primary processing of experimental data

3. Score in points

5. Test questions

6. Methodology for performing laboratory work

Goal of the work

Acquiring skills in primary processing of empirical data using methods of mathematical statistics.

Based on the totality of experimental data, complete the following tasks:

Task 1. Construct an interval variation distribution series.

Task 2. Construct a histogram of frequencies of an interval variation series.

Task 3. Create an empirical distribution function and plot a graph.

Task 4. Calculate the numerical characteristics of the variation series:

a) mode and median;

b) conditional initial moments;

c) sample average;

d) sample variance, corrected population variance, corrected standard deviation;

e) coefficient of variation;

f) asymmetry;

g) kurtosis;

Task 5. Determine the boundaries of the true values ​​of the numerical characteristics of the random variable being studied with a given reliability.

Task 6. Content-based interpretation of the results of primary processing according to the conditions of the task.

Score in points

Tasks 1-5: 6 points

Task 6: 2 points

Defense of laboratory work (oral interview on test questions and laboratory work): 2 points

The work must be submitted in written form on A4 sheets and must include:

1) Title page (Appendix 1)

2) Initial data.

3) Presentation of the work according to the specified template.

4) Calculation results (done manually and/or using MS Excel) in the specified order.

5) Conclusions - meaningful interpretation of the results of primary processing according to the conditions of the problem.

6) Oral interview on work and control questions.



5. Test questions


Methodology for performing laboratory work

Task 1. Construct an interval variational distribution series

In order to present statistical data in the form of a variation series with equally spaced variants, it is necessary to:

1. In the original data table, find the smallest and largest values, x_min and x_max.

2. Determine the range of variation: R = x_max - x_min.

3. Determine the length of the interval h. If the sample contains up to 1000 values, use the formula h = R / (1 + 3.322 lg n), where R is the range of variation and n is the sample size, i.e. the number of values in the sample (lg n is the decimal logarithm of n).

The calculated ratio is rounded to a convenient integer value.

4. To determine the beginning of the first interval for an even number of intervals, it is recommended to take the value ; and for an odd number of intervals .

5. Write down the grouping intervals in ascending order of their boundaries:

a_0 - a_1, a_1 - a_2, ..., a_(k-1) - a_k,

where a_0 is the lower limit of the first interval. A convenient number no greater than the smallest value in the data is taken for it; the upper limit of the last interval, a_k, should be no less than the largest value. It is recommended that the intervals contain all the initial values of the random variable and that the data be divided into 5 to 20 intervals.

6. Distribute the initial data over the grouping intervals, i.e. use the source table to count the number of values of the random variable that fall within each interval. If some values coincide with interval boundaries, they are assigned consistently either only to the preceding or only to the following interval.
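The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the prescribed procedure: the interval length is taken from Sturges' rule h = R / (1 + 3.322 lg n), the first interval is assumed to start at the largest integer not exceeding the smallest value, boundary values are assigned to the following interval, and the sample and all names are hypothetical.

```python
import math


def build_interval_series(data):
    """Group data into equal-width intervals; return (edges, counts)."""
    x_min, x_max = min(data), max(data)        # step 1: smallest and largest values
    n = len(data)
    r = x_max - x_min                          # step 2: range of variation
    h = r / (1 + 3.322 * math.log10(n))        # step 3: Sturges' rule (assumed)
    h = max(1, round(h))                       # round to a convenient integer
    start = math.floor(x_min)                  # step 4 (assumed): start at or below x_min
    k = int((x_max - start) // h) + 1          # enough intervals to cover x_max
    edges = [start + i * h for i in range(k + 1)]   # step 5: interval boundaries
    counts = [0] * k
    for x in data:                             # step 6: count values per interval
        i = int((x - start) // h)              # a boundary value goes to the next interval
        counts[i] += 1
    return edges, counts


# Hypothetical sample of 20 measurements, used only to illustrate the steps.
sample = [12, 15, 11, 18, 21, 14, 16, 19, 13, 17,
          22, 15, 16, 14, 18, 20, 15, 17, 16, 13]
edges, counts = build_interval_series(sample)
for lower, upper, f in zip(edges, edges[1:], counts):
    print(f"{lower}-{upper}: {f}")
```

For this sample the rule gives six intervals of width 2, which is within the recommended 5 to 20 intervals; the frequencies sum to the sample size.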

Note 1. The intervals do not have to be equal in length. Where the values are denser, it is more convenient to take shorter intervals, and where they are sparser, longer ones.

Note 2. If zero or very small frequencies are obtained for some intervals, the data must be regrouped by enlarging the intervals (increasing the step).
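One simple way to enlarge the intervals, as Note 2 suggests, is to merge neighbouring intervals and add up their frequencies. The following Python sketch assumes that intervals are merged in groups of a fixed size; the function name, the `factor` parameter and the example data are hypothetical.

```python
def enlarge_intervals(edges, counts, factor=2):
    """Merge every `factor` adjacent intervals, summing their frequencies.

    `edges` holds k + 1 increasing boundaries for k intervals. If k is not a
    multiple of `factor`, the last merged interval is narrower than the others.
    """
    new_counts = [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]
    new_edges = edges[::factor]
    if new_edges[-1] != edges[-1]:
        new_edges.append(edges[-1])   # keep the original upper limit
    return new_edges, new_counts


# Hypothetical series with zero and very small frequencies.
old_edges = [0, 5, 10, 15, 20, 25, 30]
old_counts = [4, 0, 7, 1, 0, 3]
new_edges, new_counts = enlarge_intervals(old_edges, old_counts)
print(new_edges, new_counts)
```

Merging preserves the total number of observations, so the sum of the new frequencies can be compared with the old one as a check.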