What are variants? The variation series

The set of values of the parameter studied in a given experiment or observation, ranked in ascending or descending order, is called a variation series.

Suppose we measured blood pressure in ten patients and recorded only the upper value, the systolic pressure, i.e. a single number per patient.

Suppose the resulting series of observations (the statistical population) of systolic arterial pressure in the 10 patients looks as follows (Table 1):

Table 1

The elements of a variation series are called variants. A variant is a numerical value of the characteristic being studied.

Constructing a variation series from the observations of a statistical population is only the first step towards understanding the characteristics of the entire population. Next, the average level of the quantitative characteristic being studied must be determined (average blood protein level, average weight of patients, average time of onset of anesthesia, etc.).

The average level is measured using statistics called averages. An average is a generalizing numerical characteristic of qualitatively homogeneous values; it describes the entire statistical population with a single number for a single characteristic. The average expresses what is common to a characteristic in a given set of observations.

Three types of averages are in common use: the mode (Mo), the median (Me) and the arithmetic mean ($\bar{x}$).

To determine any average value, it is necessary to use the results of individual observations, recording them in the form of a variation series (Table 2).

Mode: the value that occurs most frequently in a series of observations. In our example, Mo = 120. If no value in the variation series is repeated, the series is said to have no mode. If several values are repeated the same number of times, the smallest of them is taken as the mode.

Median: the value that divides a distribution into two equal parts, i.e. the central value of a series of observations ordered in ascending or descending order. If a variation series has 5 values, its median is the third term of the series. If the series has an even number of terms, the median is the arithmetic mean of its two central observations; for example, with 10 observations the median is the arithmetic mean of the 5th and 6th observations. In our example both of these observations equal 120 (the ordered series contains 115 three times, 120 six times and 125 once), so Me = 120.

Note an important feature of the mode and the median: their values are not affected by the numerical values of the extreme variants.

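The arithmetic mean is calculated by the formula

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$

where $x_i$ is the observed value in the $i$-th observation and $n$ is the number of observations. For our case, $\bar{x} = (3 \cdot 115 + 6 \cdot 120 + 1 \cdot 125)/10 = 1190/10 = 119$.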
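The three averages can be checked with a short Python sketch. Table 1 is not reproduced here, so the list below is rebuilt from the frequencies given in the text (120 six times, 115 three times, 125 once); the variable names are illustrative only.

```python
from statistics import mean, median, multimode

# Ten systolic readings rebuilt from the frequencies stated in the text
# (an assumption, since Table 1 itself is not shown).
pressures = sorted([115] * 3 + [120] * 6 + [125])

# multimode returns every most-frequent value; following the text's
# convention, the smallest one is taken when several values tie.
mode_value = min(multimode(pressures))

# With an even number of observations, median averages the two central ones.
median_value = median(pressures)

# Arithmetic mean: sum of the values divided by their number.
mean_value = mean(pressures)

print(mode_value, median_value, mean_value)
```

Replacing the largest reading with a much bigger number changes `mean_value` but leaves `mode_value` and `median_value` as they are, which illustrates the remark about extreme variants.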

The arithmetic mean has three properties:

The mean occupies the middle position in the variation series. In a strictly symmetrical series, $\bar{x} = Me = Mo$.

The mean is a generalizing value: random fluctuations and differences in individual data are not visible behind it. It reflects what is typical of the entire population.

The sum of the deviations of all variants from the mean is zero: $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$. The deviation of a variant from the mean is written $x_i - \bar{x}$.

The variation series consists of variants and their corresponding frequencies. Of the ten values obtained, 120 occurred 6 times, 115 occurred 3 times and 125 occurred once. The frequency ($f$) is the absolute number of occurrences of an individual variant in the population, i.e. how many times that variant occurs in the variation series.
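As a minimal sketch, the variation series with its frequencies can be built from the raw readings, and the zero-sum property of the deviations checked at the same time. The raw list is rebuilt from the frequencies quoted above; the names are illustrative.

```python
from collections import Counter

pressures = [115] * 3 + [120] * 6 + [125]

# Variation series: each variant paired with its frequency f, in ascending order.
series = sorted(Counter(pressures).items())

# The sum of the frequencies is the number of observations.
n = sum(f for _, f in series)

# Mean computed from the grouped series (variants weighted by frequencies).
weighted_mean = sum(x * f for x, f in series) / n

# Sum of deviations from the mean, each counted f times; it should be zero.
deviation_sum = sum((x - weighted_mean) * f for x, f in series)

print(series, n, weighted_mean, deviation_sum)
```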

A variation series can be simple (every frequency equals 1) or grouped and condensed, with the variants combined in groups of 3-5. A simple series is used when the number of observations is small, a grouped series when it is large.

As a result of mastering this chapter, the student must:

know

  • the indicators of variation and their relationship;
  • the basic laws of distribution of characteristics;
  • the essence of goodness-of-fit criteria;

be able to

  • calculate indicators of variation and goodness-of-fit criteria;
  • determine distribution characteristics;
  • evaluate the main numerical characteristics of statistical distribution series;

own

Variation indicators

In the statistical study of the characteristics of various statistical populations, the variation of a characteristic across the individual units of the population, and the nature of the distribution of units by that characteristic, are of great interest. Variation is the difference in the individual values of a characteristic among the units of the population being studied. The study of variation is of great practical importance. From the degree of variation one can judge the limits within which a characteristic varies, the homogeneity of the population with respect to that characteristic, how typical the average is, and the relationship between the factors that determine the variation. Variation indicators are used to characterize and organize statistical populations.

The results of the summary and grouping of statistical observation data, presented as statistical distribution series, are an ordered distribution of the units of the population into groups according to a grouping (varying) characteristic. If a qualitative characteristic is the basis of the grouping, the distribution series is called attributive (distribution by profession, gender, color, etc.). If the distribution series is built on a quantitative characteristic, it is called a variation series (distribution by height, weight, wage level, etc.). To construct a variation series means to order the quantitative distribution of population units by the values of the characteristic, count the number of units with each value (the frequency), and arrange the results in a table.

Instead of the frequency of a variant, one can use its ratio to the total number of observations, which is called the relative frequency.

There are two types of variation series: discrete and interval. A discrete series is a variation series built on a characteristic that changes discontinuously (a discrete characteristic), such as the number of employees at an enterprise, the wage grade, or the number of children in a family. A discrete variation series is a table with two columns: the first gives the specific values of the characteristic, the second the number of population units with each value.

If a characteristic changes continuously (income, length of service, value of an enterprise's fixed assets, etc., which can take any value within certain limits), an interval variation series can be built for it. The table for an interval series also has two columns: the first gives the values of the characteristic as intervals “from - to” (the variants), the second the number of units falling within each interval (the frequency). The frequency is the number of times a particular variant of the characteristic is repeated.

Intervals can be closed or open. Closed intervals are bounded on both sides, i.e. they have both a lower (“from”) and an upper (“to”) boundary. Open intervals have only one boundary, either upper or lower. If the variants are arranged in ascending or descending order, the series is called ranked.

Two further frequency characteristics are used for variation series: the cumulative frequency and the cumulative relative frequency. The cumulative frequency shows in how many observations the characteristic took values that do not exceed a given value. It is obtained by adding the frequency of a given group to the frequencies of all preceding groups. The cumulative relative frequency is the share of observation units whose values of the characteristic do not exceed the upper limit of the given group; it thus shows the proportion of variants in the population whose value is no greater than the given one. Frequency, relative frequency, absolute and relative density, cumulative frequency and cumulative relative frequency are all characteristics of the size of a variant.
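A minimal sketch of these quantities, using the blood pressure series from the first part of the text (variants 115, 120, 125 with frequencies 3, 6, 1); the variable names are illustrative.

```python
from itertools import accumulate

variants = [115, 120, 125]
frequencies = [3, 6, 1]
n = sum(frequencies)

# Relative frequency: share of each variant in the total number of observations.
relative = [f / n for f in frequencies]

# Cumulative frequency: frequency of a group plus those of all preceding groups.
cum_freq = list(accumulate(frequencies))

# Cumulative relative frequency: share of observations not exceeding the variant.
cum_rel = [c / n for c in cum_freq]

for row in zip(variants, frequencies, relative, cum_freq, cum_rel):
    print(row)
```

The last cumulative frequency equals the size of the population, and the last cumulative relative frequency equals one.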

The variation of a characteristic across the units of a population, and the nature of its distribution, are studied using the indicators and characteristics of the variation series: the average level of the series, the mean linear deviation, the standard deviation, the variance, the coefficients of oscillation and variation, asymmetry, kurtosis, etc.

Averages are used to characterize the center of the distribution. An average is a generalizing statistical characteristic that quantifies the typical level of a characteristic possessed by the members of the population being studied. However, arithmetic means can coincide for distributions of different shape. For this reason the so-called structural averages are also calculated as characteristics of a variation series: the mode, the median, and the quantiles, which divide the distribution series into equal parts (quartiles, deciles, percentiles, etc.).

The mode is the value of a characteristic that occurs in the distribution series more often than any other value. For a discrete series it is the variant with the highest frequency. In an interval variation series, the interval containing the mode, the so-called modal interval, must be found first. In a series with equal intervals the modal interval is the one with the highest frequency; in a series with unequal intervals it is the one with the highest distribution density. The mode of a series with equal intervals is then found by the formula

$$Mo = x_{Mo} + h\,\frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})},$$

where $Mo$ is the mode; $x_{Mo}$ is the lower limit of the modal interval; $h$ is the width of the modal interval; $f_{Mo}$ is the frequency of the modal interval; $f_{Mo-1}$ is the frequency of the interval preceding the modal one; $f_{Mo+1}$ is the frequency of the interval following the modal one. For a series with unequal intervals, the frequencies $f_{Mo-1}$, $f_{Mo}$, $f_{Mo+1}$ in this formula are replaced by the distribution densities of the same three intervals.
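The calculation can be sketched in Python, assuming the usual interpolation formula Mo = x_Mo + h (f_Mo - f_Mo-1) / ((f_Mo - f_Mo-1) + (f_Mo - f_Mo+1)). The function name and the interval series are hypothetical and serve only as an illustration.

```python
def interval_mode(lower, width, f_prev, f_modal, f_next):
    """Mode of an equal-interval series from the modal interval and its neighbours."""
    left = f_modal - f_prev    # excess of the modal frequency over the preceding one
    right = f_modal - f_next   # excess of the modal frequency over the following one
    return lower + width * left / (left + right)

# Hypothetical series: intervals 10-20, 20-30, 30-40, 40-50
# with frequencies 4, 12, 16, 8; the modal interval is 30-40.
mode_estimate = interval_mode(lower=30, width=10, f_prev=12, f_modal=16, f_next=8)
print(mode_estimate)
```

The estimate lies inside the modal interval and is pulled towards the neighbour with the higher frequency, here the preceding interval.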

If there is a single mode, the probability distribution of the random variable is called unimodal; if there is more than one mode, it is called multimodal (polymodal), and in the case of two modes, bimodal. As a rule, multimodality indicates that the distribution does not follow the normal law. Homogeneous populations are usually characterized by single-peaked distributions, so several peaks also point to heterogeneity of the population. When two or more peaks appear, the data should be regrouped to isolate more homogeneous groups.

In an interval variation series the mode can be determined graphically from a histogram. Two intersecting lines are drawn from the top corners of the tallest column of the histogram to the top corners of the two adjacent columns. A perpendicular is then dropped from their point of intersection onto the abscissa axis. The value of the characteristic at the foot of the perpendicular is the mode. In many cases the mode is preferred to the arithmetic mean as a generalized indicator of a population.

The median is the central value of the characteristic, the value possessed by the central member of the ranked distribution series. In a discrete series, the ordinal number of the median is determined first. If the number of units is odd, one is added to the sum of all frequencies and the result is divided by two. If the number of units is even, there are two median units, and the median is the average of their values. Thus the median of a discrete variation series is the value that divides the series into two parts containing the same number of variants.

In an interval series, once the ordinal number of the median has been determined, the median interval is found from the cumulative frequencies (or cumulative relative frequencies), and the median itself is then calculated by the formula

$$Me = x_{Me} + h\,\frac{\frac{1}{2}\sum f_i - S_{Me-1}}{f_{Me}},$$

where $Me$ is the median; $x_{Me}$ is the lower limit of the median interval; $h$ is the width of the median interval; $\sum f_i$ is the sum of the frequencies of the distribution series; $S_{Me-1}$ is the cumulative frequency of the interval preceding the median one; $f_{Me}$ is the frequency of the median interval.
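A minimal sketch of this calculation, assuming the usual interpolation formula Me = x_Me + h (Σf/2 - S_Me-1) / f_Me. The function name and the interval series are hypothetical.

```python
def interval_median(lower, width, total, cum_before, f_median):
    """Median of an interval series by linear interpolation inside the median interval."""
    return lower + width * (total / 2 - cum_before) / f_median

# Hypothetical series: intervals 10-20, 20-30, 30-40, 40-50
# with frequencies 4, 12, 16, 8 (total 40).
# Cumulative frequencies are 4, 16, 32, 40, so the 20th unit
# falls in the interval 30-40, which is the median interval.
median_estimate = interval_median(lower=30, width=10, total=40,
                                  cum_before=16, f_median=16)
print(median_estimate)
```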

The median can also be found graphically from the cumulative curve (cumulate). From the point on the scale of cumulative frequencies (or cumulative relative frequencies) that corresponds to the ordinal number of the median, a straight line is drawn parallel to the abscissa axis until it meets the cumulate. From the point of intersection a perpendicular is dropped onto the abscissa axis. The value of the characteristic at the foot of this perpendicular is the median.

The median is characterized by the following properties.

  • 1. It does not depend on the values of the characteristic that lie on either side of it.
  • 2. It has the minimality property: the sum of the absolute deviations of the values of the characteristic from the median is no larger than the sum of their absolute deviations from any other value.
  • 3. When two distributions with known medians are combined, the median of the new distribution cannot be predicted in advance.
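The minimality property can be illustrated with a small sketch. The data are the ten blood pressure readings rebuilt from the frequencies given earlier in the text; the list of candidate centers is an arbitrary choice for the illustration.

```python
from statistics import median

values = [115] * 3 + [120] * 6 + [125]

def total_abs_deviation(center):
    """Sum of absolute deviations of all values from the given center."""
    return sum(abs(x - center) for x in values)

me = median(values)

# Compare the median with a few other candidate centers, including the mean (119).
candidates = [115, 118, 119, 120, 121, 125]
best = min(candidates, key=total_abs_deviation)

print(me, best, total_abs_deviation(me))
```

Among the candidates tried, the median gives the smallest sum of absolute deviations; the arithmetic mean does not.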

These properties of the median are widely used when siting public service points: schools, clinics, gas stations, water standpipes, etc. For example, if a clinic is to be built in a certain city block, it is more expedient to place it at the point that halves not the length of the block but the number of its residents.

The relationship between the mode, the median and the arithmetic mean indicates the nature of the distribution of the characteristic in the population and allows the symmetry of the distribution to be assessed. If $\bar{x} < Me < Mo$, the series has left-sided asymmetry; if $\bar{x} > Me > Mo$, it has right-sided asymmetry. For a normal distribution, $\bar{x} = Me = Mo$.

By fitting curves of various types, K. Pearson established that for moderately asymmetric distributions the following approximate relationship between the arithmetic mean, the median and the mode holds:

$$|\bar{x} - Mo| \approx 3\,|\bar{x} - Me|,$$

where $Me$ is the median; $Mo$ is the mode; $\bar{x}$ is the arithmetic mean.

If the structure of the variation series needs to be studied in more detail, characteristic values analogous to the median are calculated. These values divide the units of the distribution into equal parts; they are called quantiles or gradients. Quantiles include quartiles, deciles, percentiles, etc.

Quartiles divide the population into four equal parts. The first quartile is calculated in the same way as the median, after first finding the first quartile interval:

$$Q_1 = x_{Q_1} + h\,\frac{\frac{1}{4}\sum f_i - S_{Q_1-1}}{f_{Q_1}},$$

where $Q_1$ is the first quartile; $x_{Q_1}$ is the lower limit of the first quartile interval; $h$ is the width of the first quartile interval; $f_i$ are the frequencies of the interval series; $S_{Q_1-1}$ is the cumulative frequency of the interval preceding the first quartile interval; $f_{Q_1}$ is the frequency of the first quartile interval.

The first quartile shows that 25% of the population units are smaller than its value and 75% are larger. The second quartile is equal to the median, i.e. $Q_2 = Me$.

The third quartile is calculated by analogy, after first finding the third quartile interval:

$$Q_3 = x_{Q_3} + h\,\frac{\frac{3}{4}\sum f_i - S_{Q_3-1}}{f_{Q_3}},$$

where $x_{Q_3}$ is the lower limit of the third quartile interval; $h$ is the width of the third quartile interval; $f_i$ are the frequencies of the interval series; $S_{Q_3-1}$ is the cumulative frequency of the interval preceding the third quartile interval; $f_{Q_3}$ is the frequency of the third quartile interval.

The third quartile shows that 75% of the population units are less than its value, and 25% are more.

The difference between the third and first quartiles is the interquartile range:

$$\Delta Q = Q_3 - Q_1,$$

where $\Delta Q$ is the interquartile range; $Q_3$ is the third quartile; $Q_1$ is the first quartile.
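The quartile formulas differ from the median formula only in the share of the total frequency that is targeted, so one helper can cover all of them. The sketch below assumes the interpolation formula Q = x_Q + h (p Σf - S_Q-1) / f_Q with p = 1/4, 1/2, 3/4; the function name and the interval series are hypothetical.

```python
def interval_quantile(bounds, freqs, p):
    """Quantile of order p for an interval series, by interpolation in its interval.

    bounds: list of (lower, upper) interval limits in ascending order
    freqs:  frequency of each interval
    p:      share of the population below the quantile (0.25 for Q1, 0.75 for Q3)
    """
    target = p * sum(freqs)
    cum = 0  # cumulative frequency of the preceding intervals
    for (lower, upper), f in zip(bounds, freqs):
        if f > 0 and cum + f >= target:
            return lower + (upper - lower) * (target - cum) / f
        cum += f
    raise ValueError("series is empty or p is out of range")

# Hypothetical series used for illustration only.
bounds = [(10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [4, 12, 16, 8]

q1 = interval_quantile(bounds, freqs, 0.25)
q2 = interval_quantile(bounds, freqs, 0.50)
q3 = interval_quantile(bounds, freqs, 0.75)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

With p = 0.5 the helper returns the median, consistent with the statement that the second quartile equals the median.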

Deciles divide the population into 10 equal parts. A decile is the value of the characteristic in the distribution series that corresponds to a given tenth of the population size. By analogy with quartiles, the first decile shows that 10% of the population units are smaller than its value and 90% are larger, and the ninth decile shows that 90% of the units are smaller than its value and 10% are larger. The ratio of the ninth decile to the first, the decile coefficient, is widely used in studies of income differentiation to compare the income levels of the most affluent 10% and the least affluent 10% of the population. Percentiles divide the ranked population into 100 equal parts. Their calculation, meaning and application are analogous to those of deciles.

Quartiles, deciles and other structural characteristics can be determined graphically from the cumulate, by analogy with the median.

The size of the variation is measured by the following indicators: the range of variation, the mean linear deviation, the standard deviation and the variance. The range of variation depends entirely on where the extreme members of the series happen to fall. It is of interest when it matters what the amplitude of fluctuation of the characteristic is:

$$R = x_{\max} - x_{\min},$$

where $R$ is the range of variation; $x_{\max}$ is the maximum value of the characteristic; $x_{\min}$ is the minimum value of the characteristic.

The range of variation ignores the values of the vast majority of the members of the series, although variation is associated with every one of them. Indicators that average the deviations of the individual values of a characteristic from their mean, namely the mean linear deviation and the standard deviation, do not have this drawback. Individual deviations from the mean are directly related to the variability of the characteristic: the stronger the variability, the larger the absolute deviations from the mean.

The mean linear deviation is the arithmetic mean of the absolute values of the deviations of the individual variants from their mean.

Mean linear deviation for ungrouped data:

$$\bar{l} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n},$$

where $\bar{l}$ is the mean linear deviation; $x_i$ is the value of the characteristic; $\bar{x}$ is the mean value of the characteristic for the population being studied; $n$ is the number of units in the population.

Mean linear deviation for a grouped series:

$$\bar{l} = \frac{\sum |x_i - \bar{x}|\, f_i}{\sum f_i},$$

where $\bar{l}$ is the mean linear deviation; $x_i$ is the value of the characteristic; $\bar{x}$ is the mean value of the characteristic for the population being studied; $f_i$ is the number of population units in a given group.

The signs of the deviations are ignored here, since otherwise the sum of all deviations would be zero. Depending on whether the data are grouped, the mean linear deviation is calculated by one of two formulas, for ungrouped or for grouped data. Because it rests on this convention, the mean linear deviation is used in practice relatively rarely on its own, apart from other indicators of variation (for example, to characterize the fulfilment of contractual obligations on the evenness of deliveries, or in the analysis of foreign trade turnover, workforce composition, the rhythm of production, product quality with regard to the technological features of production, etc.).

The standard deviation shows by how much, on average, the individual values of the characteristic deviate from the mean of the population, and is expressed in the units of measurement of the characteristic. As one of the main measures of variation, it is widely used to assess the limits of variation of a characteristic in a homogeneous population, to determine the ordinates of the normal distribution curve, and in calculations connected with organizing sample observation and establishing the accuracy of sample characteristics. For ungrouped data it is calculated as follows: each deviation from the mean is squared, the squares are summed, the sum is divided by the number of terms in the series, and the square root of the quotient is taken:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}},$$

where $\sigma$ is the standard deviation; $x_i$ is the value of the characteristic; $\bar{x}$ is the mean value of the characteristic for the population being studied; $n$ is the number of units in the population.

For grouped data, the standard deviation is calculated using the weighted formula

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}},$$

where $\sigma$ is the standard deviation; $x_i$ is the value of the characteristic; $\bar{x}$ is the mean value of the characteristic for the population being studied; $f_i$ is the number of population units in a given group.
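As an illustration, the weighted (grouped) forms of the mean linear deviation and the standard deviation can be sketched as follows. The series is the blood pressure example from the first part of the text; the variable names are illustrative.

```python
from math import sqrt

variants = [115, 120, 125]
frequencies = [3, 6, 1]
n = sum(frequencies)

# Weighted arithmetic mean of the grouped series.
mean = sum(x * f for x, f in zip(variants, frequencies)) / n

# Mean linear deviation: weighted mean of the absolute deviations.
mean_linear_dev = sum(abs(x - mean) * f for x, f in zip(variants, frequencies)) / n

# Standard deviation: square root of the weighted mean of squared deviations.
# The divisor is n, as in the text (population form, not the n - 1 sample form).
std_dev = sqrt(sum((x - mean) ** 2 * f for x, f in zip(variants, frequencies)) / n)

print(mean, mean_linear_dev, std_dev)
```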

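The expression under the root in both cases is called the variance. Thus the variance is the mean square of the deviations of the values of the characteristic from their mean. For unweighted (simple) values of the characteristic, the variance is

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}.$$

For weighted values of the characteristic,

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}.$$

There is also a simplified method of calculating the variance. In general form,

$$\sigma^2 = \overline{x^2} - \bar{x}^{\,2};$$

for unweighted (simple) values of the characteristic,

$$\sigma^2 = \frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2;$$

for weighted values of the characteristic,

$$\sigma^2 = \frac{\sum x_i^2 f_i}{\sum f_i} - \left(\frac{\sum x_i f_i}{\sum f_i}\right)^2;$$

using the conditional-zero method,

$$\sigma^2 = \frac{\sum \left(\frac{x_i - A}{h}\right)^2 f_i}{\sum f_i}\, h^2 - (\bar{x} - A)^2,$$

where $\sigma^2$ is the variance; $x_i$ is the value of the characteristic; $\bar{x}$ is the mean value of the characteristic; $h$ is the width of the group interval; $f_i$ is the weight; $A$ is a conventionally chosen constant (the conditional zero).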

Variance is a measure in its own right in statistics and is one of the most important indicators of variation. It is measured in the square of the units of measurement of the characteristic being studied.
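The simplified calculation can be checked against the direct definition. The sketch assumes the standard shortcut (mean of squares minus square of the mean) and a conditional-zero form in which deviations are taken from a constant A and divided by the interval width h; the values A = 120 and h = 5 are chosen here only for illustration.

```python
variants = [115, 120, 125]
frequencies = [3, 6, 1]
n = sum(frequencies)
pairs = list(zip(variants, frequencies))

mean = sum(x * f for x, f in pairs) / n

# Direct definition: weighted mean of squared deviations from the mean.
direct = sum((x - mean) ** 2 * f for x, f in pairs) / n

# Shortcut: mean of the squares minus the square of the mean.
shortcut = sum(x * x * f for x, f in pairs) / n - mean ** 2

# Conditional-zero form with an assumed constant A and interval width h.
A, h = 120, 5
second_moment = sum(((x - A) / h) ** 2 * f for x, f in pairs) / n
conditional_zero = second_moment * h ** 2 - (mean - A) ** 2

print(direct, shortcut, conditional_zero)
```

All three routes give the same variance, which is the point of the simplified methods: they avoid subtracting a fractional mean from every variant.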

The variance has the following properties.

  • 1. The variance of a constant is zero.
  • 2. Reducing all values of a characteristic by the same amount A does not change the variance. This means that the mean square of deviations can be calculated not from the given values of the characteristic but from their deviations from some constant.
  • 3. Reducing all values of a characteristic by a factor of k reduces the variance by a factor of k² and the standard deviation by a factor of k. In other words, all values of the characteristic can be divided by some constant (say, the width of the series interval), the standard deviation calculated, and the result then multiplied by that constant.
  • 4. The mean square of deviations calculated from any value A that differs from the arithmetic mean is always greater than the mean square of deviations calculated from the arithmetic mean. It is greater by a well-defined amount: the square of the difference between the mean and this conventionally chosen value.
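The four properties can be verified numerically. This is a minimal sketch on the ten blood pressure readings used earlier; the constants (shift by 100, division by 5, A = 120) are arbitrary choices for the illustration.

```python
def mean_square_deviation(values, center):
    """Mean of the squared deviations of the values from an arbitrary center."""
    return sum((x - center) ** 2 for x in values) / len(values)

def variance(values):
    """Variance: mean square deviation from the arithmetic mean."""
    return mean_square_deviation(values, sum(values) / len(values))

values = [115] * 3 + [120] * 6 + [125]
mean = sum(values) / len(values)
base = variance(values)

constant_var = variance([120] * 10)                  # property 1
shifted_var = variance([x - 100 for x in values])    # property 2
scaled_var = variance([x / 5 for x in values])       # property 3, k = 5
about_a = mean_square_deviation(values, 120)         # property 4, A = 120

print(base, constant_var, shifted_var, scaled_var, about_a)
```

Here `about_a` exceeds `base` by exactly the square of the difference between the mean and A, as property 4 states.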

The variation of an alternative characteristic consists in the presence or absence of the studied property in the units of the population. Quantitatively it is expressed by two values: the presence of the property in a unit is denoted by one (1) and its absence by zero (0). The proportion of units that have the property is denoted by $p$, and the proportion that do not by $q$, so that $q = 1 - p$. The variance of an alternative characteristic equals the product of these two proportions: $\sigma^2 = pq$. The variation is greatest when 50% of the population has the characteristic and the other 50% does not; the variance then reaches its maximum value of 0.25, i.e. $p = 0.5$, $q = 1 - p = 0.5$ and $\sigma^2 = 0.5 \cdot 0.5 = 0.25$. The lower limit of this indicator is zero, which corresponds to a population with no variation. In practice the variance of an alternative characteristic is used to construct confidence intervals in sample observation.
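A short sketch of this result: the product of the two shares is compared with the variance computed directly from 0/1 coded data, and the share that maximizes the variance is located on a coarse grid. The 30% share and the grid step are arbitrary illustrative choices.

```python
from statistics import pvariance

def alternative_variance(p):
    """Variance of a 0/1 characteristic whose share of ones is p."""
    return p * (1 - p)

# Hypothetical population: 3 units have the property, 7 do not (p = 0.3).
coded = [1] * 3 + [0] * 7
from_data = pvariance(coded)             # population variance of the 0/1 values
from_formula = alternative_variance(0.3)

# Search shares 0.0, 0.1, ..., 1.0 for the largest variance.
shares = [i / 10 for i in range(11)]
best_share = max(shares, key=alternative_variance)

print(from_data, from_formula, best_share, alternative_variance(best_share))
```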

The smaller the variance and the standard deviation, the more homogeneous the population and the more typical the mean. In statistical practice it is often necessary to compare the variation of different characteristics. For example, it is of interest to compare the variation in the age of workers and their qualifications, in length of service and wages, in cost and profit, in length of service and labor productivity, etc. Indicators of absolute variability are unsuitable for such comparisons: the variability of work experience, expressed in years, cannot be compared with the variation of wages, expressed in rubles. To make such comparisons, and to compare the variability of the same characteristic in several populations with different arithmetic means, relative indicators of variation are used: the oscillation coefficient, the linear coefficient of variation and the coefficient of variation, which measure the variability of the values relative to the mean.

Oscillation coefficient:

$$V_R = \frac{R}{\bar{x}} \cdot 100\%,$$

where $V_R$ is the oscillation coefficient; $R$ is the range of variation; $\bar{x}$ is the mean value of the characteristic for the population being studied.

Linear coefficient of variation:

$$V_{\bar{l}} = \frac{\bar{l}}{\bar{x}} \cdot 100\%,$$

where $V_{\bar{l}}$ is the linear coefficient of variation; $\bar{l}$ is the mean linear deviation; $\bar{x}$ is the mean value of the characteristic for the population being studied.

Coefficient of variation:

$$V_\sigma = \frac{\sigma}{\bar{x}} \cdot 100\%,$$

where $V_\sigma$ is the coefficient of variation; $\sigma$ is the standard deviation; $\bar{x}$ is the mean value of the characteristic for the population being studied.

The oscillation coefficient is the range of variation expressed as a percentage of the mean value of the characteristic, and the linear coefficient of variation is the mean linear deviation expressed as a percentage of the mean. The coefficient of variation is the standard deviation expressed as a percentage of the mean. As a relative quantity expressed in percent, the coefficient of variation is used to compare the degree of variation of different characteristics, and also to assess the homogeneity of a statistical population. If the coefficient of variation is below 33%, the population is considered homogeneous and the variation weak. If it is above 33%, the population is considered heterogeneous, the variation strong, and the mean atypical, so that it cannot be used as a generalizing indicator of the population. Coefficients of variation are also used to compare the variability of one characteristic in different populations, for example the variation in length of service of workers at two enterprises. The larger the coefficient, the more significant the variation of the characteristic.
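The three relative indicators and the 33% homogeneity rule can be sketched as follows, again on the ten blood pressure readings from the first part of the text. Each coefficient is taken as the corresponding absolute indicator divided by the mean and multiplied by 100; the names are illustrative.

```python
from math import sqrt

values = [115] * 3 + [120] * 6 + [125]
n = len(values)
mean = sum(values) / n

value_range = max(values) - min(values)
mean_linear_dev = sum(abs(x - mean) for x in values) / n
std_dev = sqrt(sum((x - mean) ** 2 for x in values) / n)

# Relative indicators, in percent of the mean.
v_range = value_range / mean * 100       # oscillation coefficient
v_linear = mean_linear_dev / mean * 100  # linear coefficient of variation
v_sigma = std_dev / mean * 100           # coefficient of variation

# Rule of thumb from the text: below 33% the population counts as homogeneous.
is_homogeneous = v_sigma < 33

print(round(v_range, 2), round(v_linear, 2), round(v_sigma, 2), is_homogeneous)
```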

From the calculated quartiles one can also compute a relative indicator of quartile variation:

$$V_Q = \frac{Q}{Q_2} \cdot 100\%,$$

where $Q_2$ and $Q$ are the second quartile (the median) and the quartile deviation.

The interquartile range is determined by the formula

$$\Delta Q = Q_3 - Q_1.$$

The quartile deviation is used instead of the range of variation to avoid the disadvantages of relying on extreme values:

$$Q = \frac{Q_3 - Q_1}{2}.$$

For variation series with unequal intervals, the distribution density is also calculated. It is the quotient of the frequency or relative frequency of an interval and the width of that interval. Accordingly, absolute and relative distribution densities are distinguished: the absolute density is the frequency per unit of interval width, and the relative density is the relative frequency per unit of interval width.
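A minimal sketch of both densities for a hypothetical series with unequal intervals. It also shows why density, not frequency, is used to pick the modal interval in such a series: the widest interval has the highest frequency but the lowest density.

```python
# Hypothetical series with unequal interval widths (10, 20 and 70 units).
bounds = [(0, 10), (10, 30), (30, 100)]
freqs = [20, 30, 50]
total = sum(freqs)

# Absolute density: frequency per unit of interval width.
abs_density = [f / (upper - lower) for (lower, upper), f in zip(bounds, freqs)]

# Relative density: relative frequency per unit of interval width.
rel_density = [(f / total) / (upper - lower) for (lower, upper), f in zip(bounds, freqs)]

most_frequent = bounds[freqs.index(max(freqs))]
densest = bounds[abs_density.index(max(abs_density))]

print(abs_density, most_frequent, densest)
```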

All of the above holds for distribution series whose distribution law is well described by the normal law or is close to it.

Statistical distribution series are the simplest form of grouping.

A statistical distribution series is an ordered quantitative distribution of population units into homogeneous groups according to a varying (attributive or quantitative) characteristic.

Depending on the characteristic underlying the formation of groups, attributive and variation distribution series are distinguished.

Attributive distribution series are those built on qualitative characteristics, i.e. characteristics that have no numerical expression. An example is the distribution of the economically active population of the Russian Federation by gender in 2010 (Table 3.10).

Table 3.10. Distribution of the economically active population of the Russian Federation by gender in 2010

Variation distribution series are those built on a quantitative characteristic, i.e. one that has a numerical expression.

A variation distribution series consists of two elements: variants and frequencies.

Variants are the individual values that the characteristic takes in the variation series.

Frequencies are the numbers of occurrences of individual variants or of each group of the variation series. They show how often particular values of the characteristic occur in the population being studied. The sum of all frequencies gives the size of the entire population, its volume.

Relative frequencies are frequencies expressed as fractions of one or as percentages of the total. Accordingly, the sum of the relative frequencies equals 1, or 100%.

Depending on the nature of the variation of the characteristic, discrete and interval variation distribution series are distinguished.

A discrete variation distribution series is a distribution series in which the groups are formed according to a characteristic that changes discontinuously, i.e. in steps of a certain number of units, and takes only integer values. An example is the distribution of the number of apartments built in the Russian Federation by number of rooms in 2010 (Table 3.11).

Table 3.11. Distribution of the number of constructed apartments in the Russian Federation by the number of rooms in them in 2010.

An interval variation distribution series is a distribution series in which the grouping characteristic can take any value within an interval, with values differing from one another by an arbitrarily small amount.

Interval variation series are appropriate above all for a continuously varying characteristic (Table 3.12), and also when a discrete characteristic varies over a wide range (Table 3.13), i.e. when the number of variants of the discrete characteristic is large.

Table 3.12. Distribution of the subjects of the Southern Federal District of the Russian Federation by area as of January 1, 2011

Table 3.13. Distribution of the subjects of the Central Federal District of the Russian Federation by number of municipal educational institutions as of January 1, 2011

The rules for constructing distribution series are similar to the rules for constructing groupings.

Analysis of distribution series can be clearly carried out based on their graphic image. For this purpose, a polygon, a histogram, and distributions are built.

Polygon used when depicting discrete variation distribution series. To build it in rectangular system coordinates along the abscissa axis on the same scale plot the ranked values ​​of the varying characteristic, and along the ordinate axis a scale is plotted to express the magnitude of the frequencies. Obtained at the intersection of the abscissa axis (X) and the ordinate axes (Y) are connected by straight lines, resulting in broken line, called a frequency polygon.

Histogram used to depict an interval variation series. When constructing a histogram, the values ​​of the intervals are plotted on the abscissa axis, and the frequencies are depicted by rectangles built on the corresponding intervals. The height of the columns should be proportional to the frequencies.

A histogram can be converted into a distribution polygon by connecting the midpoints of the top sides of the rectangles with straight lines.

When constructing a histogram for a variation series with unequal intervals, it is not the frequencies that are plotted along the ordinate axis but the density of the distribution of the characteristic in the corresponding intervals. Distribution density is the frequency calculated per unit of interval width, i.e. the number of units in each group that fall on one unit of interval width.

A cumulative curve (cumulate) can also be used to display a variation series graphically. The cumulative curve depicts a series of accumulated frequencies. Accumulated frequencies are determined by successively summing the frequencies over the groups.

When constructing the cumulative curve of an interval variation series, the variants of the series are plotted along the abscissa (X) axis and the accumulated frequencies along the ordinate (Y) axis. The accumulated frequencies are marked on the graph as perpendiculars to the abscissa axis erected at the upper limits of the intervals. The tops of these perpendiculars are then connected, giving a broken line, i.e. the cumulative curve.

If the X and Y axes are swapped when a variation series is depicted as a cumulative curve, the result is an ogive.
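The points of the cumulative curve and of the ogive can be sketched as follows. This is an illustration under stated assumptions, not a prescribed procedure: the upper limits and frequencies are invented, and the function names `cumulate_points` and `ogive_points` are hypothetical.

```python
from itertools import accumulate


def cumulate_points(upper_limits, frequencies):
    """Pair each interval's upper limit (X) with the accumulated frequency (Y)."""
    return list(zip(upper_limits, accumulate(frequencies)))


def ogive_points(upper_limits, frequencies):
    """Same points with the axes swapped: accumulated frequency on X."""
    return [(s, x) for x, s in cumulate_points(upper_limits, frequencies)]


# Hypothetical interval series (illustrative numbers only).
upper_limits = [110, 120, 130, 140]
frequencies = [2, 5, 8, 3]

print(cumulate_points(upper_limits, frequencies))
# [(110, 2), (120, 7), (130, 15), (140, 18)]
print(ogive_points(upper_limits, frequencies))
# [(2, 110), (7, 120), (15, 130), (18, 140)]
```

The last accumulated frequency equals the sum of all frequencies, which is a convenient check on the data.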

Distribution series constructed on the basis of a quantitative characteristic are called variation series.

A distribution series consists of variants (values of the characteristic) and frequencies (group sizes). Frequencies expressed as relative values (shares, percentages) are called relative frequencies. The sum of all frequencies is called the volume of the distribution series.

By type, distribution series are divided into discrete (constructed on discontinuous values of the characteristic) and interval (constructed on continuous values of the characteristic).

A variation series consists of two columns (or rows). One gives the individual values of the varying characteristic, called variants and denoted by x; the other gives absolute numbers showing how many times (how often) each variant occurs. The indicators in the second column are called frequencies and are conventionally denoted by f. Note once again that the second column may also contain relative indicators, which characterize the share of each variant's frequency in the total sum of frequencies. These relative indicators are called relative frequencies and are conventionally denoted by ω. The sum of all relative frequencies is then equal to one. Relative frequencies can also be expressed as percentages, in which case their sum is 100%.
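As an illustration, a discrete variation series with both columns can be built from raw observations in a few lines of Python. This is a minimal sketch: the observations are invented, and `variation_series` is a hypothetical helper name.

```python
from collections import Counter


def variation_series(observations):
    """Return rows (x, f, w) ranked by x: variant, frequency,
    and relative frequency (share of f in the total sum of frequencies)."""
    counts = Counter(observations)
    total = sum(counts.values())
    return [(x, f, f / total) for x, f in sorted(counts.items())]


# Hypothetical raw observations (illustrative numbers only).
observations = [3, 1, 2, 3, 3, 2, 4, 3, 2, 1]

series = variation_series(observations)
for x, f, w in series:
    print(x, f, w)
```

Here the frequencies sum to 10 (the volume of the series) and the relative frequencies sum to one.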

If the variants of a variation series are expressed as discrete quantities, the series is called discrete.

For continuous characteristics, variation series are constructed as interval series, i.e. the values of the characteristic in them are expressed as “from... to...”. The minimum value of the characteristic in such an interval is called the lower limit of the interval, and the maximum value is called the upper limit.

Interval variation series are also constructed for discrete characteristics that vary over a wide range. Interval series may have equal or unequal intervals.

Let us consider how the size of equal intervals is determined. We introduce the following notation:

i – interval size;

x_max – the maximum value of the characteristic among the population units;

x_min – the minimum value of the characteristic among the population units;

n – number of groups.

Then i = (x_max - x_min) / n, if n is known.

If the number of groups is difficult to determine in advance, then, given a sufficiently large population, the number of groups (and hence the interval size) can be estimated with the formula proposed by Sturges in 1926:

n = 1 + 3.322 lg N, where lg is the base-10 logarithm and N is the number of units in the population.
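A short Python sketch of the two calculations: the number of groups from the Sturges formula and the resulting equal interval size. Rounding n to the nearest integer is an assumption made here for illustration (the text does not fix a rounding rule), the logarithm is taken as base 10, and the function names and input numbers are hypothetical.

```python
import math


def sturges_groups(population_size):
    """n = 1 + 3.322 * lg N, rounded to the nearest integer (assumed rule)."""
    return round(1 + 3.322 * math.log10(population_size))


def equal_interval_size(x_min, x_max, n_groups):
    """Interval size: the range of the characteristic divided by the number of groups."""
    return (x_max - x_min) / n_groups


# Hypothetical population: 100 units, characteristic ranging from 20 to 60.
n = sturges_groups(100)              # 1 + 3.322 * 2 = 7.644, rounded to 8
i = equal_interval_size(20, 60, n)   # (60 - 20) / 8 = 5.0
print(n, i)
```

In practice the computed interval size is often rounded to a convenient value, which may change the number of groups slightly.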

The size of unequal intervals is determined in each individual case, taking into account the characteristics of the object of study.

The statistical distribution of a sample is the list of variants and their corresponding frequencies (or relative frequencies).

The statistical distribution of a sample can be given as a table whose first column contains the variants and whose second column contains the corresponding frequencies n_i or relative frequencies P_i.

Statistical distribution of the sample

Interval series are variation series in which the values of the characteristic underlying their formation are expressed within certain limits (intervals). The frequencies in this case refer not to individual values of the characteristic but to the entire interval.

Interval distribution series are constructed based on continuous quantitative characteristics, as well as on discrete characteristics that vary within significant limits.

An interval series can be represented by the statistical distribution of a sample indicating the intervals and their corresponding frequencies. In this case, the sum of the frequencies of the variants falling within this interval is taken as the frequency of the interval.
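The rule that an interval's frequency is the sum of the frequencies of the variants falling within it can be sketched in Python as follows. The boundary convention is an assumption made for the example: each interval includes its lower limit and excludes its upper limit, except the last, which includes both. The data and the name `interval_series` are hypothetical.

```python
def interval_series(discrete_series, x_min, width, n_groups):
    """Group (variant, frequency) pairs into n_groups equal intervals.

    Assumed convention: intervals are [lower, upper), and the last
    interval also includes its upper limit.
    """
    x_max = x_min + width * n_groups
    frequencies = [0] * n_groups
    for x, f in discrete_series:
        if not x_min <= x <= x_max:
            raise ValueError(f"variant {x} lies outside [{x_min}, {x_max}]")
        k = min(int((x - x_min) // width), n_groups - 1)
        frequencies[k] += f
    return [
        ((x_min + k * width, x_min + (k + 1) * width), frequencies[k])
        for k in range(n_groups)
    ]


# Hypothetical discrete series of (variant, frequency) pairs.
discrete = [(20, 1), (23, 2), (25, 4), (31, 3), (38, 2), (40, 1)]

grouped = interval_series(discrete, x_min=20, width=5, n_groups=4)
print(grouped)
# [((20, 25), 3), ((25, 30), 4), ((30, 35), 3), ((35, 40), 3)]
```

The total of the interval frequencies matches the total of the original frequencies, so no observation is lost in the grouping.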

When grouping by quantitative continuous characteristics, determining the size of the interval is important.

In addition to the sample mean and sample variance, other characteristics of the variation series are also used.

The mode is the variant that has the highest frequency.
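Finding the mode of a series given as variants with frequencies is a one-pass lookup. The sketch below returns every variant that attains the highest frequency, so the choice of a single mode in case of a tie is left to the reader. The data and the name `modes` are hypothetical.

```python
def modes(series):
    """Return all variants whose frequency equals the highest frequency."""
    top = max(f for _, f in series)
    return [x for x, f in series if f == top]


# Hypothetical (variant, frequency) pairs.
series = [(110, 2), (120, 4), (130, 3), (140, 1)]
print(modes(series))  # [120]

tied = [(110, 3), (120, 3), (130, 1)]
print(modes(tied))    # [110, 120]
```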

    Each value of the property under study that occurs in the population is called a value of the characteristic (a variant), and the change in this value is called variation. Variants are denoted by lowercase Latin letters with indices corresponding to the ordinal number of the group: x_i.

    The number that shows how many times each value of the characteristic occurs in the population under study is called the frequency and is denoted f_i. The sum of all frequencies of the series is equal to the volume of the population under study.

    Very often the accumulated frequency (S) needs to be calculated. The accumulated frequency for each value of the characteristic shows how many units of the population have a value of the characteristic no greater than the given value. It is calculated by successively adding the frequencies of the following values of the characteristic to the frequency of the first value:

S_1 = f_1, S_2 = f_1 + f_2, ..., S_i = f_1 + f_2 + ... + f_i.

Accumulation starts from the very first value of the characteristic.
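The accumulated frequencies are a running sum, and they answer the question "how many units have a value no greater than a given one" directly. A minimal Python sketch, assuming the variants are already sorted in ascending order; the data and function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def accumulated_frequencies(frequencies):
    """Running sum of the frequencies, starting from the first value."""
    return list(accumulate(frequencies))


def units_not_exceeding(variants, frequencies, value):
    """Number of units whose characteristic value is no greater than `value`.

    Assumes `variants` is sorted in ascending order.
    """
    k = bisect_right(variants, value)  # count of variants <= value
    return accumulated_frequencies(frequencies)[k - 1] if k else 0


# Hypothetical discrete series.
variants = [1, 2, 3, 4]
frequencies = [2, 3, 4, 1]

print(accumulated_frequencies(frequencies))           # [2, 5, 9, 10]
print(units_not_exceeding(variants, frequencies, 3))  # 9
```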

The frequencies of the series (f_i) can in some cases be replaced by relative frequencies (ω_i), i.e. the share of each frequency in the total sum of frequencies.

The sum of the relative frequencies is always equal to one, or 100%. Replacing frequencies with relative frequencies makes it possible to compare variation series with different numbers of observations.

If the variation series has unequal intervals, then to get a correct idea of the nature of the distribution it is necessary to calculate the absolute or relative density of the distribution.

    Absolute distribution density (p_f) is the frequency per unit of interval size for a given group of the series:

p_f = f / i.

    Relative distribution density (p_ω) is the relative frequency per unit of interval size for a given group of the series:

p_ω = ω / i.
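Both densities can be computed together for a series with unequal intervals. The sketch below is illustrative only: the intervals and frequencies are invented, and `densities` is a hypothetical helper name. The relative frequency is taken as the frequency divided by the sum of all frequencies.

```python
def densities(intervals, frequencies):
    """Return (absolute density, relative density) for each interval.

    absolute density = f / i, relative density = w / i,
    where i is the interval width and w = f / sum of all frequencies.
    """
    total = sum(frequencies)
    rows = []
    for (lo, hi), f in zip(intervals, frequencies):
        width = hi - lo
        rows.append((f / width, (f / total) / width))
    return rows


# Hypothetical series with unequal intervals.
intervals = [(0, 10), (10, 30), (30, 80)]
frequencies = [20, 30, 50]

for (lo, hi), (p_f, p_w) in zip(intervals, densities(intervals, frequencies)):
    print(f"{lo}-{hi}: absolute density {p_f:.2f}, relative density {p_w:.3f}")
```

In this example the widest interval has the largest frequency (50) but the smallest density, which is why densities rather than frequencies are plotted for unequal intervals.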

For series with unequal intervals, only these characteristics give a more accurate idea of the nature of the distribution than frequency and relative frequency do.

    The statistical distribution of a sample is the list of variants (values of the characteristic) and their corresponding frequencies or distribution densities, or relative frequencies or relative distribution densities.

Different distribution series are characterized by different sets of frequency characteristics:

for attribute series, the minimal set (frequency, relative frequency);

for discrete series, four characteristics (frequency, relative frequency, accumulated frequency, accumulated relative frequency);

for interval series, all five (frequency, relative frequency, accumulated frequency, accumulated relative frequency, and distribution density, absolute and relative).

  1. Rules for constructing an interval variation series

  1. Graphic representation of variation series

The first stage of studying a variation series is to construct its graphical representation. A graphical representation makes variation series easier to analyze and allows one to judge the shape of the distribution. In statistics, a variation series is represented graphically by a histogram, a polygon, and a cumulative curve (cumulate) of the distribution.

A discrete variation series is depicted as a so-called frequency polygon.

To display an interval series, a frequency distribution polygon and a frequency histogram are used.

Graphs are constructed in a rectangular coordinate system.