General population and sample: basic concepts. Population and the sampling method

In the previous section, we were interested in the distribution of a characteristic over a certain set of elements. The set of all elements that have this characteristic is called the general population. If the characteristic is a human one (nationality, education, IQ, etc.), then the general population is the entire population of the Earth. This is a very large collection, that is, the number of elements n in it is large. The number of elements is called the volume (size) of the population. Populations can be finite or infinite. The general population of all people, although very large, is of course finite. The general population of all stars is probably infinite.

If a researcher measures some continuous random variable X, then each measurement result can be considered an element of some hypothetical unlimited population. In this general population, the countless results are distributed according to a probability law under the influence of instrument errors, inattention of the experimenter, random interference in the phenomenon itself, etc.

If we carry out n repeated measurements of a random variable X, that is, we obtain n specific different numerical values, then this experimental result can be considered a sample of volume n from a hypothetical general population of results of single measurements.

It is natural to take the arithmetic mean of the results as the true value of the measured quantity. Such a function of the n measurement results is called a statistic; it is itself a random variable with a certain distribution, called the sampling distribution. Determining the sampling distribution of a particular statistic is the most important task of statistical analysis. It is clear that this distribution depends on the sample size n and on the distribution of the random variable X of the hypothetical population. The sampling distribution of the statistic $\bar{X}$ is the distribution of $\bar{X}$ in the infinite population of all possible samples of size n from the original population.

You can also measure a discrete random variable.

Let the measurement of a random variable X be a throw of a regular homogeneous triangular pyramid (tetrahedron) whose faces carry the numbers 1, 2, 3, 4. The discrete random variable X has a simple uniform distribution: $P(X = i) = 1/4$ for $i = 1, 2, 3, 4$.

The experiment can be performed an unlimited number of times. The hypothetical theoretical population is an infinite population containing equal shares (0.25 each) of four different elements, designated by the numbers 1, 2, 3, 4. A series of n repeated throws of the pyramid, or a simultaneous throw of n identical pyramids, can be considered a sample of volume n from this general population. As a result of the experiment, we have n numbers. We can introduce certain functions of these quantities, called statistics, which can be associated with particular parameters of the general distribution.

The most important numerical characteristics of distributions are the probabilities $P_i$, the mathematical expectation M and the variance D. The statistics for the probabilities $P_i$ are the relative frequencies $n_i/n$, where $n_i$ is the frequency of result i (i = 1, 2, 3, 4) in the sample. The mathematical expectation M corresponds to the statistic

$\bar{X} = \frac{1}{n}\sum_{j=1}^{n} x_j$,

which is called the sample mean. The sample variance, a statistic built from the squared deviations of the sample values from $\bar{X}$, corresponds to the general variance D.

The relative frequency of any outcome i (i = 1, 2, 3, 4) in a series of n repeated trials (or in samples of size n from the population) follows the binomial law: the frequency $n_i$ takes the value k with probability $C_n^k (0.25)^k (0.75)^{n-k}$.

This distribution has a mathematical expectation equal to 0.25 (it does not depend on n) and a standard deviation equal to $\sqrt{0.25 \cdot 0.75 / n}$ (it decreases as n increases). It is the sampling distribution of a statistic, namely the relative frequency of any one of the four possible outcomes of a single pyramid throw in n repeated trials. If we were to select from the infinite general population, in which four different elements (i = 1, 2, 3, 4) have equal shares of 0.25, all possible samples of size n (their number is also infinite), we would obtain the so-called mathematical sample of size n. In this sample, the frequency of each of the elements (i = 1, 2, 3, 4) is distributed according to the binomial law.
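The mean and standard deviation of the relative frequency can be checked numerically. The following is a minimal sketch, not part of the original text: the function names are illustrative, and the probability 0.25 is the one assumed for a regular pyramid.

```python
from math import comb, sqrt

def relative_frequency_distribution(n, p=0.25):
    """Return {k/n: probability} for k = 0..n under the binomial law."""
    return {k / n: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

def mean_and_sd(dist):
    """Mathematical expectation and standard deviation of a discrete distribution."""
    mean = sum(value * prob for value, prob in dist.items())
    var = sum((value - mean) ** 2 * prob for value, prob in dist.items())
    return mean, sqrt(var)

# Compare the computed standard deviation with sqrt(0.25 * 0.75 / n).
for n in (4, 16, 64):
    mean, sd = mean_and_sd(relative_frequency_distribution(n))
    print(n, round(mean, 4), round(sd, 4), round(sqrt(0.25 * 0.75 / n), 4))
```

The expectation stays at 0.25 for every n, while the standard deviation halves each time n is multiplied by four.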

Suppose we threw this pyramid four times and the number two came up 3 times ($n = 4$, $n_2 = 3$). We can find the probability of this outcome using the sampling distribution. It is equal to $C_4^3 (0.25)^3 (0.75) = 3/64 \approx 0.047$.

Our result was quite unlikely: it occurs approximately once in twenty series of four throws. In biology, such a result is usually considered practically impossible. In this case doubts arise: is the pyramid regular and homogeneous, is the equality $P_i = 0.25$ true for a single throw, and are the distribution of X and, therefore, the sampling distribution correct?

To resolve the doubt, the pyramid must be thrown four more times. If the same result appears again, the probability of two such results in a row is very small. It is clear that we have obtained an almost completely impossible result. Therefore, the assumed original distribution is incorrect. Obviously, if the second result turns out to be even more unlikely, there is even more reason to take a close look at this “regular” pyramid. If, on the other hand, the repeated experiment gives a sufficiently probable result, then we can assume that the pyramid is regular and that the first result was also legitimate, but simply improbable.
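The probabilities discussed here can be reproduced exactly. This sketch assumes series of n = 4 throws (as the repeated "four times" suggests) and a regular pyramid with probability 1/4 for each face; the function name is illustrative.

```python
from fractions import Fraction
from math import comb

def binomial_probability(k, n, p):
    """Probability of exactly k occurrences in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Face "2" comes up 3 times in 4 throws of a regular pyramid.
p_single = binomial_probability(3, 4, Fraction(1, 4))
# The same result in two independent series of four throws.
p_twice = p_single**2

print(p_single, float(p_single))  # 3/64 0.046875
print(p_twice, float(p_twice))
```

A probability of 3/64 is roughly one in twenty, which matches the frequency quoted in the text; two such series in a row are expected only about twice in a thousand attempts.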

We could also have skipped checking the regularity and homogeneity of the pyramid and assumed a priori that the pyramid is regular and homogeneous and, therefore, that the sampling distribution is correct. Next, we should find out what knowledge of the sampling distribution gives us for studying the general population. Since establishing the sampling distribution is the main goal of statistical research, the detailed description of the pyramid experiments can be considered justified.

We assume that the sampling distribution is correct. Then the experimental values of the relative frequency in different series of n throws of the pyramid will be grouped around the value 0.25, which is the center of the sampling distribution and the exact value of the estimated probability. In this case, the relative frequency is said to be an unbiased estimate. Since the variance of the sampling distribution tends to zero as n increases, the experimental values of the relative frequency will be grouped more and more closely around the mathematical expectation of the sampling distribution as the sample size grows. Therefore, the relative frequency is a consistent estimate of the probability.

If the pyramid turned out to be irregular and inhomogeneous, then the sampling distributions for different i (i = 1, 2, 3, 4) would have different mathematical expectations (different $P_i$) and variances.

Note that the binomial sampling distributions obtained here are, for large n, well approximated by the normal distribution with mean 0.25 and standard deviation $\sqrt{0.25 \cdot 0.75 / n}$, which greatly simplifies the calculations.
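How good the normal approximation is can be seen by comparing the two distribution functions. The sketch below makes its own choices that are not in the text: n = 100, a continuity correction of half a step, and illustrative function names.

```python
from math import comb, erf, sqrt

def binom_cdf(k, n, p):
    """Exact probability that the frequency does not exceed k."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_cdf(x, mu, sigma):
    """Distribution function of the normal law with mean mu and sd sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p = 100, 0.25
mu, sigma = p, sqrt(p * (1 - p) / n)  # parameters of the relative frequency

for k in (15, 20, 25, 30, 35):
    exact = binom_cdf(k, n, p)
    # (k + 0.5) / n applies a continuity correction to the relative frequency.
    approx = normal_cdf((k + 0.5) / n, mu, sigma)
    print(k, round(exact, 4), round(approx, 4))
```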

Let's continue the random experiment of throwing a regular, homogeneous triangular pyramid. The random variable X associated with this experiment has the distribution $P(X = i) = 0.25$, i = 1, 2, 3, 4. The mathematical expectation here is $M = (1 + 2 + 3 + 4) \cdot 0.25 = 2.5$.

Let us carry out n throws, which is equivalent to a random sample of size n from a hypothetical, infinite population containing equal shares (0.25) of four different elements. We obtain n sample values of the random variable X ($x_1, x_2, \ldots, x_n$). Let's choose as the statistic the sample mean $\bar{X}$. The value $\bar{X}$ is itself a random variable whose distribution depends on the sample size and on the distribution of the original random variable X. The value $\bar{X}$ is the averaged sum of n identically distributed random variables. It is clear that $M(\bar{X}) = M = 2.5$.

Therefore, the statistic $\bar{X}$ is an unbiased estimate of the mathematical expectation. It is also a consistent estimate, because $D(\bar{X}) = D/n \to 0$ as $n \to \infty$.

Thus, the theoretical sampling distribution of $\bar{X}$ has the same mathematical expectation as the original distribution, while its variance is reduced by a factor of n.

Recall that D is equal to $D = \sum_{i=1}^{4} (i - 2.5)^2 \cdot 0.25 = 1.25$.
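The expectation and variance of a single throw, and the variance of the sample mean, can be verified with exact arithmetic. This is a small sketch under the assumption of a regular pyramid; the names M, D and variance_of_sample_mean are chosen here for illustration.

```python
from fractions import Fraction

faces = [1, 2, 3, 4]
p = Fraction(1, 4)  # equal probability of each face

# Mathematical expectation and variance of a single throw.
M = sum(x * p for x in faces)
D = sum((x - M) ** 2 * p for x in faces)

def variance_of_sample_mean(n):
    """Variance of the mean of n independent throws: D / n."""
    return D / n

print(M, D)  # 5/2 5/4
for n in (1, 4, 16):
    print(n, variance_of_sample_mean(n))
```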

The mathematical, abstract infinite sample associated with samples of size n from the general population and with the chosen statistic $\bar{X}$ will contain, in our case, $3n + 1$ different elements. For example, if n = 4, then the mathematical sample will contain elements with the statistic values 1, 1.25, 1.5, ..., 4. There will be 13 different elements in total. The share of the extreme elements 1 and 4 in the mathematical sample will be minimal: all elementary outcomes have equal probabilities, and among the $4^4 = 256$ elementary outcomes of throwing the pyramid four times only one is favorable to each of them. As the statistic approaches the average value, the probabilities increase. For example, the value 1.25 is realized by 4 elementary outcomes, the value 1.5 by 10, etc. Accordingly, the share of the element 1.5 in the mathematical sample is larger.

The average value 2.5 will have the maximum probability. As n increases, the experimental results will cluster more closely around the average value. The fact that the mathematical expectation of the sample mean is equal to the mean of the original population is often used in statistics.
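For small n the sampling distribution of the mean can be obtained by listing all elementary outcomes. The sketch below does this for n = 4 throws of a regular pyramid; the function name is illustrative and the brute-force enumeration is only practical for small n.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def sample_mean_distribution(n, faces=(1, 2, 3, 4)):
    """Exact distribution of the sample mean over all len(faces)**n outcomes."""
    counts = Counter(Fraction(sum(outcome), n) for outcome in product(faces, repeat=n))
    total = len(faces) ** n
    return {mean: Fraction(count, total) for mean, count in sorted(counts.items())}

dist = sample_mean_distribution(4)
for mean, prob in dist.items():
    # Each probability is a multiple of 1/256; print the number of outcomes.
    print(float(mean), prob * 256)
```

The listing shows one outcome each for the extreme means 1 and 4, and the largest number of outcomes for the mean 2.5.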

If you carry out the probability calculations for the sampling distribution with n = 4, you can see that even with such a small n the sampling distribution looks like a normal one. It is symmetric, and the value 2.5 is simultaneously its median, mode and mathematical expectation. As n increases, it is well approximated by the corresponding normal distribution, even if the original distribution is rectangular (uniform). If the original distribution is normal, then the sample mean standardized with the sample standard deviation has the Student distribution for any n.

To estimate the general variance, a more complex statistic $S^2$ must be chosen, one that provides an unbiased and consistent estimate. In the sampling distribution of $S^2$ the mathematical expectation is equal to D, and the variance tends to zero as n increases. With large sample sizes, the sampling distribution can be considered normal. For small n and a normal original distribution, the sampling distribution of $S^2$ is, up to a scale factor, the $\chi^2$ (chi-square) distribution.
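Unbiasedness of the variance estimate can be illustrated on the pyramid by averaging over all elementary outcomes. The sketch assumes the usual form of $S^2$ with the divisor n - 1 (the text does not show the formula), and compares it with the divisor n; the parameter name ddof is borrowed from common usage and is not from the text.

```python
from fractions import Fraction
from itertools import product

def sample_variance(values, ddof):
    """Sum of squared deviations from the sample mean, divided by n - ddof."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / (n - ddof)

def expected_sample_variance(n, ddof, faces=(1, 2, 3, 4)):
    """Exact expectation of the statistic over all equally likely outcomes."""
    outcomes = list(product(faces, repeat=n))
    return sum(sample_variance(o, ddof) for o in outcomes) / len(outcomes)

print(expected_sample_variance(4, ddof=1))  # divisor n - 1
print(expected_sample_variance(4, ddof=0))  # divisor n
```

With the divisor n - 1 the expectation equals the general variance 1.25 exactly; with the divisor n it is smaller by the factor (n - 1)/n.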

Above we have tried to present the first steps of a researcher trying to carry out a simple statistical analysis of repeated experiments with a regular homogeneous triangular pyramid (tetrahedron). In this case, we know the original distribution. It is possible, in principle, to obtain theoretically the sampling distributions of the relative frequency, the sample mean and the sample variance as functions of the number of repeated experiments n. For large n, all these sampling distributions approach the corresponding normal distributions, since they are distribution laws of sums of independent random variables (central limit theorem). So we know the expected results.

Repeated experiments or samples will provide estimates of the parameters of the sampling distributions. We argued that the experimental estimates would be correct. We did not perform these experiments and did not even present the experimental results obtained by other researchers. It can be emphasized that when determining distribution laws, theoretical methods are used more often than direct experiments.

Mathematical statistics is a science that, based on the methods of probability theory, deals with the systematization and processing of statistical data in order to obtain scientific and practical conclusions.

Statistical data are information about the number of objects that have certain characteristics.

A group of objects united by some qualitative or quantitative characteristic is called a statistical population. The objects included in a population are called its elements, and their total number is its volume (size).

The general population is the set of all conceivable observations that could be made under a given real set of conditions or, more strictly, the general population is the random variable ξ and the associated probability space $(\Omega, \mathcal{F}, P)$.

The distribution of the random variable ξ is called the population distribution (one speaks, for example, of a normally distributed or simply normal population).

For example, if a number of independent measurements of a random variable ξ are made, then the general population is theoretically infinite (i.e., the general population is an abstract, conventionally mathematical concept); if the number of defective items in a batch of N items is checked, then this batch is considered a finite general population of volume N.

In the case of socio-economic research, the general population of volume N can be the population of a city, region or country, and the measured characteristics can be income, expenses or the amount of savings of an individual person. If some attribute is of a qualitative nature (for example, gender, nationality, social status, occupation, etc.), but belongs to a finite set of options, then it can also be encoded as a number (as is often done in questionnaires).

If the number of objects N is large enough, then it is difficult and sometimes physically impossible to conduct a comprehensive survey (for example, check the quality of all cartridges). Then a limited number of objects are randomly selected from the entire population and subjected to study.

A sample population, or simply a sample of volume n, is a sequence $x_1, x_2, \ldots, x_n$ of independent identically distributed random variables, the distribution of each of which coincides with the distribution of the random variable ξ.

For example, the results of the first n measurements of a random variable ξ are customarily regarded as a sample of size n from an infinite population. The data obtained are called observations of the random variable ξ, and one also says that the random variable ξ “takes on the values” $x_1, x_2, \ldots, x_n$.


The main task of mathematical statistics is to draw scientifically grounded conclusions about the distribution of one or more unknown random variables or about their relationship with each other. The method in which conclusions about the numerical characteristics and the distribution law of a random variable (the general population) are drawn from the properties and characteristics of a sample is called the sampling method.

In order for the characteristics of a random variable obtained by the sampling method to be objective, the sample must be representative, i.e., it must represent the quantity under study sufficiently well. By virtue of the law of large numbers, it can be argued that the sample will be representative if it is drawn randomly, i.e., if all objects in the population have the same probability of being included in the sample. There are different types of selection for this purpose.

1. Simple random selection is a selection in which objects are drawn one at a time from the entire population.

2. Stratified selection consists in dividing the original population of volume N into subsets (strata) of volumes $N_1, N_2, \ldots, N_k$, so that $N_1 + N_2 + \ldots + N_k = N$. Once the strata are determined, a simple random sample of volume $n_1, n_2, \ldots, n_k$, respectively, is drawn from each of them. A special case of stratified selection is typical selection, in which objects are selected not from the entire population, but from each typical part of it.

3. Combined selection combines several types of selection at once, forming different phases of a sample survey. There are other sampling methods as well.
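Stratified selection can be sketched in a few lines. The example assumes proportional allocation, $n_k \approx n N_k / N$, which the text does not prescribe; the stratum names and sizes are hypothetical.

```python
import random

def stratified_sample(strata, n, rng):
    """Draw a simple random sample from each stratum.

    strata maps a stratum name to a list of its objects. The stratum sample
    sizes use proportional allocation (an assumption of this sketch); because
    of rounding they may not add up to n exactly.
    """
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        n_k = round(n * len(members) / total)
        sample[name] = rng.sample(members, n_k)
    return sample

# Hypothetical population of N = 1000 objects split into three strata.
strata = {
    "A": [("A", i) for i in range(600)],
    "B": [("B", i) for i in range(300)],
    "C": [("C", i) for i in range(100)],
}
sample = stratified_sample(strata, 50, random.Random(1))
print({name: len(chosen) for name, chosen in sample.items()})
```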

The sample is called repeated (with replacement) if the selected object is returned to the population before the next one is selected. The sample is called non-repeated (without replacement) if the selected object is not returned to the population. For a finite population, random selection without replacement makes the individual observations dependent at each step, whereas random equally likely selection with replacement makes the observations independent. In practice, we usually deal with non-repeated samples. However, when the population size N is many times larger than the sample size n (for example, hundreds or thousands of times), the dependence of the observations can be neglected.
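The difference between the two kinds of sample is easy to see in code. This is a minimal sketch using Python's standard random module; the population of 20 numbered objects and the function names are hypothetical.

```python
import random

def sample_with_replacement(population, n, rng):
    """Repeated sample: each object is returned before the next draw."""
    return [rng.choice(population) for _ in range(n)]

def sample_without_replacement(population, n, rng):
    """Non-repeated sample: a selected object cannot be drawn again."""
    return rng.sample(population, n)

rng = random.Random(0)
population = list(range(1, 21))  # hypothetical population of N = 20 objects

repeated = sample_with_replacement(population, 10, rng)
non_repeated = sample_without_replacement(population, 10, rng)
print(repeated)       # may contain the same object more than once
print(non_repeated)   # all objects are distinct
```

A repeated sample may be larger than the population itself, whereas a non-repeated one cannot exceed N.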

Thus, a random sample $x_1, x_2, \ldots, x_n$ is the result of sequential and independent observations of the random variable ξ, which represents the general population, and all elements of the sample have the same distribution as the original random variable ξ.

We will call the distribution function $F_\xi(x)$ and the numerical characteristics of the random variable ξ theoretical, in contrast to the sample characteristics, which are determined from the results of observations.

Let the sample be the result of n independent observations of a random variable ξ, in which the value $x_1$ was observed $n_1$ times, $x_2$ was observed $n_2$ times, ..., $x_k$ was observed $n_k$ times, so that $\sum_{i=1}^{k} n_i = n$ is the sample size. The number $n_i$, showing how many times the value $x_i$ appeared in n observations, is called the frequency of that value, and the ratio $n_i/n = w_i$ is its relative frequency. Obviously the numbers $w_i$ are rational and $\sum_{i=1}^{k} w_i = 1$.
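Frequencies and relative frequencies are straightforward to compute. In the sketch below the ten observations are hypothetical values invented for illustration, not data from the text.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical sample of n = 10 observations of a discrete random variable.
observations = [2, 1, 4, 2, 3, 2, 1, 4, 4, 2]

n = len(observations)
frequencies = Counter(observations)  # n_i for each observed value x_i
relative = {x: Fraction(count, n) for x, count in sorted(frequencies.items())}

for x, w in relative.items():
    print(x, frequencies[x], w)

# The frequencies add up to n and the relative frequencies to 1.
print(sum(frequencies.values()), sum(relative.values()))
```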

A statistical population arranged in ascending order of a characteristic is called a variation series. Its members are denoted $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ and are called variants. The variation series is called discrete if its members take specific isolated values. The statistical distribution of a sample of a discrete random variable ξ is the list of variants and their corresponding relative frequencies $w_i$. The resulting table is called a statistical series.

x_(1)   x_(2)   ...   x_(k)
w_1     w_2     ...   w_k

The smallest and largest values of the variation series are denoted by $x_{min}$ and $x_{max}$ and are called the extreme members of the variation series.

If a continuous random variable is studied, then grouping consists of dividing the interval of observed values into k partial intervals of equal length h and counting the number of observations that fall into each of these intervals. The resulting numbers are taken as the frequencies $n_i$ (of a new, now discrete, random variable). The midpoints of the intervals are usually taken as the new values of the variants $x_i$ (or the intervals themselves are given in the table). According to the Sturges formula, the recommended number of partition intervals is $k \approx 1 + \log_2 n$, and the length of the partial intervals is $h = (x_{max} - x_{min})/k$. It is assumed that the entire interval has the form $[x_{min}, x_{max}]$.
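The Sturges formula and the interval length can be written as two short functions. Rounding k up to the next integer is a choice made in this sketch; the text only gives the approximate equality.

```python
from math import ceil, log2

def sturges_intervals(n):
    """Recommended number of intervals, 1 + log2(n), rounded up (an assumption)."""
    return ceil(1 + log2(n))

def interval_length(x_min, x_max, k):
    """Length h of each of the k equal partial intervals."""
    return (x_max - x_min) / k

for n in (30, 100, 1000):
    print(n, round(1 + log2(n), 2), sturges_intervals(n))
```

Because the logarithm grows slowly, even a tenfold increase in the sample size adds only three or four intervals.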

Graphically, statistical series can be presented in the form of a polygon, a histogram or a graph of accumulated frequencies.

A frequency polygon is a broken line whose segments connect the points $(x_1, n_1), (x_2, n_2), \ldots, (x_k, n_k)$. A polygon of relative frequencies is a broken line whose segments connect the points $(x_1, w_1), (x_2, w_2), \ldots, (x_k, w_k)$. Polygons usually serve to represent a sample in the case of discrete random variables (Fig. 7.1.1).

Fig. 7.1.1.

A relative frequency histogram is a stepped figure consisting of rectangles whose bases are the partial intervals of length h and whose heights are equal to $w_i/h$.

A histogram is usually used to depict a sample in the case of continuous random variables. The area of the histogram is equal to one (Fig. 7.1.2). If you connect the midpoints of the upper sides of the rectangles of a relative frequency histogram, the resulting broken line forms a polygon of relative frequencies. Therefore, a histogram can be viewed as a graph of the empirical (sample) distribution density $f_n(x)$. If the theoretical distribution has a finite density, then the empirical density is an approximation of the theoretical one.

A graph of accumulated frequencies is a figure constructed similarly to a histogram, with the difference that the heights of the rectangles are calculated not from the simple but from the accumulated relative frequencies, i.e., the quantities $w_1 + w_2 + \ldots + w_i$. These values do not decrease, and the graph of accumulated frequencies has the form of a stepped “staircase” (from 0 to 1).

The graph of accumulated frequencies is used in practice to approximate the theoretical distribution function.
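For ungrouped data the same idea gives a step function equal to the share of observations that do not exceed x. The sketch below is an illustration added here, not a construction from the text; the five sample values are hypothetical.

```python
from bisect import bisect_right

def empirical_cdf(sample):
    """Return F(x) = share of observations less than or equal to x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many ordered values are <= x.
        return bisect_right(ordered, x) / n

    return F

F = empirical_cdf([3, 1, 2, 2, 4])  # hypothetical observations
for x in (0, 1, 2, 3, 4, 5):
    print(x, F(x))
```

The function never decreases and climbs from 0 to 1, like the "staircase" described above.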

Task. A sample of 100 small enterprises in the region is analyzed. The purpose of the survey is to measure the ratio of borrowed and equity funds (x i) at each i-th enterprise. The results are presented in Table 7.1.1.

Table 7.1.1. Ratios of debt and equity capital of enterprises.

5.56 5.45 5.48 5.45 5.39 5.37 5.46 5.59 5.61 5.31
5.46 5.61 5.11 5.41 5.31 5.57 5.33 5.11 5.54 5.43
5.34 5.53 5.46 5.41 5.48 5.39 5.11 5.42 5.48 5.49
5.36 5.40 5.45 5.49 5.68 5.51 5.50 5.68 5.21 5.38
5.58 5.47 5.46 5.19 5.60 5.63 5.48 5.27 5.22 5.37
5.33 5.49 5.50 5.54 5.40 5.58 5.42 5.29 5.05 5.79
5.79 5.65 5.70 5.71 5.85 5.44 5.47 5.48 5.47 5.55
5.67 5.71 5.73 5.05 5.35 5.72 5.49 5.61 5.57 5.69
5.54 5.39 5.32 5.21 5.73 5.59 5.38 5.25 5.26 5.81
5.27 5.64 5.20 5.23 5.33 5.37 5.24 5.55 5.60 5.51

Construct a histogram and graph of accumulated frequencies.

Solution. Let's build a grouped series of observations:

1. Let us determine in the sample x min = 5.05 and x max = 5.85;

2. Let's divide the entire range into k equal intervals: $k \approx 1 + \log_2 100 = 7.64$; k = 8, hence the length of the interval is $h = (5.85 - 5.05)/8 = 0.1$.

Table 7.1.2. Grouped series of observations

No.  Interval    Midpoint x_i  w_i   Accumulated w_i  f_n(x)
1    5.05-5.15   5.1           0.05  0.05             0.5
2    5.15-5.25   5.2           0.08  0.13             0.8
3    5.25-5.35   5.3           0.12  0.25             1.2
4    5.35-5.45   5.4           0.20  0.45             2.0
5    5.45-5.55   5.5           0.26  0.71             2.6
6    5.55-5.65   5.6           0.15  0.86             1.5
7    5.65-5.75   5.7           0.10  0.96             1.0
8    5.75-5.85   5.8           0.04  1.00             0.4

Fig. 7.1.3 and 7.1.4, built from the data in Table 7.1.2, present the histogram and the graph of accumulated frequencies. The curves correspond to the density and the distribution function of the normal distribution "fitted" to the data.
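The grouped series can be recomputed from the 100 observations of Table 7.1.1. The sketch below rests on one assumption that the text does not state: a value lying exactly on a boundary is assigned to the lower interval (intervals closed on the right). With this convention the counts agree with Table 7.1.2.

```python
from itertools import accumulate

RAW = """
5.56 5.45 5.48 5.45 5.39 5.37 5.46 5.59 5.61 5.31
5.46 5.61 5.11 5.41 5.31 5.57 5.33 5.11 5.54 5.43
5.34 5.53 5.46 5.41 5.48 5.39 5.11 5.42 5.48 5.49
5.36 5.40 5.45 5.49 5.68 5.51 5.50 5.68 5.21 5.38
5.58 5.47 5.46 5.19 5.60 5.63 5.48 5.27 5.22 5.37
5.33 5.49 5.50 5.54 5.40 5.58 5.42 5.29 5.05 5.79
5.79 5.65 5.70 5.71 5.85 5.44 5.47 5.48 5.47 5.55
5.67 5.71 5.73 5.05 5.35 5.72 5.49 5.61 5.57 5.69
5.54 5.39 5.32 5.21 5.73 5.59 5.38 5.25 5.26 5.81
5.27 5.64 5.20 5.23 5.33 5.37 5.24 5.55 5.60 5.51
"""

# Work in integer hundredths to avoid rounding trouble at interval boundaries.
values = [round(float(token) * 100) for token in RAW.split()]
n = len(values)
lo, hi, k = min(values), max(values), 8
h = (hi - lo) // k  # 10 hundredths, i.e. 0.1

counts = [0] * k
for v in values:
    # Intervals (a, b]; the minimum itself goes into the first interval.
    counts[max(0, (v - lo - 1) // h)] += 1

rel = [c / n for c in counts]                   # relative frequencies w_i
cum = [c / n for c in accumulate(counts)]       # accumulated frequencies
density = [c * 100 / (n * h) for c in counts]   # w_i / h with h = 0.1

for i in range(k):
    left, right = (lo + i * h) / 100, (lo + (i + 1) * h) / 100
    print(f"{left:.2f}-{right:.2f}  {rel[i]:.2f}  {cum[i]:.2f}  {density[i]:.1f}")
```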

Thus, the sample distribution is some approximation of the population distribution.

The entire array of individuals of a certain category is called the general population. The size of the population is determined by the objectives of the study.

If a species of wild animal or plant is studied, then the general population will be all individuals of this species. In this case, the volume of the general population will be very large and in calculations it is taken as an infinitely large value.

If the effect of an agent on plants and animals of a certain category is being studied, then the general population will be all plants and animals of that category (species, sex, age, economic purpose) to which the experimental objects belonged. This is no longer a very large number of individuals, but it is still inaccessible for comprehensive study.

The general population is not always inaccessible to comprehensive study. Sometimes small populations are studied; for example, the average milk yield or the average wool clip of a group of animals assigned to a certain worker is determined. In such cases the population is a very small number of individuals, all of which are studied. A small population also arises when plants or animals held in a collection are studied in order to characterize a certain group in that collection.

Characteristics of group properties (the mean, the standard deviation, etc.) related to the entire population are called general parameters.

A sample is a group of objects distinguished by three features:

1. it is part of the general population;

2. it is selected at random in a certain way;

3. it is studied in order to characterize the entire population.

In order to obtain a fairly accurate characteristic of the entire population from a sample, it is necessary to organize the correct selection of objects from the population.

Theory and practice have developed several systems for selecting individuals for sampling. All these systems are based on the desire to provide the maximum opportunity to select any object from the general population. Tendency and bias in the selection of objects for a sample study prevent the receipt of correct general conclusions and make the results of a sample study non-indicative of the entire population, i.e., unrepresentative.

To obtain a correct, undistorted characteristic of the entire population, it is necessary to strive to ensure the possibility of selecting any object from any part of the population into the sample. This basic requirement must be fulfilled the more strictly, the more variable the trait being studied. It is understandable that when diversity approaches zero, such as in the case of studies of hair or feather color in some species, any method of sample selection will produce representative results.

In various studies, the following methods of selecting objects for the sample are used.

1. Random repeated selection, in which the objects of study are selected from the general population without regard to the development of the characteristic being studied, i.e., in a random (with respect to this characteristic) order; after selection, each object is studied and then returned to its population, so that any object can be selected again. This method of selection is equivalent to selection from an infinitely large general population, for which the main relationships between sample and general values have been worked out.

2. Random non-repeated selection, in which the objects, selected by chance as in the previous method, are not returned to the general population and cannot enter the sample again. This is the most common way of organizing a sample; it is equivalent to selection from a large but limited population, which is taken into account when general indicators are determined from samples.

3. Mechanical selection, in which objects are selected from individual parts of the general population, these parts being marked out beforehand mechanically: by squares of the experimental field, by random groups of animals taken from different areas of the population, etc. Usually as many such parts are marked out as there are objects to be studied, so the number of parts is equal to the sample size. Mechanical selection is sometimes carried out by taking individuals at a fixed interval, for example, by passing animals through a chute and selecting every tenth, hundredth, etc., by taking a cut every 100 or 200 m, or by selecting one object out of every 10, 100, etc. specimens encountered when surveying the entire population.

4. Serial (cluster) selection, in which the general population is divided into parts (series), some of which are studied in their entirety. This method is used successfully when the objects under study are distributed fairly evenly in a certain volume or over a certain territory. For example, when studying the contamination of air or water with microorganisms, samples are taken and examined completely. In some cases, agricultural objects can also be surveyed by the cluster method. When studying the yield of meat and other processed products of a meat breed of livestock, the sample can include all animals of this breed that arrived at two or three meat processing plants. When studying egg size in collective farm poultry farming, this trait can be studied on the entire chicken population of several collective farms.
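Mechanical and serial selection can be sketched as follows. The random starting point in the mechanical scheme and the random choice of series are assumptions of this illustration, and the populations are hypothetical.

```python
import random

def mechanical_selection(population, step, rng):
    """Take every step-th object, starting from a random position (assumption)."""
    start = rng.randrange(step)
    return population[start::step]

def serial_selection(series, m, rng):
    """Choose m series at random and study every object in them."""
    chosen = rng.sample(range(len(series)), m)
    return [obj for i in sorted(chosen) for obj in series[i]]

rng = random.Random(3)

animals = list(range(100))  # hypothetical population of 100 numbered animals
every_tenth = mechanical_selection(animals, 10, rng)
print(every_tenth)

# Hypothetical population split into 5 series of 4 objects each.
series = [[(s, j) for j in range(4)] for s in range(5)]
print(serial_selection(series, 2, rng))
```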

Characteristics of group properties (μ, s etc.) obtained for the sample are called sample indicators.

Representativeness

Direct study of a group of selected objects provides, first of all, primary material and characteristics of the sample itself.

All sample data and summary indicators are important as primary facts revealed by the study and are subject to careful consideration, analysis and comparison with the results of other works. But this does not limit the process of extracting information inherent in the primary research materials.

The fact that objects were selected for the sample using special techniques and in sufficient quantity makes the results of the study of the sample indicative not only for the sample itself, but also for the entire population from which this sample was taken.

A sample, under certain conditions, becomes a more or less accurate reflection of the entire population. This property of a sample is called representativeness, which means that the sample represents the population with a certain accuracy and reliability.

Like any property, the representativeness of sample data can be expressed to a sufficient or insufficient extent. In the first case, reliable estimates of the general parameters are obtained in the sample, in the second - unreliable ones. It is important to remember that obtaining unreliable estimates does not detract from the value of sample indicators for characterizing the sample itself. Obtaining reliable estimates expands the scope of application of the achievements obtained in a sample study.

Population - the totality of all objects (units) about which a researcher intends to draw conclusions when studying a specific problem. The population consists of all objects that are subject to study. The composition of the population depends on the objectives of the study. Sometimes the general population is the entire population of a certain region (for example, when studying the attitude of potential voters towards a candidate); most often several criteria are specified that define the object of the study. For example, women 18-29 years old who use certain brands of hand cream at least once a week and have an income of at least $150 per family member.

Sample - a set of cases (subjects, objects, events, samples) selected from the general population by a certain procedure to participate in the study.


Sample size - the number of cases included in the sample population. For statistical reasons, it is recommended that the number of cases be at least 30-35.

Dependent and independent samples

When comparing two (or more) samples, an important parameter is their dependence. If it is possible to establish a homomorphic pair (that is, when one case from sample X corresponds to one and only one case from sample Y and vice versa) for each case in two samples (and this basis for the relationship is important for the trait being measured in the samples), such samples are called dependent. Examples of dependent samples: pairs of twins, two measurements of a trait before and after experimental influence, husbands and wives, etc.

If there is no such relationship between the samples, then these samples are considered independent, for example: men and women, psychologists and mathematicians.

Accordingly, dependent samples always have the same size, while the size of independent samples may differ.

Samples are compared using various statistical tests:

  • Student's t-test;
  • Wilcoxon T-test;
  • Mann-Whitney U test;
  • Sign test, etc.
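As an illustration of a test for dependent samples, here is a minimal sketch of the two-sided sign test. The "before" and "after" measurements are hypothetical, and dropping zero differences is a common convention assumed here rather than taken from the text.

```python
from math import comb

def sign_test_p_value(before, after):
    """Two-sided exact sign test for paired (dependent) samples."""
    # Zero differences carry no sign and are dropped (a common convention).
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    positive = sum(d > 0 for d in diffs)
    k = min(positive, n - positive)
    # Probability of k or fewer of the rarer sign if both signs are equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical paired measurements on the same eight subjects.
before = [10, 12, 9, 11, 13, 10, 12, 11]
after = [12, 14, 10, 13, 12, 13, 14, 12]

p_value = sign_test_p_value(before, after)
print(p_value)
```

Seven of the eight differences are positive here, yet the two-sided p-value stays above 0.05, which shows how little information the signs alone carry in a small sample.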

Representativeness

The sample may be considered representative or non-representative.

Example of a non-representative sample

In the United States, one of the best-known historical examples of a non-representative sample is the case of the 1936 presidential election. The Literary Digest magazine, which had successfully predicted the outcomes of several previous elections, got its forecast wrong after sending out ten million test ballots to its subscribers, to people selected from telephone books throughout the country, and to people on car registration lists. Among the 25% of ballots that were returned (almost 2.5 million), the votes were distributed as follows:

57% preferred the Republican candidate Alf Landon

40% chose the incumbent president, Democrat Franklin Roosevelt

In the actual election, as is known, Roosevelt won, gaining more than 60% of the votes. The Literary Digest's mistake was this: wanting to increase the representativeness of the sample, since they knew that most of their subscribers considered themselves Republicans, they expanded the sample to include people selected from telephone books and registration lists. However, they did not take into account the realities of their time and in fact recruited even more Republicans: during the Great Depression, it was mainly members of the middle and upper classes who could afford to own telephones and cars (that is, mostly Republicans rather than Democrats).

Types of design for constructing groups from samples

There are several main types of group-building designs:

  1. A study with experimental and control groups, which are placed in different conditions;
  2. A study with experimental and control groups using a pairwise selection strategy;
  3. A study using only one group - experimental;
  4. A study using a mixed (factorial) design - all groups are placed in different conditions.

Group Building Strategies

Groups for participation in a psychological experiment are selected using various strategies, whose purpose is to preserve internal and external validity as far as possible:

  1. Randomization (random selection);
  2. Pairwise selection;
  3. Stratometric selection;
  4. Approximate modeling;
  5. Attracting real groups.

Randomization

Randomization, or random sampling, is used to create simple random samples. The use of such a sample is based on the assumption that each member of the population is equally likely to be included in the sample. For example, to draw a random sample of 100 university students, you can put slips of paper with the names of all university students in a hat and then take 100 slips out of it; this will be a random selection.
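The "names in a hat" procedure can be sketched in Python with the standard library. The roster below is hypothetical (5000 generated names), and the seed is fixed only so that the sketch gives the same result on every run.

```python
import random

# Hypothetical roster; in practice this would be the full list of students.
students = [f"student_{i:04d}" for i in range(1, 5001)]

rng = random.Random(42)             # seeded only to make the sketch repeatable
sample = rng.sample(students, 100)  # draw 100 names without replacement
```

Here random.sample draws without replacement, so no student can appear twice, just as a slip of paper taken out of the hat cannot be drawn again.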

Pairwise selection

Pairwise selection is a strategy for constructing sample groups in which the groups are composed of subjects who are equivalent with respect to secondary parameters that are significant for the experiment. This strategy is effective for experiments with experimental and control groups; the best option is to involve pairs of twins (mono- and dizygotic), as this makes it possible to create ...
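One possible way to carry out pairwise selection is sketched below. It is an assumption made for illustration, not a procedure prescribed by the text: subjects are sorted by a single matching variable, neighbours are paired, and the members of each pair are split at random between the experimental and control groups. The function matched_pairs and the subject data are hypothetical.

```python
import random


def matched_pairs(subjects, key, rng):
    """Sort by the matching variable, pair neighbours, split each pair at random.

    With an odd number of subjects, the last one is left out.
    """
    ordered = sorted(subjects, key=key)
    experimental, control = [], []
    for i in range(0, len(ordered) - 1, 2):
        pair = [ordered[i], ordered[i + 1]]
        rng.shuffle(pair)  # random assignment within the pair
        experimental.append(pair[0])
        control.append(pair[1])
    return experimental, control


# Hypothetical subjects: (name, age); age is the secondary parameter to balance.
subjects = [("A", 19), ("B", 34), ("C", 26), ("D", 33), ("E", 20), ("F", 27)]
experimental, control = matched_pairs(
    subjects, key=lambda s: s[1], rng=random.Random(0))
```

Each subject in the experimental group then has a counterpart of nearly the same age in the control group, so the two groups are balanced on that parameter.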

Stratometric selection

Stratometric (stratified) selection is randomization with the allocation of strata (or clusters). With this method of sampling, the general population is divided into groups (strata) with certain characteristics (gender, age, political preferences, education, income level, etc.), and subjects with the corresponding characteristics are selected from each group.
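A minimal sketch of selection by strata is given below. It assumes proportional allocation, that is, the same fraction is drawn at random from every stratum; the text does not specify an allocation rule, so this is one choice among several. The function stratified_sample and the population are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, fraction, rng):
    """Draw the same fraction at random from every stratum."""
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)  # proportional allocation (assumed)
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 600 women and 400 men, identified by index.
population = [("F", i) for i in range(600)] + [("M", i) for i in range(400)]
sample = stratified_sample(population, lambda u: u[0], 0.1, random.Random(7))
```

With proportional allocation the sample reproduces the 60/40 composition of the population exactly, whereas a simple random sample of the same size would reproduce it only approximately.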

Approximate Modeling

Approximate modeling means drawing limited samples and generalizing the conclusions obtained from such a sample to a wider population. For example, when 2nd-year university students take part in a study, its findings are extended to “people aged 17 to 21 years”. The admissibility of such generalizations is extremely limited.

So, the patterns to which the random variable under study is subject are physically completely determined by the real set of conditions for its observation (or experiment), and are mathematically specified by the corresponding probability space or, what is the same, by the corresponding law of probability distribution. However, when conducting statistical research, another terminology associated with the concept of a general population turns out to be somewhat more convenient.

The general population is the totality of all conceivable observations (or all mentally possible objects of the type we are interested in, from which observations are “taken”) that could be made under a given real set of conditions. Since the definition deals with all mentally possible observations (or objects), the concept of a general population is a conditionally mathematical, abstract concept and should not be confused with real populations subject to statistical research. Thus, having examined even all enterprises of a sub-industry from the point of view of recording the values of the technical and economic indicators characterizing them, we can consider the surveyed population only as a representative of a hypothetically possible wider population of enterprises that could operate within the same real set of conditions.

In practical work, it is more convenient to associate the choice with the objects of observation rather than with the characteristics of these objects. We select machines, geological samples, people for study, but not the values of the characteristics of machines, samples, people. On the other hand, in mathematical theory, objects and the set of their characteristics do not differ and the duality of the introduced definition disappears.

As we see, the mathematical concept of “general population”, like the concepts of “probability space”, “random variable” and “probability distribution law”, is physically completely determined by the corresponding real set of conditions, and therefore all four of these mathematical concepts can be considered synonymous in a certain sense. A population is called finite or infinite depending on whether the collection of all conceivable observations is finite or infinite.

From the definition it follows that continuous populations (consisting of observations of characteristics of a continuous nature) are always infinite. Discrete general populations can be either infinite or finite. For example, if a batch of N products is analyzed for grade (see example in clause 4.1.3), where each product can be assigned to one of four grades, the random variable under study is the grade number of a product randomly drawn from the batch, and the set of possible values of the random variable accordingly consists of four points (1, 2, 3 and 4), then, obviously, the population will be finite (only N conceivable observations).

The concept of an infinite population is a mathematical abstraction, as is the idea that the measurement of a random variable can be repeated an infinite number of times. An infinite general population can be interpreted, approximately, as a limiting case of a finite one, when the number of objects generated by a given real set of conditions increases indefinitely. So, if in the example just given, instead of batches of products, we consider continuous mass production of the same products, then we will arrive at the concept of an infinite general population. In practice, such a modification is equivalent to the requirement

A sample from a given population consists of the results of a limited series of observations of a random variable. A sample can be considered a kind of empirical analogue of the general population, and it is what we most often deal with in practice, since surveying the entire general population can be either too labor-intensive (in the case of large N) or fundamentally impossible (in the case of infinite general populations).

The number of observations that form a sample is called the sample size.

If the sample size is large and we are dealing with a one-dimensional continuous variable (or with a one-dimensional discrete variable whose number of possible values is fairly large, say more than 10), then it is often more convenient, for simplifying further statistical processing of the observations, to move on to so-called "grouped" sample data. This transition is usually carried out as follows:

a) the smallest and largest values in the sample are noted;

b) the entire surveyed range is divided into a certain number s of equal grouping intervals; the number of intervals s should be no fewer than 8-10 and no more than 20-25, and its choice depends significantly on the sample size; for a rough guide to choosing s, you can use the approximate formula

which should be taken rather as a lower estimate for s (especially for large n);

c) the extreme points of each of the intervals are marked in ascending order, as well as their midpoints;

d) the number of sample data falling into each of the intervals is counted (obviously, these counts sum to the sample size n); sample data that fall on the boundaries of the intervals are either distributed evenly over the two adjacent intervals or, by agreement, assigned to only one of them, for example, the left one.

Depending on the specific content of the problem, some modifications may be made to this grouping scheme (for example, in some cases it is advisable to abandon the requirement of equal lengths of grouping intervals).
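Steps a) to d) can be sketched in Python as follows. Two points are assumptions rather than part of the text: the default number of intervals uses Sturges' rule, s = 1 + floor(log2 n), a common rule of thumb that is not necessarily the formula the text intends, and boundary values are always assigned to the left interval. The function name group_sample and the data are hypothetical, and the sample is kept tiny for readability, so s falls below the 8-10 intervals recommended above.

```python
import math


def group_sample(data, s=None):
    """Return (edges, midpoints, counts) for s equal-width grouping intervals."""
    n = len(data)
    if s is None:
        # Assumption: Sturges' rule as the rule of thumb for s.
        s = 1 + math.floor(math.log2(n))
    lo, hi = min(data), max(data)                    # step a)
    if lo == hi:
        raise ValueError("all values are equal; nothing to group")
    width = (hi - lo) / s                            # step b)
    edges = [lo + k * width for k in range(s + 1)]   # step c): interval ends
    midpoints = [(edges[k] + edges[k + 1]) / 2 for k in range(s)]
    counts = [0] * s                                 # step d)
    for x in data:
        # A value on an interior boundary goes to the left interval.
        k = math.ceil((x - lo) / width) - 1
        counts[min(max(k, 0), s - 1)] += 1
    return edges, midpoints, counts


# Hypothetical sample of n = 10 observations.
data = [1, 2, 2, 3, 4, 5, 5, 6, 7, 9]
edges, midpoints, counts = group_sample(data)
```

For this sample the rule gives s = 4 intervals of width 2, and the counts add up to the sample size, as step d) requires. Passing s explicitly overrides the rule of thumb, which is advisable for real samples given that the text treats such a formula as a lower estimate.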

In all further arguments using sample data, we will proceed from the notation just described.

Let us recall that the essence of statistical methods is to use a certain part of the general population (i.e., a sample) to make judgments about its properties as a whole.

One of the most important issues, on whose successful resolution the reliability of the conclusions drawn from statistical data processing depends, is the representativeness of the sample, i.e. the question of how completely and adequately it represents those properties of the analyzed general population that interest us. In practical work, the same group of objects taken for study can be considered as a sample from different general populations. Thus, a group of families randomly selected from the cooperative houses of one of the housing maintenance offices (ZhEK) of one of the city districts for a detailed sociological survey can be considered as a sample from the general population of families (with a cooperative form of housing) of this ZhEK, as a sample from the general population of families of the given district, as a sample from the general population of all families in the city, and, finally, as a sample from the general population of all families in the city living in cooperative houses. The meaningful interpretation of the survey results depends significantly on which general population we regard the selected group of families as representing, that is, for which general population this sample can be considered representative. The answer to this question depends on many factors. In the above example, in particular, it depends on the presence or absence of a special (perhaps hidden) factor that determines whether a family belongs to the given housing office or to the district as a whole (such a factor could be, for example, the family's average per capita income, the geographic location of the district in the city, the “age” of the district, etc.).