Entropy. Maximum entropy principle

Information theory

At the origins of information theory stands Claude Shannon, who in 1947-48 worked on the efficiency of communication systems. From this work the goal of the theory was formulated: to increase the capacity of the communication channel. An effective system is one that, other conditions and costs being equal, transmits more information. The analysis typically considers two objects: the source of information and the channel over which the information is transmitted.

So, there are some events, and information about them, in symbolic form, as a signal, is transmitted over a communication channel. A channel can be considered good if it meets two conditions: first, information is transmitted through it at high speed, and second, interference affecting the transmission degrades the information only slightly. To find the conditions for such transmission, it is necessary to introduce some information characteristics.

The basic principles of information theory are most clearly manifested with a discrete source and a discrete channel, so we begin our acquaintance with the topic under this assumption.

1.1 Quantitative measure of information.

First, let's figure out what makes sense to transmit over the channel.

If the recipient knows in advance what will be transmitted, there is obviously no need to transmit it. It makes sense to convey only what is unexpected: the greater the surprise, the more information the event carries. For example, suppose you work at a computer. A message that today's work must be completed in 45 minutes, according to the schedule, is unlikely to be news to you; that was perfectly clear even before the announcement. Such a message contains zero information, and there is no point in passing it on. Now another example. The message is this: in an hour your boss will give you a plane ticket to Moscow and back, and will also allocate a sum of money for entertainment. Such information is unexpected for you and therefore contains a large amount of information. These are the kinds of messages that make sense to send through the channel. The conclusion is simple: the more surprise a message carries, the more information it contains.

Surprise is characterized by probability, which is included in the information measure.

A few more examples. We have two boxes, one with white balls and the other with black balls. How much information is contained in a message saying which box holds the white balls? The probability that any given box contains the white balls is 0.5. Let us call this the probability before the experiment, or the a priori probability.

Now we take out one ball. Whichever ball we draw, after such an experiment we will know with certainty which box contains the white balls, so the corresponding probability is equal to 1. This probability is called the post-experimental, or a posteriori, probability.

Let us look at this example from the point of view of the amount of information. We have a source of information: the boxes with balls. Initially the uncertainty about the balls was characterized by a probability of 0.5. Then the source "spoke" and gave out information: we pulled out a ball, and everything became determined with probability 1. It is logical to take as a quantitative measure of information the degree to which the uncertainty about an event is reduced as a result of the experiment. In our example this ratio is 1/0.5 = 2.

Now a more complex example. It is known that a part's size can be 120, 121, 122, ..., 180 mm, that is, it takes one of 61 values. The a priori probability that the part size is i mm is 1/61.

We have a very imperfect measuring instrument that allows us to measure the part with an accuracy of ±5 mm. The measurement gave 130 mm, but in fact the size could be 125, 126, ..., 135 mm: 11 values in all. After the experiment some uncertainty remains, characterized by the a posteriori probability 1/11. The degree of uncertainty reduction is (1/11):(1/61). As above, this ratio gives the amount of information.

The logarithmic function is most convenient for expressing the amount of information, and the base of the logarithm is taken to be two. Let us denote the amount of information by $I$, the a priori probability by $p_{apr}$, and the a posteriori probability by $p_{aps}$. Then

$$ I = \log_2 \frac{p_{aps}}{p_{apr}} . \qquad (1) $$

In the first example $I = \log_2 \frac{1}{0.5} = 1$ bit of information; in the second $I = \log_2 \frac{1/11}{1/61} \approx 2.47$ bits. A bit is one binary unit of information.
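As a quick numerical check, formula (1) can be evaluated for both examples in a few lines of Python (a minimal sketch; the probabilities are those given above):

```python
from math import log2

def information(p_prior: float, p_posterior: float) -> float:
    """Amount of information by formula (1): I = log2(p_posterior / p_prior)."""
    return log2(p_posterior / p_prior)

# Two boxes of balls: a priori 0.5, a posteriori 1.
print(information(0.5, 1.0))        # -> 1.0 bit

# Part size: a priori 1/61, a posteriori 1/11.
print(information(1 / 61, 1 / 11))  # -> about 2.47 bits
```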

Now let us turn to a real source of information, which is a set of independent events (messages) $x_1, x_2, \ldots, x_N$ with different a priori probabilities $p(x_1), p(x_2), \ldots, p(x_N)$. This set represents data about the parameters of an object, that is, information about it. Usually, after the source issues a message, it becomes reliably known which parameter was issued, so the a posteriori probability equals 1. The amount of information contained in each event is therefore

$$ I(x_i) = \log_2 \frac{1}{p(x_i)} . \qquad (2) $$

This value is always greater than zero, and there are as many such values as there are events, which is not very convenient for characterizing the source. Therefore, the concept of entropy is introduced. Entropy is the average amount of information per event (message) of the source. It is found according to the rule for determining a mathematical expectation:

$$ H(X) = \sum_{i=1}^{N} p(x_i)\, I(x_i) , \qquad (3) $$

or, using the properties of the logarithmic function,

$$ H(X) = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i) . \qquad (4) $$
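Formula (4) is easy to compute directly; here is a minimal Python sketch (the probability sets are illustrative):

```python
from math import log2

def entropy(probs):
    """Entropy by formula (4): H = -sum p_i * log2 p_i (terms with p_i = 0 contribute nothing)."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                 # 1.0 bit/message
print(entropy([0.9, 0.1]))                 # about 0.469 bit/message
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits/message
```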

The dimension of entropy is bits/message. Let us dwell on the properties of entropy, starting with an example. Suppose there is a binary source of information with a priori probabilities of events $p(x_1)$ and $p(x_2)$ that form a complete group. From this follows the connection between them: $p(x_1) + p(x_2) = 1$. Let us find the entropy of the source:

$$ H(X) = -p(x_1)\log_2 p(x_1) - \bigl(1 - p(x_1)\bigr)\log_2\bigl(1 - p(x_1)\bigr) . \qquad (5) $$

It is not difficult to see that if one of the probabilities is equal to zero, then the other is equal to 1, and the entropy expression gives zero.

Let us plot the dependence of the entropy on $p(x_1)$ (Fig. 1).

Note that the entropy is maximum at a probability of 0.5 and is never negative.

The first property of entropy. Entropy is maximum for equally probable events of the source. In our binary example this maximum value is 1 bit. If the source is not binary and contains $N$ messages, then the maximum entropy is

$$ H_{max} = \log_2 N . \qquad (6) $$
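A small sketch illustrating the first property (the probability values below are made up): the binary entropy peaks at 0.5, and for N equiprobable messages the maximum is log2 N.

```python
from math import log2

def binary_entropy(p):
    """Entropy of a binary source with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 1.0):
    print(p, round(binary_entropy(p), 3))   # the maximum 1.0 bit is reached at p = 0.5

N = 8
print(log2(N))   # maximum entropy of a source with 8 equiprobable messages: 3 bits
```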

The second property of entropy. If the probability of one source message is 1 and the probabilities of the others (which together form a complete group of events) are zero, then the entropy is zero. Such a source generates no information.

The third property of entropy is the entropy addition theorem. Let us consider this question in more detail. Suppose there are two sources of information represented by the ensembles of messages $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, \ldots, y_m\}$.

Each source has its own entropy, $H(X)$ and $H(Y)$. Now these sources are combined, and it is required to find the entropy of the combined ensemble, $H(X, Y)$. Every pair of messages $x_i$ and $y_j$ occurs with probability $p(x_i, y_j)$. The amount of information in such a pair is

$$ I(x_i, y_j) = \log_2 \frac{1}{p(x_i, y_j)} . $$

Proceeding in the usual manner, we find the average amount of information per pair of messages of the ensemble; this will be the entropy. However, two cases are possible: the combined ensembles may be statistically independent or dependent.

Consider first the case of independent ensembles, when the appearance of a message $y_j$ is in no way determined by the appearance of a message $x_i$. Let us write down the expression for the entropy:

$$ H(X, Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \log_2 \frac{1}{p(x_i, y_j)} , \qquad (7) $$

where $n$ and $m$ are the numbers of messages in the ensembles $X$ and $Y$.

Since for independent ensembles the two-dimensional probability factors as $p(x_i, y_j) = p(x_i)\, p(y_j)$, from the previous general formula we obtain

$$ H(X, Y) = H(X) + H(Y) , \qquad (8) $$

where $H(X)$ and $H(Y)$ are determined by the known formulas.

Next we consider a more complex case. Suppose the message ensembles are in statistical connection, that is, the appearance of $x_i$ suggests, with some probability, the appearance of $y_j$. This fact is characterized by the conditional probability $p(y_j \mid x_i)$; the bar in the notation denotes the condition. With conditional probabilities, the two-dimensional probability can be expressed through the product of one-dimensional ones:

$$ p(x_i, y_j) = p(x_i)\, p(y_j \mid x_i) . \qquad (9) $$

Taking this into account, let us find an expression for the entropy. The transformation goes like this:

$$ H(X, Y) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i)\, p(y_j \mid x_i) \bigl[ \log_2 p(x_i) + \log_2 p(y_j \mid x_i) \bigr] . \qquad (10) $$

Given that the sum of the conditional probabilities $p(y_j \mid x_i)$ over $j$ is equal to 1, the first double sum in the last expression gives the entropy of the source $X$, namely $H(X)$.

The second double sum is called the conditional entropy and is denoted $H(Y \mid X)$. Thus,

$$ H(X, Y) = H(X) + H(Y \mid X) . \qquad (11) $$

In a similar way it can be shown that $H(X, Y) = H(Y) + H(X \mid Y)$.

In the last expressions we encountered the conditional entropy, which is determined by the connection between the combined ensembles of messages. If the ensembles are statistically independent, then $p(y_j \mid x_i) = p(y_j)$ and the conditional entropy $H(Y \mid X) = H(Y)$. As a result, we obtain the familiar formula $H(X, Y) = H(X) + H(Y)$.

If the messages are absolutely dependent, that is, in a functional connection, $p(y_j \mid x_i)$ takes one of two values: either 1 (when $y_j$ is uniquely determined by $x_i$) or 0 (when it is not). The conditional entropy is then equal to 0, since the second ensemble of messages has no surprise and therefore carries no information.
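The additivity relations can be checked numerically on a small joint distribution; the following sketch uses illustrative probabilities, not data from the text:

```python
from math import log2

# Illustrative joint distribution p(x, y) of two dependent binary ensembles.
p_xy = {('x1', 'y1'): 0.4, ('x1', 'y2'): 0.1,
        ('x2', 'y1'): 0.2, ('x2', 'y2'): 0.3}

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Marginal p(x) obtained by summing the joint distribution over y.
p_x = {'x1': 0.5, 'x2': 0.5}

H_xy = H(p_xy.values())
H_x = H(p_x.values())
# Conditional entropy H(Y|X) = sum_x p(x) * H(Y | X = x).
H_y_given_x = sum(
    p_x[x] * H([p_xy[(x, y)] / p_x[x] for y in ('y1', 'y2')])
    for x in ('x1', 'x2')
)
print(round(H_xy, 4), round(H_x + H_y_given_x, 4))  # both about 1.8464: H(X,Y) = H(X) + H(Y|X)
```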

Having introduced entropy and its properties, let us return to a single source of information. Keep in mind that any source of information operates in time: its symbols (signs) occupy definite places in a sequence. A source of information is called stationary if the probability of a symbol does not depend on its place in the sequence. One more definition: the symbols of a source can have a statistical (probabilistic) relationship with each other. An ergodic source of information is one in which the statistical relationship between signs extends to a finite number of previous symbols. If this connection covers only two neighboring signs, the source is called a simply connected (first-order) Markov chain. This is the source we will now consider. The symbol-generation scheme of such a source is shown in Fig. 2.

The appearance of a symbol $x_j$ depends on which symbol $x_i$ was issued by the source at the previous moment. This dependence is described by the conditional probability $p(x_j \mid x_i)$. Let us find the entropy of such a source, proceeding from the general understanding of entropy as the mathematical expectation of the amount of information. Suppose two symbols are issued as shown in Fig. 2. The amount of information given by the source in this situation is

$$ I(x_j \mid x_i) = \log_2 \frac{1}{p(x_j \mid x_i)} . \qquad (12) $$

Averaging this amount over all possible subsequent symbols, we obtain the partial entropy, under the condition that the previous symbol was always $x_i$:

$$ H(X \mid x_i) = \sum_{j=1}^{N} p(x_j \mid x_i) \log_2 \frac{1}{p(x_j \mid x_i)} . \qquad (13) $$

Averaging this partial entropy once more over all previous symbols, we obtain the final result:

$$ H_2(X) = \sum_{i=1}^{N} p(x_i) \sum_{j=1}^{N} p(x_j \mid x_i) \log_2 \frac{1}{p(x_j \mid x_i)} . \qquad (14) $$

The index 2 in the entropy notation indicates that the statistical relationship extends only to two adjacent symbols.
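A sketch of formula (14) for a binary source with a simple Markov dependence (the transition probabilities and the stationary distribution below are made-up illustrative values):

```python
from math import log2

# Illustrative transition probabilities p(x_j | x_i) of a binary Markov source.
P = {('a', 'a'): 0.9, ('a', 'b'): 0.1,
     ('b', 'a'): 0.3, ('b', 'b'): 0.7}
# Stationary probabilities p(x_i); for these transitions pi = (0.75, 0.25) satisfies pi = pi * P.
pi = {'a': 0.75, 'b': 0.25}

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Formula (14): H2 = sum_i p(x_i) * sum_j p(x_j | x_i) * log2(1 / p(x_j | x_i)).
H2 = sum(pi[i] * H([P[(i, j)] for j in ('a', 'b')]) for i in ('a', 'b'))
H1 = H(pi.values())                # entropy ignoring the symbol-to-symbol connection
print(round(H2, 3), round(H1, 3))  # about 0.572 < 0.811: connections lower the entropy
```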

Let us dwell on the properties of the entropy of an ergodic source.

When the symbols of the source are independent, $p(x_j \mid x_i) = p(x_j)$, and formula (14) simplifies and reduces to the usual form (4).

The presence of statistical (probabilistic) connections between source symbols always leads to a decrease in entropy: $H_2(X) \le H(X)$.

So, a source of information has maximum entropy if two conditions are met: all symbols of the source are equally probable (entropy property) and there are no statistical connections between the symbols of the source.

To show how efficiently the source symbols are used, the redundancy parameter $\rho$ is introduced:

$$ \rho = 1 - \frac{H(X)}{H_{max}(X)} . \qquad (15) $$

The value of $\rho$ lies in the range from 0 to 1.

Attitudes toward this parameter are twofold. On the one hand, the lower the redundancy, the more efficiently the source operates. On the other hand, the greater the redundancy, the less interference and noise affect the delivery of information from such a source to the consumer. For example, the presence of statistical relationships between symbols increases redundancy but at the same time increases transmission fidelity: individual missing characters can be predicted and restored.

Let us look at an example. The source is the letters of the Russian alphabet, 32 in all. The maximum entropy is $H_{max} = \log_2 32 = 5$ bits/message.

Since there is a statistical relationship between the letters and the probabilities of their appearance in text are far from equal, the real entropy is about 3 bits/message. Hence the redundancy is $\rho = 1 - 3/5 = 0.4$.

The next characteristic of the source is its productivity; it characterizes the rate at which the source generates information. Suppose each letter of the source is issued over a certain time $\tau_i$. Averaging these times, we find the average time for issuing one message, $\bar{\tau}$. The average amount of information produced by the source per unit of time is the source productivity $H'(X)$:

$$ H'(X) = \frac{H(X)}{\bar{\tau}} . \qquad (16) $$
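The alphabet example and formula (16) in a few lines of Python; the average issuing time is an assumed illustrative value:

```python
from math import log2

H_max = log2(32)         # maximum entropy of a 32-letter alphabet: 5 bits/message
H_real = 3.0             # real entropy quoted in the text, bits/message
rho = 1 - H_real / H_max
print(rho)               # redundancy 0.4

tau_avg = 0.01           # assumed average time per message, seconds (illustrative)
print(H_real / tau_avg)  # productivity by formula (16): 300 bits per second
```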

So, let us summarize. The characteristics of an ergodic source of information are as follows:
- the amount of information in each symbol;
- the entropy;
- the redundancy;
- the productivity.

It should be noted that the strength of the introduced measure of the amount of information, and hence of all the derived characteristics, is its universality: all the concepts introduced above are applicable to any kind of information, whether sociological, technical, and so on. The weak side of the measure is that it does not reflect the significance or value of the information. A message about winning a pen and a message about winning a car in a lottery may carry the same amount of information, although their value to the recipient is quite different.

1.2. Information characteristics of the channel

Let us remember that information is transmitted through a communication channel. We previously introduced the information characteristics of the information source, and now we will introduce the information characteristics of the channel. Let's imagine the situation as shown in Fig. 1.

Fig. 1

At the channel input there is an input alphabet consisting of the set of symbols $X = \{x_i\},\ i = 1, \ldots, n$, and at the output an alphabet $Y = \{y_j\},\ j = 1, \ldots, m$.

Let us represent the communication channel by a mathematical model. The best-known representation of a discrete channel is a graph. The graph nodes are the transmitted ($x_i$) and received ($y_j$) letters of the alphabets; the edges reflect the possible connections between these letters (Fig. 2).

Relationships between the letters of the alphabets are usually described by conditional probabilities, for example $p(y_i \mid x_i)$, the probability of receiving $y_i$ provided that $x_i$ was transmitted. This is the probability of correct reception. In the same way, one can introduce conditional probabilities of erroneous reception, for example $p(y_j \mid x_i)$ with $j \ne i$. The reason these probabilities are non-zero is interference, from which no real channel is free. Note that $n$ and $m$, the numbers of symbols (letters) in the transmitted and received alphabets, are not necessarily equal. Based on this model, further definitions are introduced.

A symmetric channel is a channel in which the probabilities of correct reception are equal for all symbols, and the probabilities of erroneous reception are also all equal. For such a channel, the conditional probability can be written as

$$ p(y_j \mid x_i) = \begin{cases} 1 - p_{er}, & j = i, \\ \dfrac{p_{er}}{m - 1}, & j \ne i. \end{cases} $$

Here $p_{er}$ is the probability of erroneous reception. If this probability does not depend on which symbols were transmitted before the given one, such a channel is called a "channel without memory". As an example, Fig. 3 below shows the graph of a binary symmetric channel without memory.

Fig. 3

Let us further assume that the alphabet at the channel output contains an additional symbol, which appears when the receiver's decoder cannot recognize the transmitted symbol. In this case it refuses to make a decision; this outcome is called an erasure. Such a channel is called a channel without memory with erasure, and its graph is shown in Fig. 4. The "erasure" position is indicated there by a question mark.

Fig. 4

The simplest channel with memory is the Markov channel. In it, the probability of error depends on whether the previous symbol was received correctly or erroneously.

Along with the graph, there is another description of the communication channel: the channel matrix. This is the set of conditional probabilities $p(y_j \mid x_i)$ or $p(x_i \mid y_j)$. Together with the a priori probabilities $p(x_i)$ and $p(y_j)$, it gives a complete picture of the statistics of a noisy channel. The channel matrix has the form

$$ P(Y \mid X) = \begin{pmatrix} p(y_1 \mid x_1) & p(y_2 \mid x_1) & \cdots & p(y_m \mid x_1) \\ p(y_1 \mid x_2) & p(y_2 \mid x_2) & \cdots & p(y_m \mid x_2) \\ \vdots & \vdots & \ddots & \vdots \\ p(y_1 \mid x_n) & p(y_2 \mid x_n) & \cdots & p(y_m \mid x_n) \end{pmatrix} . $$
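As an illustration, the channel matrices of the binary symmetric channel and of the channel with erasure can be written out directly; the error and erasure probabilities below are assumed values:

```python
# Binary symmetric channel without memory: rows are transmitted symbols x_i,
# columns are received symbols y_j, entries are p(y_j | x_i).
p_err = 0.1                      # assumed probability of erroneous reception
bsc = [[1 - p_err, p_err],
       [p_err, 1 - p_err]]

# Binary channel with erasure: the third column is the "?" (erasure) outcome.
p_erase = 0.05                   # assumed erasure probability
bec = [[1 - p_err - p_erase, p_err, p_erase],
       [p_err, 1 - p_err - p_erase, p_erase]]

for row in bsc + bec:
    assert abs(sum(row) - 1.0) < 1e-12   # each row of a channel matrix sums to 1
```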

For a source with dependent messages, entropy is also calculated as the mathematical expectation of the amount of information per element of these messages. The amount of information and entropy are logarithmic measures and are measured in the same units.


6. The entropy of combined statistically independent sources of information is equal to the sum of their entropies.

7. Entropy characterizes the average uncertainty of choosing one state from the ensemble, completely ignoring the substantive side of the ensemble.

ECOSYSTEM ENTROPY is a measure of the disorder of an ecosystem, or of the amount of energy unavailable for use. The higher the entropy index, the less stable the ecosystem is in time and space.

4.1.2. Entropy and performance of a discrete message source

Any of these messages describes the state of some physical system. We see that the degree of uncertainty of a physical system is determined not only by the number of its possible states, but also by the probabilities of the states. As a measure of a priori uncertainty of a system (or a discontinuous random variable), information theory uses a special characteristic called entropy.

Entropy, as we will see later, has a number of properties that justify its choice as a characteristic of the degree of uncertainty. Finally, and this is the most important thing, it has the property of additivity, that is, when several independent systems are combined into one, their entropies add up. If the number 10 is chosen as the base, then we talk about “decimal units” of entropy, if 2 – about “binary units”.

Let us prove that the entropy of a system with a finite set of states reaches a maximum when all states are equally probable. Example 3. Determine the maximum possible entropy of a system consisting of three elements, each of which can be in four possible states. (By the additivity property, $H_{max} = 3\log_2 4 = 6$ binary units.)

It should be noted that the entropy value obtained in this case will be less than for a source of independent messages. This follows from the fact that in the presence of message dependence, the uncertainty of choice decreases and, accordingly, the entropy decreases. Let's determine the entropy of the binary source. The graph of dependence (4.4) is presented in Fig. 4.1. As follows from the graph, the entropy of a binary source varies from zero to one.

Basic properties of entropy

It is usually noted that entropy characterizes a given probability distribution in terms of the degree of uncertainty in the outcome of a test, i.e., the uncertainty in the choice of a particular message. Indeed, it is easy to verify that entropy is zero if and only if one of the probabilities is equal to one and all others are equal to zero; this means complete certainty of choice.

Another visual interpretation of the concept of entropy is possible as a measure of the “diversity” of messages created by a source. It is easy to see that the above properties of entropy are quite consistent with the intuitive idea of ​​the measure of diversity. It is also natural to assume that the more diverse the possibilities for choosing this element are, the greater the amount of information contained in a message element.

An expression representing the mathematical expectation of the amount of information in the selected element for a source located in the ith state can be called the entropy of this state. The source entropy per message element defined above depends on how messages are divided into elements, i.e., on the choice of alphabet. However, entropy has the important property of additivity.

Let us note some properties of entropy. Entropy is perhaps one of the most difficult concepts to understand that one encounters in a physics course, at least as far as classical physics is concerned.

For example, if you ask me where I live, and I answer: in Russia, then my entropy for you will be high, after all, Russia is a big country. If I tell you my zip code: 603081, then my entropy for you will decrease because you will receive more information.

The entropy of your knowledge of me has decreased by approximately 6 characters. What if I told you that the sum is 59? There are only 10 possible microstates for this macrostate, so its entropy is only one symbol. As you can see, different macrostates have different entropies. We measure entropy as the number of symbols needed to write the number of microstates.

In other words, entropy is how we describe a system. For example, if we heat a gas a little, then the speed of its particles will increase, therefore, the degree of our ignorance about this speed will increase, that is, entropy will increase. Or, if we increase the volume of gas by retracting the piston, our ignorance of the position of the particles will increase, and the entropy will also increase.

On the one hand, this expands the possibilities of using entropy in the analysis of a wide variety of phenomena; on the other hand, it requires a certain additional assessment of the emerging situations. That is the first point. Secondly, the Universe is not an ordinary finite object with boundaries: it is infinity itself in time and space.

Any message we deal with in information theory is a collection of information about some physical system. Obviously, if the state of the physical system were known in advance, there would be no point in transmitting the message.

Obviously, the information obtained about the system will, generally speaking, be more valuable and meaningful, the greater the uncertainty of the system before receiving this information (“a priori”). To answer this question, let's compare two systems, each of which has some uncertainty.

However, in general this is not the case. Consider, for example, a technical device that can be in two states: 1) operational and 2) faulty. We emphasize that to describe the degree of uncertainty of the system, it is completely unimportant which values are written in the top row of the table; only the number of these values and their probabilities are important. The concept of entropy is fundamental in information theory.

The average amount of this information is called entropy. Suppose some message is composed of elements of an alphabet, each appearing with a certain probability. The quantity defined in this way is called the entropy of the message source. Entropy is maximum if all states of the message elements are equally probable. In information theory it is also proven that the presence of probabilistic connections always reduces the entropy of the message source.

Prologue 113. The meaning of the principle of maximum entropy

Power-law distributions can arise as a result of the principle of maximum entropy: we saw this in Prologue 111, and in Prologue 112 we described a multiplicative collision model built on this basis, which develops a power-law distribution over a certain set of objects.

However, in order to adequately apply this model to explain the origin of power-law distributions that are observed in various natural and human systems, it is necessary to take a close look at its two foundations - the principle of maximum entropy and the multiplicativity of interactions. We will try to think about their “philosophical” meaning. Let's start in order, with the principle of maximum entropy.

Two interpretations of the principle of maximum entropy

In this interpretation, the principle of maximum entropy obviously echoes the second law of thermodynamics, the fundamental law of physics according to which the entropy of a closed system can either increase or remain unchanged, but not decrease. It follows directly from this that if we take any closed system that has remained closed for a sufficiently long time, we will find it in a state of maximum entropy.

However, historically the principle of maximum entropy traces its origins to a completely different source: not thermodynamics, but probability theory. And it is this source that gives the second, probably more fundamental, interpretation of the principle of maximum entropy. It can be formulated like this: of all hypotheses about the shape of the distribution of a random variable, one should choose the one for which the entropy of the distribution is maximal, taking into account the restrictions imposed by our knowledge of the system.
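As a sketch of this rule, consider the dice example associated with Jaynes: among all distributions over the faces of a die with a prescribed mean of 4.5, the maximum-entropy distribution has the exponential form p_i proportional to exp(lambda*i). The numbers and the bisection search below are purely illustrative.

```python
from math import exp, log2

faces = [1, 2, 3, 4, 5, 6]
target_mean = 4.5          # the only thing we "know" about the die

def maxent_dist(lam):
    """Maximum-entropy distribution under a mean constraint: p_i proportional to exp(lam * i)."""
    w = [exp(lam * i) for i in faces]
    z = sum(w)
    return [wi / z for wi in w]

def mean(p):
    return sum(i * pi for i, pi in zip(faces, p))

# Bisection on lambda: the mean grows monotonically with lambda.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean(maxent_dist(mid)) < target_mean:
        lo = mid
    else:
        hi = mid

p = maxent_dist(lo)
print([round(pi, 3) for pi in p])                 # skewed toward the high faces
print(round(mean(p), 3))                          # 4.5, as required
print(round(-sum(pi * log2(pi) for pi in p), 3))  # entropy of the chosen distribution
```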

At the beginning of the 18th century, Jacob Bernoulli, reflecting on the foundations of probability theory, formulated the "principle of insufficient reason," which is considered the forerunner of the principle of maximum entropy. Consider two alternative and mutually exclusive outcomes A and B. Bernoulli's principle states that if we have no information about the probabilities of these outcomes, they should be assumed equally probable: under these conditions we have insufficient reason to assign one of the outcomes a higher probability than the other. Note that from Bernoulli's point of view, probabilities reflect our knowledge about the subject. If we have no knowledge about it (except that two outcomes are possible), the probabilities should be assumed equal. Any other probability distribution must have a basis, a reason grounded in our knowledge of the laws governing the subject.

So, each outcome should be assumed equally probable unless there is reason to make a different choice. If the different outcomes are different values of some quantity, we must assume a uniform probability distribution. As we know, it is the uniform distribution that has the maximum entropy. But Bernoulli did not talk about entropy: he lived and worked two centuries before this concept appeared. To get from the principle of insufficient reason to the principle of maximum entropy, many steps had to be taken, and this path was completed only in the middle of the 20th century; the last steps are associated with the work of the American physicist Edwin Jaynes.

From the principle of insufficient reason to the principle of maximum entropy

However, we, armed with modern concepts, can travel this path much faster, directly. It seems very simple - but only from the height of our current knowledge. And yet, it was Bernoulli who could become the discoverer of both the principle of maximum entropy and the entropy/information calculus itself. He could have, if he had believed a little more in the descriptive ability of numbers - and he certainly believed in it, since it was not for nothing that he became one of the founders of the theory of probability.

So, when we have two alternative outcomes A and B, and nothing else is known, the principle of insufficient reason requires us to assume that they are equally probable: $p_A = p_B = 1/2$. This is how we introduce a minimum of bias into our assumptions about the likelihood of the outcomes. Let us assume that there is some function of these probabilities, $H(p_A, p_B)$, which turns out to be maximal when $p_A = p_B = 1/2$ (or we could accept that under these conditions it is, on the contrary, minimal; this is not important). Let us denote this value by $H(1/2, 1/2)$. Can we say anything more about this function based on general considerations?

Quite a lot, and Jacob Bernoulli was a master of such things. First, note that if we have only one possible outcome A, it automatically has probability one. This means there is no additional knowledge we could bring in that would influence our assessment of the probability of the outcome; that is, we have absolutely complete knowledge of the outcome. In this case it is reasonable to expect that our function, which reflects the amount of knowledge we brought into the assessment of the outcomes, takes a minimum value, say zero: $H(1) = 0$.

Further, we note that when a situation of two equally probable alternatives is resolved one way or the other, we find ourselves in a situation with one possible outcome, the one selected by chance. What happens to the function H at this moment? It decreases from the value $H(1/2, 1/2)$ to the value $H(1) = 0$. It is reasonable to regard this difference, $H(1/2, 1/2) - H(1) = H(1/2, 1/2)$, as the amount of knowledge we acquire about the two equally probable outcomes when the alternative is resolved. Or, in other words, as the amount of not-knowing, or uncertainty, in the initial situation with two equally probable outcomes. In modern language this quantity is called entropy.

Now let it be known that there can be four outcomes A, B, C, D and nothing more. The principle of insufficient reason requires that we assign them equal probabilities $p_A = p_B = p_C = p_D = 1/4$. But what is the value of the function $H(p_A, p_B, p_C, p_D)$ in this case? Elementary logic leads to the conclusion that it should be twice as large as in the case of two equally probable outcomes: $2\,H(1/2, 1/2)$. Indeed, let the outcomes A and B on the one hand, and C and D on the other, be very similar. If we are not very attentive or not very sharp-sighted, we may not distinguish them from each other. Then we return to the case with two outcomes, and the uncertainty of the situation is $H(1/2, 1/2)$. But we look more closely and see that where we saw one outcome there are actually two close ones. We again face the task of choosing the most "fair" probability distribution between them, and it will again be the uniform distribution, adding another $H(1/2, 1/2)$ to the uncertainty. So for a situation with four equally probable alternatives $H(1/4, 1/4, 1/4, 1/4) = 2\,H(1/2, 1/2)$. Continuing by induction, we would establish that for a situation with eight outcomes the amount of uncertainty is $3\,H(1/2, 1/2)$, and so on.

I believe the reader understands that our derivation of the properties of the function H coincides with the logic leading to Hartley's information/entropy formula. If we denote the number of equally probable outcomes by $N$, the Hartley entropy is

$$ H = \log_2 N . $$
We have already seen the simple path that leads from Hartley's formula to Shannon's formula; Jacob Bernoulli would have discovered it easily. And if Bernoulli had had this formula at his disposal, he could have quantified the degree of uncertainty of a given probability distribution and established the principle according to which we should assign probabilities to outcomes so that the entropy of the distribution is the maximum of all admissible ones: this is the principle of maximum entropy.
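The inductive argument is easy to check numerically: for N equally probable outcomes the Shannon formula reduces to Hartley's log2 N, and doubling the number of outcomes adds exactly H(1/2, 1/2) = 1 to the uncertainty. A minimal sketch:

```python
from math import log2

def shannon(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

for N in (2, 4, 8, 16):
    uniform = [1 / N] * N
    print(N, shannon(uniform), log2(N))  # Shannon entropy of N equal outcomes equals log2 N
# H(1/4, 1/4, 1/4, 1/4) = 2 * H(1/2, 1/2), H(1/8, ...) = 3 * H(1/2, 1/2), and so on.
```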

However, history does not know the subjunctive mood, and science moves at its own unhurried pace.

In conclusion, it is worth noting that the key step is the very first one, in which we assume the existence of some function H that reaches a maximum for equally probable outcomes. Everything else follows of its own accord. This is further confirmation of the usefulness of extremal principles, in which we take the normal or correct state of a system to be one where some function of its state reaches an extreme value.

The main intrigue of the principle of maximum entropy is that it has two interpretations (stemming from two different sources) which, at least at first glance, are radically different in meaning. In the interpretation that traces its history back to Bernoulli's principle, we are talking about rules for organizing our descriptions of the world. We must describe the world in such a way as not to impose on it our prejudices, which are expressed in assigning unjustified probabilities to various events. Each time we should choose a description that contains nothing beyond what we know for certain. This is a heuristic rule that allows us to avoid distortions in our descriptions of reality.

The physical interpretation, with the help of which we can, in particular, derive the energy distribution of the molecules of an ideal gas, says something different. It sets the rules that govern not our description of reality, but reality itself. If a physical system is governed by some law and nothing else, then the distribution of parameters in it 1) will correspond to this law and 2) will have the maximum entropy among the allowed distributions. This is a statement not about how we can better describe the world, but about the world itself.

When the Cognitive Scientist Manifesto says that the structure of the world corresponds to the structure of our consciousness, we are talking about precisely these amazing “coincidences”: the best choice in constructing our descriptions of the world is also the best choice of nature itself.

To this it can be objected that Bernoulli's principle allows us to obtain more plausible descriptions of reality, and only for this reason can it be considered true. However, Bernoulli did not derive it empirically, by comparing it with reality. He put it forward based on the requirements of logic, on the properties of reason itself and its abstract constructions. (Moreover, he was aware of the big problem with the practical value of his principle in its original form: only in very rare circumstances can one see outcomes with equal probabilities in natural phenomena.) But it turns out that the world is subject to the same logic, and seems to carry the same reason, as our own.

We can better appreciate this surprising duality of the maximum entropy principle by contrasting it with an ideologically related principle that has not had the fortune of being so well formulated. We will try to remedy this.

Occam's razor and the principle of minimum complexity

A close relative of the principle of insufficient reason is Occam's famous razor. This is a rule that invites us, among alternative descriptions of the world, to prefer the simplest one, containing the minimum number of entities and parameters. If we reformulate this heuristic, the kinship of the two principles is easy to discern: among all alternative descriptions, one should choose the one containing the minimum of structural or algorithmic complexity. The point is that one should choose the model or description with the simplest algorithm. "Algorithmic complexity" is not a figure of speech; it is a measurable quantity directly related to entropy/information. It is also called algorithmic entropy, or Kolmogorov complexity, after the Russian mathematician A. N. Kolmogorov, who introduced this quantity into scientific use. The Kolmogorov complexity of a given string of characters is measured as the length of the program or algorithm required to reproduce that string: the more complex the organization of the string, the longer the program needed to reproduce it. Of course, the length of the program depends on the programming language, but this factor can be neglected if we assume that we write programs in some ideal, maximally economical and concise language.

Let, for example, the following notation in this ideal language mean taking the string "AB" and repeating it 10 times:

AB*10

We can say that the algorithmic complexity of this string is equal to 5 characters: this is exactly the length of the shortest program that generates it.

Another example: a given string of 20 characters has an algorithmic complexity of 12 characters, because that is the length of the program that generates it:

Let us pay attention to an important point: this string is unsystematic in the sense that we do not see in it a pattern that would allow us to shorten the algorithm. But that does not mean it is a random sequence of characters. If we needed to reproduce a genuinely random sequence, we would have to use a different program:

This is paradoxical: a completely random string turns out to have the same complexity as a completely ordered one. But in fact it is neither a homogeneous string nor a random one that has the highest complexity, but an unsystematic string, which is not random at all but, on the contrary, extremely regular. This is easy to understand: imagine that we poke a finger at random into an open book and always land on the same word. Clearly this situation is fundamentally different from the one in which we land on different words each time. We will see the importance of this nuance a little further on.

Note that despite the seemingly completely distant relationship between complexity according to Kolmogorov and entropy according to Shannon and Hartley, in fact it is possible to show their deep interrelation - but we will not go into this topic here.

So, we can look at some model or description as an algorithm that reproduces the required set of properties (the required "string"). Then Occam's razor requires choosing a description that has minimal algorithmic entropy.
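Kolmogorov complexity itself is not computable, but as a crude, purely illustrative proxy one can compare how strongly a general-purpose compressor shrinks different strings; this sketch only illustrates the idea and is not the author's construction:

```python
import random
import zlib

def compressed_size(s: str) -> int:
    """Length of the zlib-compressed string: a rough stand-in for algorithmic complexity."""
    return len(zlib.compress(s.encode()))

regular = "AB" * 1000                                     # highly ordered string
rand = "".join(random.choice("AB") for _ in range(2000))  # random string of the same length

print(compressed_size(regular))  # small: a short "program" suffices
print(compressed_size(rand))     # much larger: no pattern for the compressor to exploit
```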

A historical story that can serve as an example of a situation in which this principle would be useful is the confrontation between the systems of Ptolemy and Copernicus. The Ptolemaic system is a model of the universe based on the naive religious belief that the Earth should be at the center of the universe.

Heavenly bodies, including the Sun, revolve around the Earth in orbits. However, despite the ideological correctness of this design, it had a significant drawback: within its framework it was impossible to explain why the planets change the direction of their movement across the vault of heaven. Say, Jupiter moves forward relative to the stars over the course of several weeks. But then it makes a loop and moves in the opposite direction for some time, after which it returns to the "correct" motion. To explain this phenomenon, Ptolemy introduced so-called epicycles into his system: he assumed that in addition to revolving around the Earth, each luminary additionally moves in a small orbit around a certain center, which in turn revolves around the Earth in a circular orbit. At those moments when Jupiter moves backward along its epicycle, we see a change in the direction of its movement across the sky.

Copernicus proposed another system, with the Sun at the center (as the reader has surely heard). The Copernican system was able to explain the loops of Jupiter and the other luminaries without introducing epicycles: the simple circular motion of the planets is enough for us, watching from the Earth, to sometimes see loops in their movement. Even leaving aside the accuracy of predictions of planetary motion across the firmament, the Copernican system obviously has lower algorithmic complexity while still being able to "reproduce the required string." Thus, if we are guided by Occam's principle, we should prefer the Copernican system.

But does Occam's razor have an analogue in the properties of reality itself, as the principle of insufficient reason does? The author is confident the answer is yes. Let us try to formulate it and call it the principle of minimal structural complexity: a system that is potentially capable of having different structures takes on the structure with the minimal Kolmogorov complexity, given the external requirements on the properties of this system.

This is where the difference between random strings and extremely regular ones turns out to be important. By documenting the position and speed of molecules in a vessel with a gas, each time we will receive a set of numbers that is close to random - a “random string”. But if we got the same result every time, this would indicate that the system is in an extremely structurally complex state.

Let us note one example, very important for us, of structures with low algorithmic complexity: fractals, which develop as a result of repeating the same generating transformations at different scale levels. Algorithmically these are simple structures. Perhaps the principle of minimal structural complexity can explain the pervasive prevalence of fractal structures in the most varied phenomena of the world.

However, this is still only a vague idea.

Earlier we saw how the principle of maximum entropy is related to the second law of thermodynamics. Perhaps the principle of minimal structural complexity points to another consequence of the second law. It can be formulated like this: if at the initial moment of time the structure of a system is not of minimal complexity, the system evolves in the direction of decreasing complexity, reaching the possible minimum.

If this interpretation of the second law of thermodynamics is correct, a question arises for its usual interpretation: if the entropy of the world as a system only increases, why has the universe not yet reached the state of maximum entropy (and minimum structural complexity) known as "heat death"? Science cannot answer this question. Maybe, as materialists are inclined to answer, it simply has not had time yet. Or maybe our universe is not a closed but an open system, receiving from somewhere a resource that allows it to cope with the second law of thermodynamics. This opinion is shared by idealists, among whom the author counts himself. We do not yet have enough knowledge to settle this dilemma.

Let us conclude this Prologue by “sharing the skin of an unkilled bear” and admiring the fact that Occam’s razor is not only capable of cutting off all unnecessary things from our mental constructs, but also cutting off all unnecessary things from the structure of the world, so that it appears before us in the simplest, most elegant appearance of all possible. How can one not recall Leibniz, who believed that we live in the best of possible worlds?

By finding the maximum value of entropy, we obtain the law of distribution of molecules among energy levels, completely analogous to the classical case.
Here $H_{max}$ denotes the maximum entropy value.
If there is a maximum entropy value H (x, y), the system has no organization and the values ​​of x and y are not related to each other.
It is impossible to prove the need for a maximum entropy value for the equilibrium state of a system based on the generalized thermodynamic equation. However, equilibrium is impossible at non-maximum entropy values.
Formula (1.1) expresses the maximum value of entropy (1.8); when all possible states of a system are equally probable, it is the most disordered, and therefore its entropy should have the greatest value.
In other words, the maximum value of the entropy of a corrosion pair with a finite number of states is equal to the logarithm of this number and reaches Smax when all states are equally probable. If the states of the corrosion pair are known in advance, then its entropy is zero.
The state of the system with the maximum entropy value is the state of stable equilibrium. Indeed, in this state, irreversible processes cannot occur in the system, since otherwise the entropy of the system would have to increase, which cannot be the case.
Since the equilibrium state corresponds to the maximum entropy value, and the flows in this state disappear, all parameters in the equilibrium state become zero.
A metastable equilibrium state is also characterized by a maximum entropy value (and minima of energy and thermodynamic potentials), but other equilibrium states are possible for the system, in which, for the same values ​​of energy, volume and amounts of substances, entropy has even greater values.
If the thermodynamic equilibrium, which corresponds to the maximum value of entropy, is only statistical in nature, then one should expect deviations from the most probable values when observing very small regions. Associated with these density fluctuations is the scattering of light in the atmosphere, in particular the color of the sky; the theory of this phenomenon makes it possible to calculate Avogadro's number from the spectral distribution of the scattered light intensity. If a liquid contains particles that are small but still visible under a microscope (colloidal particles), their irregular trembling can be seen, due to the fact that the impacts of the liquid molecules from different sides are not exactly balanced at every moment: first on one side, then on the other, the particle is struck by a larger number of molecules and is displaced in the corresponding direction. The essence of this phenomenon, called Brownian motion (after the botanist Robert Brown), remained unclear for a long time. Kinetic theory predicts enormous thermal velocities for such a particle, yet under a microscope a speed many orders of magnitude lower is observed if it is defined in the usual way as the ratio of path to time. In reality, the velocity of the particle changes direction so often that the observed motion is only a rough outline of the true zigzag motion.
If the system is in a state of equilibrium, characterized by a maximum entropy value, then the most likely processes will be those in which the entropy of the system does not change. A comparison of these conclusions with the second law of thermodynamics shows their equivalence.
In this example (with two possible outcomes), the maximum entropy value is one binary unit.
It should be taken into account that the Flory distribution gives the maximum entropy value.
Change in entropy of an isolated system of finite dimensions.
The system is basically in an equilibrium state corresponding to the maximum value of the entropy of the system; Having deviated from this state, the system then returns to it. When observing a system for a long time, it can be noted that cases of increase and decrease in entropy occur equally often, and the repetition time of any deviation of the system from an equilibrium state is longer, the lower the probability of a given non-equilibrium state. As system size increases, repeatability times increase rapidly. Therefore, processes that are irreversible from the point of view of ordinary thermodynamics appear to be practically irreversible from a statistical point of view. This circumstance brings both formulations of the second law of thermodynamics closer together and practically eliminates the difference noted above.
Let us prove for the simplest case (a single-phase system) that the maximum value of entropy, or the minimum value of the free energy of the system, corresponds to an equal distribution of isotopes. Let us further assume that the compound AX contains a certain number of atoms of the element X participating in the exchange.
It can be shown that for a given dispersion of states σ, the distribution according to the normal law gives the maximum entropy value.
Spontaneous processes in isolated systems can only proceed in the direction of increasing entropy, and the maximum value of entropy corresponds to equilibrium.
By introducing rates and considering non-equilibrium states that represent organisms, we are deprived of such a reliable criterion as the maximum value of entropy, and we must try to find other grounds for selecting states that are stable.
When such a system is deformed, the total amount of statistical disorder decreases, so the system tends to return to the state corresponding to the maximum entropy value.
The depth of spontaneous processes is determined by the entropy value of each of the bodies between which some process is carried out, which stops when the maximum entropy value is reached, after which the system enters a thermal equilibrium state, from which it cannot spontaneously exit.
When deriving this equation, the basic assumption is made about the independence of the reactivity of molecules from the molecular weight, as well as the assumption about the maximum value of entropy for a given equilibrium fractional composition, about a change in the fractional composition at a given average molecular weight only due to a change in entropy.
The equilibrium of heterogeneous systems corresponds to the equality of the chemical potentials of each component in all phases, as well as the minimum value of isochoric or isobaric potentials or the maximum value of entropy of the entire system under certain conditions. If a system includes at least one phase, the composition of which changes as it approaches equilibrium, then the equilibrium state of the phase and the entire system is characterized by an equilibrium constant, for example, in systems consisting of individual substances in a condensed state and gases. In systems consisting of individual substances in a condensed state, in which the composition of the phases does not change during the process, and the process continues until the complete disappearance of one of the starting substances (for example, polymorphic transformations of substances), the concept of an equilibrium constant is not applicable.
The equilibrium of heterogeneous systems corresponds to the equality of the chemical potentials of each component in all phases, as well as the minimum value of one of the thermodynamic potentials or the maximum value of the entropy of the entire system under appropriate conditions. The most common conditions in practice are constant temperature and constant pressure, so we will evaluate the equilibrium of heterogeneous systems by their isobaric potential.
Taking into account the molecular nature of the working substance and fluctuations of the internal parameters in it, it can be noted that without the establishment of equilibrium in the system the maximum entropy value cannot be reached. Fluctuations bring the system to equilibrium. It is fluctuations in systems that drive them toward the maximum-entropy equilibrium state whenever this condition is not met, that is, whenever the system is taken out of equilibrium.

Thus, the main reason for elasticity during deformation in a highly elastic state and the occurrence of stress in the sample is a change in conformation and a transition from the equilibrium form of a statistical coil with a maximum entropy value to a nonequilibrium one with a decrease in entropy and a reverse transition after the deformation stops. The contribution of the energy component to this process is small, and for ideal grids it is equal to zero.
THERMAL DEATH OF THE UNIVERSE is the final state of the world, which supposedly arises as a result of the irreversible transformation of all forms of motion into heat, the dissipation of heat in space, and the transition of the world to a state of equilibrium with the maximum value of entropy. This conclusion is made on the basis of absolutizing the second law of thermodynamics and extending it to the entire universe. The formation of stars and galaxies is one of the manifestations of this process.
The second law of thermodynamics establishes that irreversible processes (and these are practically all thermal processes and, in any case, all naturally occurring processes) proceed in such a way that the entropy of the system of bodies participating in the process grows, tending to its maximum value. The maximum value of entropy is achieved when the system reaches a state of equilibrium.
The property of entropy to increase in irreversible processes, and irreversibility itself, is in conflict with the reversibility of all mechanical movements, and therefore the physical meaning of entropy is not as obvious as, for example, the physical meaning of internal energy. The maximum value of the entropy of a closed system is achieved when the system reaches a state of thermodynamic equilibrium. This quantitative formulation of the second law of thermodynamics was given by Clausius, and its molecular-kinetic interpretation by Boltzmann, who introduced statistical concepts into the theory of heat based on the fact that the irreversibility of thermal processes is probabilistic in nature.
Relationship (IX.2) expresses the fact that for the equilibrium state of an isolated system there is a conditional maximum of entropy. The maximum value of the entropy of an isolated system is determined by the given values ​​of the energy and volume of the system, as well as the masses, and therefore the number of moles of the components.
The growth of entropy in any process does not continue indefinitely, but only up to a certain maximum value characteristic of a given system. This maximum entropy value corresponds to a state of equilibrium, and after it is reached, any changes in state without external influence cease.
Thus, in the case of equally probable input events, entropy corresponds to the amount of information for equally probable outcomes. The Hartley measure corresponds to the maximum entropy value. Physically, this describes the case when the uncertainty is so great that the outcome is difficult to predict.
$H_{max}$ is the maximum entropy possible for all compositions with a given number of components. Obviously, compositions in which all components are present in equal concentrations have the maximum entropy value.
As we can see, the highest thermodynamic probability will occur when the molecules are evenly distributed over the areas. This uniform distribution corresponds to the maximum entropy value.
A more rigorous development of this issue is given in statistical thermodynamics. We only note that the maximum entropy value corresponding to the equilibrium state is considered only as the most probable. Over a sufficiently long period of time, deviations from it are possible. In macrosystems, this requires times of astronomical order. In microscopic volumes, inside the bodies around us, such changes occur constantly.
It is clear from this that these processes will continue until the entropy of the system reaches a maximum. The state of an isolated system with a maximum entropy value is a state of stable equilibrium.

The statistical nature of the law of increasing entropy follows from the very definition of entropy (III.70), which connects this function with the probability of a given macroscopic state of the system. However, the equilibrium state, which corresponds to the maximum value of the entropy of an isolated system, is most likely, and for macroscopic systems the maximum is extremely sharp. Almost the entire volume of the energy layer corresponds to the equilibrium state of a macroscopic isolated system, and the representing point of the system is located precisely in this region with a probability close to unity. If the system is not in a state corresponding to the equilibrium value of the macroscopic parameter X (to within the DH interval), it will almost certainly come to this state; if the system is already in this state, it will very rarely leave it.
The most general equilibrium conditions follow from the statement of the second law of thermodynamics about the growth of entropy of an adiabatically isolated system when irreversible processes occur in it. If a certain state of such a system is characterized by a maximum entropy value, then this state cannot be nonequilibrium, since otherwise, during relaxation, the entropy of the system would increase according to the second law, which is not consistent with the assumption of its maximum. Consequently, the condition of maximum entropy of an isolated system is a sufficient condition for its equilibrium.