The Markov chain is simple: let's look at the principle in detail.

In web development and SEO, Markov chains are used to generate pseudo-meaningful texts from source texts. This is used for stuffing doorway pages with specified keywords, padding out content with text mass, and similar "black-hat" tricks. Fortunately, search engines have learned to effectively identify content generated with Markov chains and ban such clever operators. I'm not going to teach you those technologies - there are dedicated shady sites for that; I'm only interested in the software implementation of the algorithm.


A Markov chain is a sequence of trials, in each of which exactly one of k mutually exclusive events Ai from a full group occurs. The conditional probability pij(s) that event Aj occurs in the s-th trial, given that event Ai occurred in the (s-1)-th trial, does not depend on the results of earlier trials.

Those who want to melt their brains can read up on the mathematical model. In human language, all these formulas boil down to the following. The words in the source text are identified, and the order in which words follow one another is recorded. Then, based on this data, a new text is created in which the words themselves are chosen randomly, but the connections between them are preserved. Let's take a nursery rhyme as an example:

Because of the forest, because of the mountains
Grandfather Egor is coming:
myself on a horse,
wife on a cow,
children on calves,
grandchildren on baby goats.

Let's parse the text into links and connections:

Because of [forest, mountains]
forests [due to]
mountains [rides]
[grandfather] is coming
grandfather [Egor]
Egor [himself]
myself [on]
on [horse, cow, calves, kids]
horse [wife]
wife [on]
cow [children]
children [on]
calves [grandchildren]
grandchildren [on]

The links in this list are the unique words from the text, and in square brackets the connections are listed - the words that can appear after each link.

When generating text from the list of links, at the first iteration a random link is selected and its connections are determined; then a random connection is chosen from that list and accepted as the new link. This is repeated until the text reaches the required size. The result, for example, might be something like this:

Egor himself on a calf, grandchildren on a horse, wife on a cow, children on a cow
In this example, the resulting text differs little from the original, since the source text is very short. If you take an initial dictionary of several kilobytes or even megabytes, the output will be a reasonably coherent text, although it will not make any sense.

// Read the source text on the basis of which the new one will be generated
$str = file_get_contents("markov.txt");
// Set the system encoding
setlocale(LC_ALL, "ru_RU.CP1251");
// Remove all characters except letters, digits and some punctuation marks
$str = preg_replace("/[^-a-zа-я0-9 !?.,]/i", " ", $str);
// Remove spaces before sentence-ending punctuation
$str = preg_replace("/ +([!?.,])/", "\\1", $str);
// Split the text into words
$tmp = preg_split("/[[:space:]]+/", $str, -1, PREG_SPLIT_NO_EMPTY);
// Array of "links"
$words = array();
// Fill in the connections
for ($i = 0; $i < count($tmp) - 1; $i++) {
    $words[$tmp[$i]][] = $tmp[$i + 1];
}
$words = array_map("array_unique", $words);
// Array of words that begin sentences
$start = array();
foreach ($words as $word => $links) {
    if (preg_match("/^[A-ZА-Я]/", $word)) {
        $start[] = $word;
    }
}
// Generate 100 sentences based on the source text
for ($i = 0; $i < 100; $i++) {
    while (true) {
        $w = $start[rand(0, count($start) - 1)];
        // Skip a starting word that already ends a sentence
        if (preg_match("/[.!?]$/", $w)) { continue; }
        $sentence = $w . " ";
        // Number of words in the sentence
        $cnt = 1;
        // Build the sentence
        while (true) {
            if (!isset($words[$w])) { break; }
            $links = array_values($words[$w]);
            // Step to the next link in the chain
            $w = $links[rand(0, count($links) - 1)];
            $sentence .= $w . " ";
            // Stop if the word ends the sentence
            if (preg_match("/[.!?]$/", $w)) { break; }
            $cnt++;
            // If the generator gets stuck in a loop, force an exit
            if ($cnt > 19) { break; }
        }
        // A sentence of 5-20 words is considered successful
        if ($cnt > 5 && $cnt < 20) { break; }
    }
    // Output the generated sentence
    echo $sentence;
}

A little explanation of how it all works. First, the file "markov.txt" is loaded; it must be in windows-1251 encoding. Then all characters except letters and some punctuation marks are removed, and extra spaces are cut out. The result is clean text, which is then split into individual words - these are the individual links of the chain. Next we determine the connections between words, that is, which words can follow which. This is the most resource-intensive step, so be patient with large files. If generation is needed frequently, it probably makes sense to store the array of links and connections in a database for quick access. The next step is identifying the words that begin sentences. I accepted the condition that the first letter of such words must be capitalized; you can use a more precise criterion. Text generation follows the algorithm described above; I just added several checks against looping.

From the script above you can assemble a working example of a text generator based on Markov chains.

This article gives a general idea of how to generate texts by modeling Markov processes. In particular, we will introduce Markov chains and, as practice, implement a small text generator in Python.

To begin with, let's write down the necessary, but not yet very clear, definitions from the Wikipedia page, so as to at least roughly understand what we are dealing with:

Markov process - a random process whose evolution after any given value of the time parameter t does not depend on the evolution that preceded t, provided that the value of the process at that moment is fixed.

Markov chain - a special case of a Markov process, in which the space of its states is discrete (i.e., at most countable).

What does all of this mean? Let's figure it out.

Basics

The first example is extremely simple. Using a sentence from a children's book, we will master the basic concept of Markov chains and define what corpus, links, probability distribution, and histograms mean in our context. Although the sentence is in English, the essence of the theory will be easy to grasp.

This sentence is the corpus, that is, the base on which the text will later be generated. It consists of eight words, but only five of them are unique words - links (we are, after all, talking about Markov chains). For clarity, let's color each link its own color:

And we write down the number of appearances of each link in the text:

In the picture above you can see that the word "fish" appears in the text 4 times more often than each of the other words ("One", "two", "red", "blue"). That is, the probability of encountering the word "fish" in our corpus is 4 times higher than the probability of encountering any other word shown in the figure. Speaking in the language of mathematics, we can determine the distribution law of a random variable and calculate the probability that a particular word appears in the text. The probability is calculated as follows: divide the number of occurrences of the word of interest by the total number of words in the corpus. For the word "fish" this probability is 50%, since it appears 4 times in an 8-word sentence. For each of the remaining links this probability is 12.5% (1/8).
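The calculation above can be checked with a short Python sketch. The exact wording of the sentence is an assumption reconstructed from the word counts given in the text:

```python
from collections import Counter

# Assumed corpus sentence (reconstructed from the counts in the article)
corpus = "One fish two fish red fish blue fish".split()

counts = Counter(corpus)
total = len(corpus)  # 8 words
probabilities = {word: n / total for word, n in counts.items()}

print(probabilities)
# "fish" -> 0.5, every other word -> 0.125
```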

The distribution of a random variable can be represented graphically using a histogram. In this case, the frequency of occurrence of each link in the sentence is clearly visible:

So, our text consists of words and unique links, and on a histogram we displayed the probability distribution of each link's appearance in the sentence. If you think statistics is not worth bothering with, read on. It may even save your life.

The essence of the definition

Now let's add to our text elements that are always implied, but not voiced in everyday speech - the beginning and end of the sentence:

Any sentence contains these invisible “beginning” and “end”; let’s add them as links to our distribution:

Let's return to the definition given at the beginning of the article:

Markov process - a random process whose evolution after any given value of the time parameter t does not depend on the evolution that preceded t, provided that the value of the process at that moment is fixed.

Markov chain - a special case of a Markov process, in which the space of its states is discrete (i.e., at most countable).

So what does this mean? Roughly speaking, we are modeling a process in which the state of the system at the next moment in time depends only on its state at the current moment, and does not depend in any way on all the previous states.

Imagine a window in front of you that displays only the current state of the system (in our case, a single word), and you need to determine the next word based only on the data shown in this window. In our corpus, words follow one another according to the following pattern:

Thus, pairs of words are formed (even the end of the sentence has its own pair - an empty value):

Let's group these pairs by the first word. We will see that each word has its own set of links which, in the context of our sentence, can follow it:

Let's present this information in another way - for each link we assign an array of all words that may appear in the text after this link:
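The grouping described above can be sketched in a few lines of Python. The "START" and "END" tokens here are placeholder names for the invisible beginning and end of the sentence, and the sentence wording is an assumption based on the counts in the text:

```python
# Corpus with explicit start/end markers (assumed wording)
words = ["START", "One", "fish", "two", "fish",
         "red", "fish", "blue", "fish", "END"]

# For each link, collect every word that follows it in the corpus
followers = {}
for current, nxt in zip(words, words[1:]):
    followers.setdefault(current, []).append(nxt)

print(followers["fish"])  # ['two', 'red', 'blue', 'END']
```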

Let's take a closer look. Each link has words that can come after it in the sentence. If we showed the diagram above to someone else, that person could, with some probability, reconstruct our initial sentence, that is, the corpus.

Example. Let's start with the word "Start". Next we select the word "One", since according to our scheme it is the only word that can follow the beginning of a sentence. The word "One" can also be followed by only one word - "fish". Now the new sentence in its intermediate version looks like "One fish". Further the situation becomes more complicated: after "fish", with an equal probability of 25% each, there can be the words "two", "red", "blue", or the end of the sentence, "End". If we assume the next word is "two", reconstruction continues. But we could also choose the link "End". In that case, based on our scheme, a randomly generated sentence would be produced that differs greatly from the corpus - "One fish".

We have just simulated a Markov process - we determined each next word only on the basis of knowledge about the current one. To fully understand the material, let's build diagrams showing the dependencies between the elements inside our corpus. The ovals represent links. The arrows lead to potential links that can follow the word in the oval. Next to each arrow is the probability with which the next link will appear after the current one:

Great! We have learned the information necessary to move on and analyze more complex models.

Expanding the vocabulary base

In this part of the article we will build a model according to the same principle as before, but in the description we will omit some steps. If you have any difficulties, return to the theory in the first block.

Let's take four more quotes from the same author (also in English, it won't hurt us):

"Today you are you. That is truer than true. There is no one alive who is you-er than you.”

« You have brains in your head. You have feet in your shoes. You can steer yourself any direction you choose. You're on your own."

“The more that you read, the more things you will know.” The more that you learn, the more places you’ll go.”

"Think left and think right and think low and think high. Oh, the thinks you can think up if only you try.”

The complexity of the corpus has increased, but in our case that is only a plus - now the text generator will be able to produce more meaningful sentences. The fact is that in any language there are words that appear in speech more often than others (for example, we use the preposition "in" much more often than the word "cryogenic"). The more words in our corpus (and hence the more dependencies between them), the more information the generator has about which word is most likely to appear in the text after the current one.

The easiest way to explain this is from the program's point of view. We know that each link has a set of words that can follow it, and that each word is characterized by the number of its appearances in the text. We need a way to capture all this information in one place; a dictionary storing "(key, value)" pairs is best suited for this. The dictionary key records the current state of the system, that is, one of the links of the corpus (for example, "the" in the picture below); and another dictionary is stored in the dictionary value. In the nested dictionary, the keys are the words that can appear in the text after the current link ("thinks" and "more" can follow "the"), and the values are the number of appearances of those words in the text after our link (the word "thinks" appears after the word "the" 1 time, the word "more" appears after "the" 4 times):
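The nested-dictionary idea can be sketched directly in Python. The corpus here is an assumption: two of the quotes above, lowercased and stripped of punctuation for simplicity:

```python
from collections import defaultdict

# Assumed mini-corpus: two quotes, lowercased, punctuation removed
corpus = ("the more that you read the more things you will know "
          "the more that you learn the more places you will go "
          "oh the thinks you can think up if only you try").split()

# Outer dict: link -> (inner dict: follower word -> occurrence count)
model = defaultdict(lambda: defaultdict(int))
for current, nxt in zip(corpus, corpus[1:]):
    model[current][nxt] += 1

print(dict(model["the"]))  # {'more': 4, 'thinks': 1}
```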

Re-read the paragraph above several times to make sure you understand it exactly. Note that the nested dictionary in this case is essentially a histogram; it helps us track links and the frequency of their appearance in the text relative to other words. It should be noted that even this vocabulary base is very small for proper generation of natural-language text: it should contain more than 20,000 words, better more than 100,000, and ideally more than 500,000. But let's work with the vocabulary base we have.

The Markov chain in this case is constructed just as in the first example: each next word is selected only on the basis of the current word, and all other words are ignored. But thanks to the dictionary storing data about which words appear more often than others, we can make an informed decision when choosing. Let's look at a specific example:

More:

That is, if the current word is "more", it can be followed with an equal probability of 25% each by the words "things" and "places", and with a probability of 50% by the word "that". But the probabilities can also all be equal:

Think:

Working with Windows

Until now, we have only considered windows one word in size. You can increase the window size so that the text generator produces more "verified" sentences: the larger the window, the smaller the deviations from the corpus during generation. Increasing the window size corresponds to moving the Markov chain to a higher order. Previously we built a first-order chain; a window of two words gives a second-order chain, three words a third-order chain, and so on.

A window is the data about the current state of the system that is used to make decisions. If we combine a large window with a small data set, we will most likely get the same sentence every time. Let's take the vocabulary base from our first example and expand the window to size 2:

The extension means that each window now has only one option for the next state of the system - no matter what we do, we will always get the same sentence, identical to our corpus. Therefore, in order to experiment with windows and have the text generator return unique content, stock up on a vocabulary base of at least 500,000 words.
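A quick sketch shows why a window of size 2 on the first corpus is deterministic (the sentence wording is again an assumption reconstructed from the counts):

```python
# Second-order model: the key is a window of two consecutive words
corpus = "One fish two fish red fish blue fish".split()
order = 2

model = {}
for i in range(len(corpus) - order):
    window = tuple(corpus[i:i + order])
    model.setdefault(window, []).append(corpus[i + order])

# With this tiny corpus, every window has exactly one possible next word,
# so generation always reproduces the original sentence
print(all(len(options) == 1 for options in model.values()))  # True
```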

Implementation in Python

Dictogram data structure

A Dictogram (dict is the built-in dictionary data type in Python) will represent the relationship between links and their frequency of appearance in the text, that is, their distribution. At the same time it has the dictionary property we need: lookup time does not depend on the amount of input data, which means we are building an efficient algorithm.

import random

class Dictogram(dict):
    def __init__(self, iterable=None):
        # Initialize the distribution as a new dict object,
        # adding any existing elements
        super(Dictogram, self).__init__()
        self.types = 0   # number of unique keys in the distribution
        self.tokens = 0  # total number of all words in the distribution
        if iterable:
            self.update(iterable)

    def update(self, iterable):
        # Update the distribution with elements from an iterable data set
        for item in iterable:
            if item in self:
                self[item] += 1
                self.tokens += 1
            else:
                self[item] = 1
                self.types += 1
                self.tokens += 1

    def count(self, item):
        # Return the item's counter value, or 0 if it is absent
        if item in self:
            return self[item]
        return 0

    def return_random_word(self):
        random_key = random.sample(list(self), 1)
        # Another way: random.choice(list(self.keys()))
        return random_key[0]

    def return_weighted_random_word(self):
        # Generate a pseudo-random number between 0 and (n - 1),
        # where n is the total number of words
        random_int = random.randint(0, self.tokens - 1)
        index = 0
        list_of_keys = list(self.keys())
        for i in range(0, self.types):
            index += self[list_of_keys[i]]
            if index > random_int:
                return list_of_keys[i]

The constructor of the Dictogram structure can be passed any object that can be iterated over. The elements of this object will be the words used to initialize the Dictogram, for example, all the words from a book. Here we count the elements so that accessing any of them does not require traversing the entire data set each time.

We also made two functions to return a random word. One function selects a random key in the dictionary, and the other, taking into account the number of occurrences of each word in the text, returns the word we need.

Markov chain structure

from histograms import Dictogram

def make_markov_model(data):
    markov_model = dict()
    for i in range(0, len(data) - 1):
        if data[i] in markov_model:
            # Just append to an existing distribution
            markov_model[data[i]].update([data[i + 1]])
        else:
            markov_model[data[i]] = Dictogram([data[i + 1]])
    return markov_model

In the implementation above, we have a dictionary that stores windows as the keys in "(key, value)" pairs and distributions as the values.

Nth order Markov chain structure

from histograms import Dictogram

def make_higher_order_markov_model(order, data):
    markov_model = dict()
    for i in range(0, len(data) - order):
        # Create the window
        window = tuple(data[i:i + order])
        # Add to the dictionary
        if window in markov_model:
            # Attach to an existing distribution
            markov_model[window].update([data[i + order]])
        else:
            markov_model[window] = Dictogram([data[i + order]])
    return markov_model

This is very similar to a first-order Markov chain, but in this case we store a tuple as the key in the "(key, value)" pair in the dictionary. We use a tuple instead of a list because a tuple is protected from any changes, and this is important for us: the keys in a dictionary must not change.

Model parsing

Great, we've implemented the dictionary. But how can we now generate content based on the current state and the step to the next state? Let's go through our model:

import random
from histograms import Dictogram

def generate_random_start(model):
    # To generate any starting word, uncomment the line:
    # return random.choice(list(model.keys()))
    # To generate a "correct" starting word, use the code below:
    # correct initial words are those that began sentences in the corpus
    if "END" in model:
        seed_word = "END"
        while seed_word == "END":
            seed_word = model["END"].return_weighted_random_word()
        return seed_word
    return random.choice(list(model.keys()))

def generate_random_sentence(length, markov_model):
    current_word = generate_random_start(markov_model)
    sentence = [current_word]
    for i in range(0, length):
        current_dictogram = markov_model[current_word]
        random_weighted_word = current_dictogram.return_weighted_random_word()
        current_word = random_weighted_word
        sentence.append(current_word)
    sentence[0] = sentence[0].capitalize()
    return " ".join(sentence) + "."

What's next?

Try to think of where you can use a text generator based on Markov chains yourself. Just don’t forget that the most important thing is how you parse the model and what special restrictions you set on generation. The author of this article, for example, when creating the tweet generator, used a large window, limited the generated content to 140 characters, and used only “correct” words to begin sentences, that is, those that were the beginning of sentences in the corpus.

Earlier it was described how to train a neural network to play Mario or control a robot. But can a neural network generate text? Markov chains can help with this.

This is why I "love" the Russian-language Wikipedia: any simple phenomenon/equation/rule, especially a mathematical one, is immediately described in such general terms, with such mind-boggling formulas, that you can't figure it out without half a liter. Moreover, the authors of the articles do not bother to give a simple description in plain human language (at least a couple of sentences) before diving straight into the formulas.

If someone wants to know what Markov chains are, then from the very first definition he will learn that:
"A Markov chain is a sequence of random events with a finite or countable number of outcomes, characterized by the property that, loosely speaking, with a fixed present, the future is independent of the past. Named in honor of A. A. Markov (senior)."

And this despite the fact that the basic idea of Markov chains is very simple - it is just impossible to extract it from Wikipedia without a mathematical education.

Markov chains are just a description of the probabilities of a system's transition from one state to another. All the states can be described by the vertices of a graph. For example, such vertices could be human body positions: [lying], [sitting], [standing], [walking].

Here you can see that the graph is directed, which means you cannot get from every state to every other one directly. For example, if you are lying down, you cannot start walking right away: you have to sit up first, then stand, and only then walk. But you can fall and end up lying down from any position))
Every connection has a certain probability. For example, the probability of falling from a standing position is very small; it is much more likely that you will keep standing, start walking, or sit down. The sum of all outgoing probabilities is 1.
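The position graph can be sketched as a Python dictionary of dictionaries. The specific probability values below are illustrative assumptions, not numbers from the article; only the structure (no direct lying-to-walking edge, rows summing to 1) follows the description:

```python
# Each state maps to its outgoing transitions; weights are assumed
graph = {
    "lying":    {"sitting": 0.9, "lying": 0.1},
    "sitting":  {"standing": 0.6, "lying": 0.2, "sitting": 0.2},
    "standing": {"walking": 0.4, "sitting": 0.3, "standing": 0.29, "lying": 0.01},
    "walking":  {"standing": 0.3, "walking": 0.69, "lying": 0.01},
}

# You cannot walk straight from lying down...
print("walking" in graph["lying"])  # False

# ...and the outgoing probabilities of every state sum to 1
for state, edges in graph.items():
    assert abs(sum(edges.values()) - 1.0) < 1e-9
```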

Among other things, Markov chains allow you to generate events. One way or another, most text generators are built on Markov chains.

Let's try to write a generator of pies.

Pies

Pies (pirozhki) are quatrains without rhyme, punctuation, numbers, or capital letters. The number of syllables per line should be 9-8-9-8.


Most text generators use morphological analyzers, but we'll make it simpler. We'll just break the words into syllables and calculate the probability that one syllable follows another. That is, the nodes of the graph will be the syllables, and the edges with their weights will be the frequency with which the second syllable follows the first.
Next, we feed the program fifty pies.

For example, after the syllable “at” there may be the following syllables (edges and their weights):
"chem" (1) "ho" (4) "me" (1) "du" (2) "chi" (4) "yatel" (4) "went" (5) "ku" (1) " " (9) "su"(1) "vych"(3) "mi"(1) "kos"(1) "ob"(1) "det"(2) "drove"(1) "uchi"(1 ) "mu"(1) "bi"(1) "tse"(1) "int"(2) "tom"(1) "ko"(1) "shaft"(1) "nes"(1) " det"(1) "but"(1) "vez"(1) "meth"(1) "vet"(1) "dia"(1) "you"(1)

Now all you need to do is take a random syllable (for example, "at"). The sum of the weights of all the syllables that can follow it is 58. The next syllable is then chosen with these frequencies (weights) taken into account:

size_t nth = rand() % count;
size_t all = 0;
for (const auto &n : next) {
    all += n.count;
    if (all > nth)
        return n.word;
}

Thus, we generate lines so that the first has 9 syllables, the second 8, then 9 and 8 again. We get:

Once upon a time there was a joke about the alarm clock
he was kidnapped while it was
your boss is here yes
Onegin effects sofa

So far it doesn't look like particularly coherent text, and non-existent words ("poku") appear often. The reason is that the key is currently a single syllable, and it is difficult to build a sentence from one syllable of context. Let's increase the number of syllables on the basis of which the next syllable is generated to at least 3:

Enough asphalt for the mind
it's seven o'clock divided by
the table is taken out, the black box is
he grew up, took revenge, found
Here is the first pie that could more or less be mistaken for one written by a person.
To make the text more meaningful, you need to use morphological analyzers; then the nodes will be not syllables but meta-descriptions of words (for example, "verb, plural, past tense").

Such programs already allow you to write more "meaningful" texts. For example, "Rooter" is a paper written by a scientific text generator that passed review and was even accepted by a scientific venue.

A Markov chain is a series of events in which each subsequent event depends on the previous one. In this article we will examine this concept in more detail.

A Markov chain is a common and fairly simple way to model random events. It is used in a wide variety of areas, from text generation to financial modeling. The best-known example is SubredditSimulator, where a Markov chain is used to automate content creation for an entire subreddit.

A Markov chain is clear and easy to use because it can be implemented without any advanced statistical or mathematical concepts. A Markov chain is ideal for learning probabilistic modeling and data science.

Scenario

Imagine there are only two weather conditions: it can be either sunny or cloudy. You can always accurately determine the weather at the current moment - it is guaranteed to be either clear or cloudy.

Now you want to learn to predict tomorrow's weather. Intuitively, you understand that the weather cannot change dramatically in one day: many factors influence it, and tomorrow's weather directly depends on the current weather. So, to predict the weather, you collect data over several years and conclude that after a cloudy day the probability of a sunny day is 0.25. It is then logical to assume that the probability of two cloudy days in a row is 0.75, since we have only two possible weather conditions.
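A minimal simulation of this forecast can be sketched in Python. Note an assumption: the article only gives the probabilities after a cloudy day, so the sunny row below is assumed to be symmetric:

```python
import random

# Transition table from the example; the "sunny" row is an assumption
transitions = {
    "cloudy": {"cloudy": 0.75, "sunny": 0.25},
    "sunny":  {"sunny": 0.75, "cloudy": 0.25},
}

def next_day(today):
    # Pick the next state with the probabilities of the current row
    r = random.random()
    cumulative = 0.0
    for weather, p in transitions[today].items():
        cumulative += p
        if r < cumulative:
            return weather
    return weather  # guard against floating-point rounding

# Forecast six days ahead, starting from a cloudy day
forecast = ["cloudy"]
for _ in range(6):
    forecast.append(next_day(forecast[-1]))
print(forecast)
```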

Now you can forecast the weather several days in advance based on the current weather.

This example illustrates the key concepts of Markov chains. A Markov chain consists of a set of transitions determined by a probability distribution, which in turn satisfies the Markov property.

Note that in the example the probability distribution depends only on the transition from the current day to the next. This is the unique property of a Markov process: it is memoryless. As a rule, this approach cannot create a sequence that follows a long-term trend. For example, while a Markov chain can imitate a writing style based on the frequency of word use, it cannot create texts with deep meaning, since it only considers the immediately preceding state. This is why a Markov chain cannot produce context-dependent content.

Model

Formally, a Markov chain is a probabilistic automaton. Its transition probability distribution is usually represented as a matrix. If a Markov chain has N possible states, the matrix has the form N x N, in which the entry (I, J) is the probability of transition from state I to state J. In addition, such a matrix must be stochastic, that is, each of its rows (or columns, depending on convention) must add up to one. In such a matrix, each row is its own probability distribution.
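A sketch of such a matrix with NumPy; the specific values below are illustrative assumptions, chosen only so that every row is a valid distribution:

```python
import numpy as np

# A 3-state row-stochastic transition matrix (values are assumed)
P = np.array([
    [0.2, 0.6, 0.2],   # transitions out of state 0
    [0.3, 0.0, 0.7],   # transitions out of state 1
    [0.5, 0.0, 0.5],   # transitions out of state 2
])

# Stochastic check: every row sums to one
print(np.allclose(P.sum(axis=1), 1.0))  # True
```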

General view of a Markov chain with states in the form of circles and edges in the form of transitions.

An example transition matrix with three possible states.

The Markov chain also has an initial state vector, represented as an N x 1 matrix. It describes the probability distribution of starting in each of the N possible states: entry I gives the probability of the chain starting in state I.

These two structures are quite sufficient to represent a Markov chain.

We've already discussed how to get the probability of a transition from one state to another, but what about getting that probability over several steps? For this we need the probability of moving from state I to state J in M steps. It is actually very simple: it is the entry (I, J) of the matrix obtained by raising the transition matrix P to the power M. For small values of M this can be done by hand with repeated multiplication. For large values of M, if you are familiar with linear algebra, a more efficient way to raise a matrix to a power is to first diagonalize it.
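The M-step rule can be checked numerically. Reusing the weather example (with the assumed symmetric sunny row), the probability of going from cloudy to cloudy in 3 days is the (0, 0) entry of P cubed:

```python
import numpy as np

# Weather transition matrix; the second row is an assumed symmetric case
P = np.array([[0.75, 0.25],
              [0.25, 0.75]])

M = 3
PM = np.linalg.matrix_power(P, M)

# Probability of cloudy -> cloudy in exactly 3 steps
print(PM[0, 0])  # 0.5625

# The M-step matrix is still stochastic: rows sum to 1
print(np.allclose(PM.sum(axis=1), 1.0))  # True
```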

Markov chain: conclusion

Now, knowing what a Markov chain is, you can easily implement it in any programming language. Simple Markov chains are the foundation for studying more complex modeling methods.