The best speech synthesizers online.

Do you crave information, but your brain is tired of taking it in visually? Do you have no concentration left at the end of the workday? Or maybe you're just too lazy to read?

Before eradicating laziness in yourself, it would not be a bad idea to make friends with it. It is, in fact, a faithful and wise companion which, together with a couple of brain impulses, can teach you many of life's subtleties and show you how to go around not only big mountains but even small hillocks. How do you live without stress? Laziness knows the answer to this question for every individual situation.

For example, so that a person does not strain his eyes, speech synthesis engines have been invented - artificial intelligence that can convert text into spoken speech and, conversely, speech into text.

The recipe for perceiving any printed information by ear is simple: install on your computer one of the programs for reading text aloud, such as Govorilka or Balabolka, supplement it with a Russian-language voice engine with speakers like Alena, Nikolai, Olga or Katerina, copy the text into the program and start playback of the artificial speech. But even this recipe can be simplified. You can listen to publications on websites in a couple of clicks by installing a special browser extension designed to convert text to speech.

The SpeakIt! extension for Chrome: a speech synthesizer inside the browser for those too lazy to read

The SpeakIt! extension for Chromium-based browsers can read text in more than 50 languages, including Russian. Russian-language voice engines are already built into it, so no additional software components need to be installed. All you need to do, overcoming your laziness for a couple of minutes, is install the SpeakIt! extension from the Google Chrome store. It is completely free.

After installation, SpeakIt! embeds itself in the browser toolbar as a button with a speaker icon. A left click opens the extension's mini-interface with buttons to start and stop speech playback.

A right click on the SpeakIt! button opens a context menu where we need to select the “Options” command.

Here, in the extension settings, we can use the drop-down lists to choose a voice engine other than the preinstalled one with a Russian-speaking announcer, and we can select a female or male voice. We can also adjust the playback volume and speed by dragging the sliders of the corresponding options.

The choice between a male and a female Russian-speaking announcer is available only for the iSpeech voice engine. The SpeakIt! engine can read only in a female voice. And the native engine can safely be used by those who are accustomed to the velvety voice of Nikolai from Digalo. Enter any phrase into the “Test” field and try out several voice engines and speakers. This will help everyone choose the optimal speech playback for themselves. If the online publication is in English, test the English-language voice engines.

Well, we’ve decided on the extension settings, so let’s proceed directly to playing back Internet publications in an artificial voice. On the web page of the publication you are interested in, select the text you want to have read aloud, then either open the context menu and click the SpeakIt! item, or click the extension’s button on the browser toolbar.

Today we want to talk about an interesting scenario that can certainly be useful in the field of e-commerce. It's about customer service automation, namely:

  1. The client calls the online store and is asked to enter the order number;
  2. The digits entered by the caller via DTMF are passed to the AGI script;
  3. Using the order number, we build an SQL query to the database where we store information about orders. From the corresponding table we get the order status and customer name;
  4. We form the string that needs to be spoken to the client and send it to the Yandex.SpeechKit API for audio generation (TTS technology - text to speech);
  5. We receive the audio file from Yandex, convert it into the format we need (.wav, 8 kHz) and play it back to the client;
  6. We delete the played file and end the call.

In our opinion, this is an interesting piece of automation. Shall we start setting it up? :)

Obtaining a Yandex.SpeechKit API token

To let you get acquainted with the technology, Yandex provides a free trial period of 1 month from the moment the first request is sent. After that, to continue using Yandex SpeechKit Cloud, you need to sign a contract. Details of the terms of use are available from Yandex.

First of all, go to the developer’s account using the link https://developer.tech.yandex.ru and click Get the key:

  • Key name - enter a name for the key, for example, Asterisk + TTS;
  • Connection - select SpeechKit Cloud from the list;

Remember the value highlighted in red in the screenshot above - this is your token. Now let's move on to setting up the AGI script.
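Before wiring anything into Asterisk, the key can be sanity-checked with a few lines of PHP. This is just a sketch: the request parameters mirror those used in the AGI script below, and "your_token" and the sample phrase are placeholders you replace with your own values.

<?php
// Quick sanity check of the SpeechKit Cloud key outside of Asterisk.
// Replace "your_token" with the key obtained above.
$qs = http_build_query(array(
    "format"  => "wav",
    "lang"    => "ru-RU",
    "speaker" => "jane",
    "key"     => "your_token",
    "text"    => "Проверка синтеза речи"
));
$audio = file_get_contents("https://tts.voicetech.yandex.net/generate?" . $qs);
if ($audio === false) {
    die("Request failed - check the key and network connectivity\n");
}
file_put_contents("test.wav", $audio);
echo "Saved " . strlen($audio) . " bytes to test.wav\n";

If test.wav plays back correctly, the token works and you can proceed.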

Create a table with orders

Let's create an SQL table in which we will store order data. For laboratory purposes, we will deploy it on the same host as the Asterisk IP-PBX (this also reduces delay and processing time). So, enter the following commands in the server console (connect via SSH first):

USE asteriskcdrdb;
CREATE TABLE zakazy (name varchar(20), phone varchar(20), nomerzakaza varchar(20), status varchar(20));
INSERT INTO zakazy (name, phone, nomerzakaza, status) VALUES ("Alexander", "79257777777", "300388", "Sent");
INSERT INTO zakazy (name, phone, nomerzakaza, status) VALUES ("Ivan", "79251111111", "476656", "Paid");
INSERT INTO zakazy (name, phone, nomerzakaza, status) VALUES ("Sergey", "79252222222", "0089822", "Delivered");

We have created and filled the table. Now we need to create a user with SELECT access to this table:

CREATE USER "mysql_login"@"localhost" IDENTIFIED BY "mysql_password";
GRANT SELECT ON asteriskcdrdb.zakazy TO "mysql_login"@"localhost";

Remember your username and password and proceed to the next step - adapting the AGI script. As usual, code comments come after the double slash //:

AGI script

Below is the structure of the script:

#!/usr/bin/php -q
<?php
require('phpagi.php'); // connect the phpagi library and create the AGI object;
$agi = new AGI();
$result = $agi->get_data("custom/generate", 6000, 10); // accept DTMF from the client;
$number = $result["result"]; // record the order number entered by the client into a variable;
$hostname = "localhost"; // we have localhost. You may have the IP address of the server where the order database is stored (in that case allow remote access to the database in advance);
$username = "mysql_login"; // login that you created earlier;
$password = "mysql_password"; // the password you created earlier;
$dbName = "asteriskcdrdb";
mysql_connect($hostname, $username, $password) OR DIE("Can't create connection ");
mysql_select_db($dbName) or die(mysql_error());
$query = "SELECT * FROM zakazy WHERE `nomerzakaza`='" . mysql_real_escape_string($number) . "';"; // connect and look up data by order number;
$res = mysql_query($query) or die(mysql_error());
while ($row = mysql_fetch_assoc($res)) {
    $status = $row["status"];
    $name = $row["name"]; // write the name and status obtained from SQL to variables;
}
$str = "Dear " . $name . "! Your order status is " . $status . " Thank you for contacting us, all the best!"; // form the string that needs to be synthesized;
$qs = http_build_query(array("format" => "wav", "lang" => "ru-RU", "speaker" => "jane", "key" => "your_token", "emotion" => "good", "text" => $str)); // describe the parameters that will be sent to the Yandex API. You can adjust the file format, locale, speaker (male or female voices) and emotional coloring. Replace "your_token" with the key received from the Yandex SpeechKit Cloud API;
$ctx = stream_context_create(array("http" => array("method" => "GET", "header" => "Referer: \r\n")));
$soundfile = file_get_contents("https://tts.voicetech.yandex.net/generate?" . $qs, false, $ctx);
$file = fopen("file1.wav", "w");
fwrite($file, $soundfile);
fclose($file); // get the audio file (save it as file1.wav);
shell_exec("sox -t raw -r 48k -e signed-integer -b 16 -c 1 file1.wav -t wav -r 8k -c 1 /var/lib/asterisk/sounds/ru/custom/output1.wav"); // convert the audio to the format required by Asterisk and copy it to the /var/lib/asterisk/sounds/ru/custom/ directory;
shell_exec("chown asterisk:asterisk /var/lib/asterisk/sounds/ru/custom/output1.wav");
shell_exec("chmod 775 /var/lib/asterisk/sounds/ru/custom/output1.wav"); // give the file the necessary permissions;
$agi->exec("Playback", "custom/output1"); // tell AGI to play the received audio file;
shell_exec("rm -f /var/lib/asterisk/sounds/ru/custom/output1.wav");
shell_exec("rm -f file1.wav"); // delete both files;
?>

Download AGI script

After downloading the file, save it with the extension .php

Save the script under the name tts.php in the /var/lib/asterisk/agi-bin directory and issue the following commands to the server console:

dos2unix /var/lib/asterisk/agi-bin/tts.php
chown asterisk:asterisk /var/lib/asterisk/agi-bin/tts.php
chmod 775 /var/lib/asterisk/agi-bin/tts.php

Adapting the functionality for production

So, first of all, open the /etc/asterisk/extensions_custom.conf file for editing and add the following entry to it:

exten => s,1,Answer()
exten => s,2,AGI(tts.php)
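Keep in mind that entries in extensions_custom.conf live inside a named context; the name below (custom-tts) is only an example, and whichever name you choose is the one you will point the FreePBX Custom Destination at in the next step:

[custom-tts]
exten => s,1,Answer()
exten => s,2,AGI(tts.php)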

Very good. Now let's route a call into this custom context from FreePBX. To do this, we will use the Custom Destinations module. Go to Admin -> Custom Destinations and click Add Destination:


Click Submit and Apply Config. We want the client to be able to find out the status of their order from the main IVR menu by pressing 4. Go to the main IVR and add the following in the IVR Entries section:

Ready. If something doesn’t work out, write to us in the comments, we’ll try to help :)


At Yet another Conference 2013, we presented our new library Yandex SpeechKit. This is a public API for speech recognition that can be used by Android and iOS developers. You can download SpeechKit and also read the documentation.

Yandex SpeechKit allows you to directly access the backend that is successfully used in Yandex mobile applications. We have been developing this system for quite a long time, and we now correctly recognize 94% of words in Navigator and Mobile Maps and 84% of words in the Mobile Browser. Recognition takes a little over a second. This is already very decent quality, and we are actively working to improve it.

It is safe to say that in the near future voice interfaces will be practically indistinguishable in reliability from classical input methods. A detailed story about how we managed to achieve these results and how our system works is below the cut.

Speech recognition is one of the most interesting and complex tasks in artificial intelligence. It draws on achievements from a wide range of fields: from computational linguistics to digital signal processing. To understand how a machine that understands speech should be built, let's first figure out what we are dealing with.

I. Basics
For us, spoken speech is, first of all, a digital signal. And if we look at the recording of this signal, we will see neither words nor clearly pronounced phonemes: different "speech events" smoothly flow into one another without forming clear boundaries. The same phrase spoken by different people or in different environments will look different at the signal level. At the same time, people somehow recognize each other's speech, which means there are invariants from which it is possible to reconstruct, based on the signal, what was actually said. Finding such invariants is the task of acoustic modeling.

Let's assume that human speech consists of phonemes (this is a gross simplification, but to a first approximation it is correct). Let's define a phoneme as the minimal meaningful unit of language, that is, a sound whose replacement can change the meaning of a word or phrase. Let's take a small section of the signal, say 25 milliseconds. Let's call this section a "frame". What phoneme was spoken in this frame? It is difficult to answer this question unambiguously - many phonemes are extremely similar to each other. But if an unambiguous answer is impossible, one can reason in terms of "probabilities": for a given signal, some phonemes are more probable, others are less probable, and still others can be excluded from consideration altogether. In essence, an acoustic model is a function that takes a small section of the acoustic signal (a frame) as input and produces a probability distribution over phonemes for this frame. Thus, the acoustic model allows us to reconstruct, with varying degrees of confidence, what was said from how it sounds.
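To make the idea concrete, here is a toy PHP sketch (not Yandex's actual model) of the acoustic model's interface: a frame's feature vector goes in, a probability distribution over phonemes comes out. All numbers are invented.

<?php
// Toy illustration only: a real acoustic model is a trained statistical model,
// not a hand-written function, but its interface is the same.
function acoustic_model(array $frameFeatures): array {
    // Pretend scores derived from the features; normalized to probabilities.
    $scores = array("a" => 2.0, "o" => 1.5, "d" => 0.3, "t" => 0.2);
    $sum = array_sum($scores);
    $probs = array();
    foreach ($scores as $phoneme => $score) {
        $probs[$phoneme] = $score / $sum;
    }
    return $probs; // e.g. ["a" => 0.5, "o" => 0.375, "d" => 0.075, "t" => 0.05]
}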

Another important aspect of acoustics is the probability of transitions between different phonemes. We know from experience that some combinations of phonemes are easy to pronounce and occur frequently, while others are harder to pronounce and are used less often in practice. We can summarize this information and take it into account when assessing the "plausibility" of a particular sequence of phonemes.

Now we have all the tools to construct one of the main "workhorses" of automatic speech recognition - the hidden Markov model (HMM). To do this, let's imagine for a moment that we are not solving the problem of speech recognition, but the exact opposite - converting text into speech. Say we want to get the pronunciation of the word "Yandex". Let the word "Yandex" consist of a set of phonemes, say [y][a][n][d][e][k][s]. Let's build a finite state machine for the word "Yandex" in which each phoneme is represented by a separate state. At each moment of time we are in one of these states and "pronounce" the sound characteristic of that phoneme (we know how each phoneme sounds thanks to the acoustic model). But some phonemes last a long time (like [a] in the word "Yandex"), while others are practically swallowed. This is where information about the probability of transitions between phonemes comes in handy. Having generated a sound corresponding to the current state, we make a probabilistic decision: either we remain in the same state or we move on to the next one (and, accordingly, the next phoneme).

More formally, an HMM can be described as follows. First, let's introduce the concept of emissions. As we remember from the previous example, each HMM state "generates" a sound characteristic of that particular state (that is, phoneme). On each frame the sound is "drawn" from the probability distribution corresponding to the given phoneme. Second, transitions are possible between states, also governed by predetermined probabilistic patterns. For example, the probability that the phoneme [a] will "stretch" is high, which cannot be said about the phoneme [d]. The emission matrix and the transition matrix uniquely define the hidden Markov model.
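As an illustration, the HMM for the toy word "Yandex" can be written down as two tables: transition probabilities between phoneme states and, for each state, an emission distribution. All numbers below are invented.

<?php
// Invented numbers for illustration: an HMM is fully defined by its
// transition matrix and its emission distributions.
// Probability of staying in the same phoneme versus moving to the next one;
// long phonemes like [a] get a high self-transition probability.
$transitions = array(
    "y" => array("y" => 0.3, "a" => 0.7),
    "a" => array("a" => 0.8, "n" => 0.2),
    "n" => array("n" => 0.4, "d" => 0.6),
    "d" => array("d" => 0.2, "e" => 0.8),
    "e" => array("e" => 0.7, "k" => 0.3),
    "k" => array("k" => 0.3, "s" => 0.7),
    "s" => array("s" => 0.5, "END" => 0.5),
);

// Emission model: for each state, the probability of observing a given
// acoustic event on one frame (in reality this is a continuous density).
$emissions = array(
    "a" => array("loud_vowel" => 0.9, "noise" => 0.1),
    "s" => array("loud_vowel" => 0.05, "noise" => 0.95),
    // ... and so on for the remaining states
);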

Okay, we've looked at how a hidden Markov model can be used to generate speech, but how do we apply it to the inverse problem of speech recognition? The Viterbi algorithm comes to the rescue. We have a set of observed quantities (the sound itself) and a probabilistic model that relates hidden states (phonemes) to observed quantities. The Viterbi algorithm allows us to recover the most probable sequence of hidden states.

Suppose there are only two words in our recognition dictionary: "yes" ([d][a]) and "no" ([n'][e][t]). Thus, we have two hidden Markov models. Next, suppose we have a recording of a user's voice saying "yes" or "no". The Viterbi algorithm will allow us to answer the question of which recognition hypothesis is more likely.

Now our task comes down to recovering the most probable sequence of states of the hidden Markov model that "generated" (more precisely, could have generated) the audio recording presented to us. If the user said "yes", then the corresponding sequence of states over 10 frames could be, for example, [d][d][d][d][a][a][a][a][a][a] or [d][a][a][a][a][a][a][a][a][a]. Likewise, various pronunciations are possible for "no" - for example, [n'][n'][n'][e][e][e][e][t][t][t] and [n'][n'][e][e][e][e][e][e][t][t]. Now we will find the "best", that is, the most probable, way of pronouncing each word. On each frame we will ask our acoustic model how likely it is that a specific phoneme sounds here (for example, [d] and [a]); in addition, we will take into account the transition probabilities ([d]->[d], [d]->[a], [a]->[a]). This way we will obtain the most probable way of pronouncing each of the hypothesis words; moreover, for each of them we will get a measure of how likely it is that this particular word was pronounced (we can think of this measure as the length of the shortest path through the corresponding graph). The "winning", that is, more probable, hypothesis is returned as the recognition result.
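A minimal Viterbi sketch in PHP for exactly this two-word toy example. All probabilities are invented for illustration; a real decoder works in log space and over senones, not whole phonemes.

<?php
// Toy Viterbi decoder: given per-frame acoustic probabilities, find the most
// likely state path through a word's HMM and return that path's probability.
function viterbi(array $states, array $trans, array $start, array $frames): float {
    $prev = array();
    foreach ($states as $s) {
        // Probability of starting in state $s and emitting the first frame.
        $prev[$s] = ($start[$s] ?? 0.0) * ($frames[0][$s] ?? 0.0);
    }
    for ($t = 1; $t < count($frames); $t++) {
        $cur = array();
        foreach ($states as $s) {
            $best = 0.0;
            foreach ($states as $p) {
                $cand = $prev[$p] * ($trans[$p][$s] ?? 0.0);
                if ($cand > $best) $best = $cand;
            }
            $cur[$s] = $best * ($frames[$t][$s] ?? 0.0);
        }
        $prev = $cur;
    }
    return max($prev); // probability of the best path
}

// Two hypotheses: "yes" = [d][a], "no" = [n'][e][t].
$yes = array("d", "a");
$no  = array("n'", "e", "t");
$transYes = array("d" => array("d" => 0.4, "a" => 0.6), "a" => array("a" => 1.0));
$transNo  = array("n'" => array("n'" => 0.4, "e" => 0.6),
                  "e"  => array("e" => 0.6, "t" => 0.4),
                  "t"  => array("t" => 1.0));

// Acoustic model output for 4 frames: probability of each phoneme per frame.
$frames = array(
    array("d" => 0.70, "a" => 0.10, "n'" => 0.10, "e" => 0.05, "t" => 0.05),
    array("d" => 0.30, "a" => 0.50, "n'" => 0.10, "e" => 0.05, "t" => 0.05),
    array("d" => 0.10, "a" => 0.70, "n'" => 0.05, "e" => 0.10, "t" => 0.05),
    array("d" => 0.05, "a" => 0.80, "n'" => 0.05, "e" => 0.05, "t" => 0.05),
);

$pYes = viterbi($yes, $transYes, array("d" => 1.0), $frames);
$pNo  = viterbi($no,  $transNo,  array("n'" => 1.0), $frames);
echo $pYes > $pNo ? "Recognized: yes\n" : "Recognized: no\n";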

The Viterbi algorithm is quite simple to implement (dynamic programming is used) and runs in time proportional to the product of the number of HMM states and the number of frames. However, knowing the most likely path is not always enough; for example, when training an acoustic model you need to estimate the probability of each state on each frame. For this, the Forward-Backward algorithm is used.

However, the acoustic model is just one component of the system. What do we do if the recognition dictionary consists not of two words, as in the example above, but of hundreds of thousands or even millions? Many of them will be very similar in pronunciation or even identical. At the same time, in the presence of context, the role of acoustics decreases: slurred, noisy or ambiguous words can be restored "by meaning". Probabilistic models are again used to take context into account. For example, a native Russian speaker understands that the naturalness (in our case, the probability) of the sentence "mom washed the frame" is higher than that of "mom washed the cyclotron". That is, a fixed context "mom washed..." defines a probability distribution over the next word, which reflects both semantics and morphology. Language models of this type are called n-gram language models (trigrams in the example above); of course, there are much more complex and powerful ways to model a language.
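A trigram model in miniature simply estimates the probability of the next word from the two preceding ones. The counts below are illustrative only; real models are trained on huge corpora and apply smoothing.

<?php
// Illustrative counts only: P(next | two previous words) =
// count(w1 w2 next) / count(w1 w2).
$trigramCounts = array(
    "mom washed" => array("the frame" => 90, "quickly" => 9, "the cyclotron" => 1),
);

function next_word_prob(array $counts, string $context, string $word): float {
    if (!isset($counts[$context])) {
        return 0.0;
    }
    $total = array_sum($counts[$context]);
    return ($counts[$context][$word] ?? 0) / $total;
}

echo next_word_prob($trigramCounts, "mom washed", "the frame"), "\n";     // 0.9
echo next_word_prob($trigramCounts, "mom washed", "the cyclotron"), "\n"; // 0.01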

II. What's under the hood of Yandex ASR?
Now that we have an idea of the general structure of speech recognition systems, we will describe in more detail the details of Yandex's technology - the best, according to our data, Russian speech recognition system.
When considering the toy examples above, we deliberately made several simplifications and omitted a number of important details. In particular, we claimed that the basic "building block" of speech is the phoneme. In fact, the phoneme is too large a unit; to adequately model the pronunciation of a single phoneme, three separate states are used - the beginning, middle and end of the phoneme. Together they form the same kind of HMM as presented above. In addition, phonemes are position-dependent and context-dependent: formally, the "same" phoneme sounds significantly different depending on what part of the word it is in and which phonemes are adjacent to it. At the same time, a simple enumeration of all possible variants of context-dependent phonemes yields a very large number of combinations, many of which never occur in real life; to keep the number of acoustic events considered reasonable, similar context-dependent phonemes are merged at the early stages of training and considered together.
Thus, firstly, we made the phonemes context-dependent, and secondly, we divided each of them into three parts. These objects - "parts of phonemes" - now make up our phonetic alphabet. They are also called senones. Each state of our HMM is a senone. Our model uses 48 phonemes and about 4000 senones.

So, our acoustic model still takes sound as input and outputs a probability distribution over senones. Now let's look at what exactly is fed to the input. As we said, the sound is cut into 25 ms sections ("frames"). Typically the slicing step is 10 ms, so that adjacent frames partially overlap. It is clear that "raw" sound - the amplitude of oscillations over time - is not the most informative representation of an acoustic signal. The spectrum of this signal is much better. In practice, a logarithmic spectrum on a scale that matches the laws of human auditory perception (the mel scale) is usually used. The resulting values are subjected to a discrete cosine transform (DCT), and the result is the MFCC - Mel Frequency Cepstral Coefficients. (The word "cepstral" is obtained by rearranging the letters in "spectral", reflecting the presence of the additional DCT.) The MFCC is a vector of (usually) 13 real numbers. It can be used as input to an acoustic model in raw form, but more often it undergoes many additional transformations.
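The framing step itself is simple; here is a sketch of cutting a 16 kHz signal into 25 ms frames with a 10 ms step. The MFCC extraction proper (mel filter bank plus DCT) is omitted, and the sample rate is an assumption for the example.

<?php
// Cut a mono signal (an array of samples) into overlapping frames:
// 25 ms windows taken every 10 ms, as described above.
function split_into_frames(array $samples, int $sampleRate = 16000): array {
    $frameLen = (int)($sampleRate * 0.025); // 400 samples at 16 kHz
    $step     = (int)($sampleRate * 0.010); // 160 samples at 16 kHz
    $frames = array();
    for ($start = 0; $start + $frameLen <= count($samples); $start += $step) {
        $frames[] = array_slice($samples, $start, $frameLen);
    }
    return $frames;
}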

Training an acoustic model is a complex and multi-step process. Algorithms from the Expectation-Maximization family, such as the Baum-Welch algorithm, are used for training. The essence of algorithms of this kind is the alternation of two steps: at the Expectation step, the existing model is used to calculate the expectation of the likelihood function; at the Maximization step, the model parameters are changed so as to maximize this estimate. At the early stages of training, simple acoustic models are used: simple MFCC features are given as input, phonemes are considered without context dependence, and a mixture of Gaussians with diagonal covariance matrices (diagonal GMMs - Gaussian Mixture Models) is used to model the emission probability in the HMM. The results of each previous acoustic model are the starting point for training a more complex model with a more complex input, output or emission probability distribution function. There are many ways to improve an acoustic model, but the most significant effect comes from switching from a GMM to a DNN (Deep Neural Network), which improves recognition quality almost twofold. Neural networks are free from many of the limitations of Gaussian mixtures and have better generalization ability. In addition, acoustic models based on neural networks are more resistant to noise and have better performance.

The neural network for acoustic modeling is trained in several stages. First the network is initialized using a stack of Restricted Boltzmann Machines (RBMs). An RBM is a stochastic neural network that is trained without supervision. Although the weights it learns cannot be used directly to distinguish between classes of acoustic events, they reflect the structure of speech in detail. You can think of the RBM as a feature extractor: the resulting generative model turns out to be an excellent starting point for building a discriminative model. The discriminative model is trained using the classic backpropagation algorithm, with a number of techniques applied to improve convergence and prevent overfitting. As a result, the network's input is several frames of MFCC features (the central frame is the one being classified, the rest form the context), and the output is about 4000 neurons corresponding to the various senones. This neural network is used as the acoustic model in the production system.

Let's take a closer look at the decoding process. For the task of recognizing spontaneous speech with a large dictionary, the approach described in the first section is not applicable. A data structure is needed that ties together all the possible sentences that the system can recognize. A suitable structure is a weighted finite-state transducer (WFST) - essentially just a finite state machine with an output tape and weights on the edges. The input of this machine is senones, the output is words. The decoding process comes down to choosing the best path through this machine and producing the output sequence of words corresponding to that path. The cost of traversing each arc consists of two components. The first component is known in advance and is calculated at the stage of assembling the machine. It includes the cost of pronunciation, of the transition to the given state, and the likelihood estimate from the language model. The second component is calculated separately for a specific frame: this is the acoustic weight of the senone corresponding to the input symbol of the arc in question. Decoding happens in real time, so not all possible paths are examined: special heuristics limit the set of hypotheses to the most probable ones.

Of course, the most interesting part from a technical point of view is the construction of such an automaton. This problem is solved offline. To go from simple HMMs for each context-dependent phoneme to linear automata for each word, we need a pronunciation dictionary. Creating such a dictionary by hand is impossible, so machine learning methods are used here (in the scientific community the task itself is called Grapheme-To-Phoneme, or G2P). In turn, the words are "joined" with each other into a language model, also represented as a finite state machine. The central operation here is WFST composition, but various methods of optimizing WFSTs for size and storage efficiency are also important.

The result of the decoding process is a list of hypotheses that can be further processed. For example, you can use a more powerful language model to rerank the most likely hypotheses. The resulting list is returned to the user, sorted by the confidence value - the degree to which we are confident that the recognition was performed correctly. Often there is only one hypothesis left, in which case the client application immediately proceeds to execute the voice command.

In conclusion, let us touch on the question of quality metrics for speech recognition systems. The most popular metric is Word Error Rate (and its inverse, Word Accuracy). Essentially, it reflects the proportion of incorrectly recognized words. To calculate the Word Error Rate of a speech recognition system, hand-labeled corpora of voice queries matching the subject area of the application that uses speech recognition are used.
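Word Error Rate is computed as the word-level edit distance between the recognized text and the reference transcript, divided by the number of reference words. A small sketch (the sample sentences are arbitrary):

<?php
// WER = (substitutions + insertions + deletions) / number of reference words,
// computed with a word-level edit distance (dynamic programming).
function word_error_rate(string $reference, string $hypothesis): float {
    $ref = preg_split('/\s+/', trim($reference));
    $hyp = preg_split('/\s+/', trim($hypothesis));
    $n = count($ref);
    $m = count($hyp);
    $d = array();
    for ($i = 0; $i <= $n; $i++) $d[$i][0] = $i;
    for ($j = 0; $j <= $m; $j++) $d[0][$j] = $j;
    for ($i = 1; $i <= $n; $i++) {
        for ($j = 1; $j <= $m; $j++) {
            $cost = ($ref[$i - 1] === $hyp[$j - 1]) ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,        // deletion
                $d[$i][$j - 1] + 1,        // insertion
                $d[$i - 1][$j - 1] + $cost // substitution or match
            );
        }
    }
    return $d[$n][$m] / max($n, 1);
}

echo word_error_rate("turn on the light in the kitchen", "turn off the light in kitchen");
// 2 errors against 7 reference words -> about 0.29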

SpeechKit Cloud is a service that gives developers access to Yandex's speech recognition and synthesis technologies. Integration is implemented using the Yandex TTS module, available through the MajorDoMo Add-ons Market.

The installation and configuration procedure is very simple and is completed in several steps.

1. Go to Control Panel

2. Go to the Add-ons Market

3. Go to the "Interaction" section

4. Add a module to the MajorDomo system - Control Panel - Add-ons Market - Interaction - Yandex TTS - Add:

5. The system will inform us about successful installation and redirect to the “Add-on Market” page:

6. To further configure the module, you need a Yandex API key, which can be obtained for free in the developer's account using an existing Yandex account:

7. Assign a name to the key being created and click SpeechKit Cloud:

8. Fill in the required fields with data and click the “Submit” button:

9. If everything was done correctly, the generated API key will appear in the list on the right, which must be copied to the clipboard:

10. Open the settings of the Yandex TTS module (MajorDoMo - Control Panel - Applications - Yandex TTS), paste the key copied in the previous step into the API-key field, select the voice and mood, and make sure the module is activated:

11. Setup complete!

Attention! The trial Yandex API key is issued for 1 month, after which the system will stop voicing new (non-cached) phrases. To obtain a permanent key, you must send Yandex a letter requesting that the key be converted to a permanent one.


Many of you have probably had the opportunity to control a computer or smartphone using your voice. When you tell Navigator “Let's go to Gogol, 25” or say a search query in the Yandex application, speech recognition technology converts your voice into a text command. But there is also the opposite task: to turn the text that the computer has at its disposal into voice.

If the set of texts that need to be voiced is relatively small and the same expressions occur in them - as, for example, in announcements about the departure and arrival of trains at a station - it is enough to invite a speaker, record the necessary words and phrases in the studio, and then assemble a message from them. However, this approach does not work with arbitrary texts. This is where speech synthesis technology comes in handy.

Yandex uses speech synthesis technology from the Yandex Speechkit complex to voice texts. For example, it allows you to find out how foreign words and phrases are pronounced in the Translator. Thanks to speech synthesis, Autopoet also received his own voice.

Preparing the text

The problem of speech synthesis is solved in several stages. First, a special algorithm prepares the text so that it is convenient for the robot to read: it writes out all the numbers in words and expands abbreviations. Then the text is divided into phrases, that is, into word groups with continuous intonation - for this the computer relies on punctuation marks and set constructions. A phonetic transcription is compiled for all the words.
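The number-expansion part of this step can be illustrated with PHP's intl extension, which can spell numbers out in words. This is only a sketch of one normalization sub-step; the locale and sample phrase are arbitrary, and the real pipeline (abbreviations, phrase splitting, transcription) is far more involved.

<?php
// Illustration of the number-expansion step using the intl extension
// (requires ext-intl).
$speller = new NumberFormatter('ru_RU', NumberFormatter::SPELLOUT);

$text = 'Поедем на Гоголя, 25';
$normalized = preg_replace_callback('/\d+/u', function ($m) use ($speller) {
    return $speller->format((int)$m[0]);
}, $text);

echo $normalized, "\n"; // Поедем на Гоголя, двадцать пять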

To understand how to read a word and where to put the stress in it, the robot first turns to the classic, hand-compiled dictionaries that are built into the system. If the required word is not in the dictionary, the computer builds the transcription on its own, based on rules borrowed from academic reference books. Finally, if the usual rules turn out to be insufficient - and this happens, because any living language is constantly changing - it uses statistical rules. If the word appeared in the corpus of training texts, the system will remember which syllable the speakers usually stressed in it.

Pronunciation and intonation

When the transcription is ready, the computer calculates how long each phoneme will sound, that is, how many frames it contains - frames are what the 25-millisecond fragments are called. Then each frame is described by many parameters: which phoneme it is part of and what place it occupies in it; which syllable this phoneme belongs to; if it is a vowel, whether it is stressed; what place it occupies in the syllable; the syllable in the word; the word in the phrase; what punctuation marks stand before and after this phrase; what place the phrase occupies in the sentence; and finally, what sign is at the end of the sentence and what its main intonation is.
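As a rough illustration, the description of one frame can be thought of as a structure like the one below. The field names are invented for this sketch; the real feature set is richer.

<?php
// Invented field names - just to show the kind of context attached to every
// 25 ms frame before it is handed to the synthesis acoustic model.
$frameDescription = array(
    'phoneme'             => 'a',
    'position_in_phoneme' => 2,    // which frame of this phoneme it is
    'syllable'            => 'ra',
    'stressed'            => true, // only meaningful for vowels
    'syllable_in_word'    => 2,
    'word_in_phrase'      => 3,
    'punctuation_before'  => null,
    'punctuation_after'   => ',',
    'phrase_in_sentence'  => 1,
    'sentence_final_mark' => '?',  // determines the main intonation
);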

In other words, a lot of data is used to synthesize every 25 milliseconds of speech. Information about the immediate environment ensures a smooth transition from frame to frame and from syllable to syllable, and information about the phrase and sentence as a whole is needed to create the correct intonation of synthesized speech.

An acoustic model is used to read the prepared text. It differs from the acoustic model used in speech recognition. In the case of recognition, the model must establish a correspondence between sounds with certain characteristics and phonemes. In the case of synthesis, the acoustic model must, on the contrary, create descriptions of sounds based on the descriptions of the frames.

How does the acoustic model know how to pronounce a phoneme correctly, or how to give the correct intonation to an interrogative sentence? It learns from texts and sound files. For example, you can feed it an audiobook and the corresponding text. The more data a model learns from, the better its pronunciation and intonation.

Finally, about the voice itself. What makes our voices recognizable is, first of all, the timbre, which depends on the structural features of each person's speech apparatus. The timbre of your voice can be modeled, that is, its characteristics can be described - to do this, it is enough to read a small corpus of texts in a studio. After that, data about your timbre can be used to synthesize speech in any language, even one you don't know. When the robot needs to say something to you, it uses a sound wave generator - a vocoder. Information about the frequency characteristics of the phrase received from the acoustic model is loaded into it, along with data about the timbre, which gives the voice its recognizable coloring.

More information about the technologies from the Yandex SpeechKit complex can be found on this page or on a special resource. If you are a developer and want to test the cloud or mobile version of SpeechKit, the site dedicated to Yandex technologies will help you.

","instantArticle":"

Many of you have probably had the opportunity to control a computer or smartphone using your voice. When you tell Navigator “Let's go to Gogol, 25” or say a search query in the Yandex application, speech recognition technology converts your voice into a text command. But there is also the opposite task: to turn the text that the computer has at its disposal into voice.

If the set of texts that need to be voiced is relatively small and they contain the same expressions - as, for example, in announcements about the departure and arrival of trains at the station - it is enough to invite a speaker, record the necessary words and phrases in the studio, and then collect of them message. However, this approach does not work with arbitrary texts. This is where speech synthesis technology comes in handy.

Yandex uses speech synthesis technology from the Yandex Speechkit complex to voice texts. For example, it allows you to find out how foreign words and phrases are pronounced in the Translator. Thanks to speech synthesis, Autopoet also received his own voice.

Preparing the text

The problem of speech synthesis is solved in several stages. First, a special algorithm prepares the text so that it is convenient for the robot to read it: it writes all the numbers in words and expands the abbreviations. Then the text is divided into phrases, that is, into phrases with continuous intonation - for this, the computer focuses on punctuation marks and stable structures. A phonetic transcription is compiled for all words.

To understand how to read a word and where to put the emphasis in it, the robot first turns to the classic, hand-compiled dictionaries that are built into the system. If the required word is not in the dictionary, the computer builds a transcription on its own, based on rules borrowed from academic reference books. Finally, if the usual rules are not enough - and this happens, because any living language is constantly changing - he uses statistical rules. If a word was found in the corpus of training texts, the system will remember which syllable the speakers usually emphasized in it.

Pronunciation and intonation

When the transcription is ready, the computer calculates how long each phoneme will sound, that is, how many frames it contains - this is what fragments 25 milliseconds long are called. Then each frame is described according to many parameters: what phoneme it is part of and what place it occupies in it; what syllable does this phoneme belong to? if it is a vowel, is it stressed; what place does it occupy in a syllable; syllable - in a word; word - in a phrase; what punctuation marks are there before and after this phrase; what place does the phrase occupy in a sentence; finally, what sign is at the end of the sentence and what is its main intonation.

In other words, a lot of data is used to synthesize every 25 milliseconds of speech. Information about the immediate environment ensures a smooth transition from frame to frame and from syllable to syllable, and information about the phrase and sentence as a whole is needed to create the correct intonation of synthesized speech.

An acoustic model is used to read the prepared text. It differs from the acoustic model, which is used in speech recognition. In the case of model recognition, it is necessary to establish a correspondence between sounds with certain characteristics and phonemes. In the case of synthesis, the acoustic model, on the contrary, must, based on the descriptions of frames, create descriptions of sounds.

How does an acoustic model know how to pronounce a phoneme correctly or give the correct intonation to an interrogative sentence? She learns from texts and sound files. For example, you can load an audiobook and the corresponding text into it. The more data a model learns from, the better its pronunciation and intonation.

Finally, about the voice itself. What makes our voices recognizable, first of all, is the timbre, which depends on the structural features of the organs of the speech apparatus in each person. The timbre of your voice can be modeled, that is, its characteristics can be described - to do this, it is enough to read a small corpus of texts in the studio. After this, data about your timbre can be used to synthesize speech in any language, even one you don’t know. When the robot needs to tell you something, it uses a sound wave generator - a vocoder. Information about the frequency characteristics of the phrase received from the acoustic model is loaded into it, as well as data about the timbre, which gives the voice a recognizable color.

More information about the technologies from the Yandex SpeechKit complex can be found on this page or on a special resource. If you are a developer and want to test the cloud or mobile version of SpeechKit, the site dedicated to Yandex technologies will help you.

"),,"proposedBody":("source":"

Many of you have probably had the opportunity to control a computer or smartphone using your voice. When you tell Navigator “Let's go to Gogol, 25” or say a search query in the Yandex application, speech recognition technology converts your voice into a text command. But there is also the opposite task: to turn the text that the computer has at its disposal into voice.

If the set of texts that need to be voiced is relatively small and they contain the same expressions - as, for example, in announcements about the departure and arrival of trains at the station - it is enough to invite a speaker, record the necessary words and phrases in the studio, and then collect of them message. However, this approach does not work with arbitrary texts. This is where speech synthesis technology comes in handy.

Yandex uses speech synthesis technology from the Yandex Speechkit complex to voice texts. For example, it allows you to find out how foreign words and phrases are pronounced in the Translator. Thanks to speech synthesis, Autopoet also received his own voice.

Preparing the text

The problem of speech synthesis is solved in several stages. First, a special algorithm prepares the text so that it is convenient for the robot to read it: it writes all the numbers in words and expands the abbreviations. Then the text is divided into phrases, that is, into phrases with continuous intonation - for this, the computer focuses on punctuation marks and stable constructions. A phonetic transcription is compiled for all words.

To understand how to read a word and where to put the emphasis in it, the robot first turns to the classic, hand-compiled dictionaries that are built into the system. If the required word is not in the dictionary, the computer builds a transcription on its own, based on rules borrowed from academic reference books. Finally, if the usual rules are not enough - and this happens, because any living language is constantly changing - he uses statistical rules. If a word was found in the corpus of training texts, the system will remember which syllable the speakers usually emphasized in it.

Pronunciation and intonation

When the transcription is ready, the computer calculates how long each phoneme will sound, that is, how many frames it contains - this is what fragments 25 milliseconds long are called. Then each frame is described according to many parameters: what phoneme it is part of and what place it occupies in it; what syllable does this phoneme belong to? if it is a vowel, is it stressed; what place does it occupy in a syllable; syllable - in a word; word - in a phrase; what punctuation marks are there before and after this phrase; what place does the phrase occupy in a sentence; finally, what sign is at the end of the sentence and what is its main intonation.

In other words, a lot of data is used to synthesize every 25 milliseconds of speech. Information about the immediate environment ensures a smooth transition from frame to frame and from syllable to syllable, and information about the phrase and sentence as a whole is needed to create the correct intonation of synthesized speech.

An acoustic model is used to read the prepared text. It differs from the acoustic model, which is used in speech recognition. In the case of model recognition, it is necessary to establish a correspondence between sounds with certain characteristics and phonemes. In the case of synthesis, the acoustic model, on the contrary, must, based on the descriptions of frames, create descriptions of sounds.

How does an acoustic model know how to pronounce a phoneme correctly or give the correct intonation to an interrogative sentence? She learns from texts and sound files. For example, you can load an audiobook and the corresponding text into it. The more data a model learns from, the better its pronunciation and intonation.

Finally, about the voice itself. What makes our voices recognizable, first of all, is the timbre, which depends on the structural features of the organs of the speech apparatus in each person. The timbre of your voice can be modeled, that is, its characteristics can be described - to do this, just read a small corpus of texts in the studio. After this, data about your timbre can be used to synthesize speech in any language, even one you don’t know. When the robot needs to tell you something, it uses a sound wave generator called a vocoder. Information about the frequency characteristics of the phrase received from the acoustic model is loaded into it, as well as data about the timbre, which gives the voice a recognizable color.

More information about the technologies from the Yandex SpeechKit complex can be found on this page or on a special resource. If you are a developer and want to test the cloud or mobile version of SpeechKit, the site dedicated to Yandex technologies will help you.

Many of you have probably had the opportunity to control a computer or smartphone using your voice. When you tell Navigator “Let's go to Gogol, 25” or say a search query in the Yandex application, speech recognition technology converts your voice into a text command. But there is also the opposite task: to turn the text that the computer has at its disposal into voice.

If the set of texts that need to be voiced is relatively small and they contain the same expressions - as, for example, in announcements about the departure and arrival of trains at the station - it is enough to invite a speaker, record the necessary words and phrases in the studio, and then collect of them message. However, this approach does not work with arbitrary texts. This is where speech synthesis technology comes in handy.

Yandex uses speech synthesis technology from the Yandex Speechkit complex to voice texts. For example, it allows you to find out how foreign words and phrases are pronounced in the Translator. Thanks to speech synthesis, Autopoet also received his own voice.

Preparing the text

The problem of speech synthesis is solved in several stages. First, a special algorithm prepares the text so that it is convenient for the robot to read it: it writes all the numbers in words and expands the abbreviations. Then the text is divided into phrases, that is, into phrases with continuous intonation - for this, the computer focuses on punctuation marks and stable structures. A phonetic transcription is compiled for all words.

To understand how to read a word and where to put the emphasis in it, the robot first turns to the classic, hand-compiled dictionaries that are built into the system. If the required word is not in the dictionary, the computer builds a transcription on its own, based on rules borrowed from academic reference books. Finally, if the usual rules are not enough - and this happens, because any living language is constantly changing - he uses statistical rules. If a word was found in the corpus of training texts, the system will remember which syllable the speakers usually emphasized in it.

Pronunciation and intonation

When the transcription is ready, the computer calculates how long each phoneme will sound, that is, how many frames it contains - this is what fragments 25 milliseconds long are called. Then each frame is described according to many parameters: what phoneme it is part of and what place it occupies in it; what syllable does this phoneme belong to? if it is a vowel, is it stressed; what place does it occupy in a syllable; syllable - in a word; word - in a phrase; what punctuation marks are there before and after this phrase; what place does the phrase occupy in a sentence; finally, what sign is at the end of the sentence and what is its main intonation.

In other words, a lot of data is used to synthesize every 25 milliseconds of speech. Information about the immediate environment ensures a smooth transition from frame to frame and from syllable to syllable, and information about the phrase and sentence as a whole is needed to create the correct intonation of synthesized speech.

An acoustic model is used to read the prepared text. It differs from the acoustic model, which is used in speech recognition. In the case of model recognition, it is necessary to establish a correspondence between sounds with certain characteristics and phonemes. In the case of synthesis, the acoustic model, on the contrary, must, based on the descriptions of frames, create descriptions of sounds.

How does an acoustic model know how to pronounce a phoneme correctly or give the correct intonation to an interrogative sentence? She learns from texts and sound files. For example, you can load an audiobook and the corresponding text into it. The more data a model learns from, the better its pronunciation and intonation.

Finally, about the voice itself. What makes our voices recognizable, first of all, is the timbre, which depends on the structural features of the organs of the speech apparatus in each person. The timbre of your voice can be modeled, that is, its characteristics can be described - to do this, it is enough to read a small corpus of texts in the studio. After this, data about your timbre can be used to synthesize speech in any language, even one you don’t know. When the robot needs to tell you something, it uses a sound wave generator - a vocoder. Information about the frequency characteristics of the phrase received from the acoustic model is loaded into it, as well as data about the timbre, which gives the voice a recognizable color.

More information about the technologies from the Yandex SpeechKit complex can be found on this page or on a special resource. If you are a developer and want to test the cloud or mobile version of SpeechKit, the site dedicated to Yandex technologies will help you.

","contentType":"text/html"),"authorId":"24151397","slug":"kak-eto-rabotaet-sintez-rechi","canEdit":false,"canComment":false," isBanned":false,"canPublish":false,"viewType":"minor","isDraft":false,"isOnModeration":false,"isOutdated":false,"isSubscriber":false,"commentsCount":55," modificationDate":"Tue Apr 03 2018 18:56:00 GMT+0000 (UTC)","isAutoPreview":false,"showPreview":true,"approvedPreview":("source":"

When you tell Navigator “Let's go to Gogol, 25” or say a search query out loud, speech recognition technology converts your voice into a text command. There is also the opposite task: turning text into voice. Sometimes it is enough to invite a speaker and simply write down the necessary words and phrases, but this will not work with arbitrary texts. This is where speech synthesis technology comes in handy.

","contentType":"text/html"),"proposedPreview":("source":"

When you tell Navigator “Let's go to Gogol, 25” or say a search query out loud, speech recognition technology converts your voice into a text command. There is also the opposite task: turning text into voice. Sometimes it is enough to invite a speaker and simply write down the necessary words and phrases, but this will not work with arbitrary texts. This is where speech synthesis technology comes in handy.

When you tell Navigator “Let's go to Gogol, 25” or say a search query out loud, speech recognition technology converts your voice into a text command. There is also the opposite task: turning text into voice. Sometimes it is enough to invite a speaker and simply write down the necessary words and phrases, but this will not work with arbitrary texts. This is where speech synthesis technology comes in handy.

","contentType":"text/html"),"titleImage":("h32":("height":32,"path":"/get-yablogs/47421/file_1475751201967/h32","width": 58,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/h32"),"major1000":("height":246,"path":"/get- yablogs/47421/file_1475751201967/major1000","width":444,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/major1000"),"major288":(" height":156,"path":"/get-yablogs/47421/file_1475751201967/major288","width":287,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421 /file_1475751201967/major288"),"major300":("path":"/get-yablogs/47421/file_1475751201967/major300","fullPath":"https://avatars.mds.yandex.net/get-yablogs/ 47421/file_1475751201967/major300","width":300,"height":150),,"major444":("path":"/get-yablogs/47421/file_1475751201967/major444","fullPath":"https:/ /avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/major444","width":444,"height":246),,"major900":("path":"/get-yablogs/47421/ file_1475751201967/major900","fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/major900","width":444,"height":246),"minor288": ("path":"/get-yablogs/47421/file_1475751201967/minor288","fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/minor288","width": 288,"height":160),,"orig":("height":246,"path":"/get-yablogs/47421/file_1475751201967/orig","width":444,"fullPath":"https: //avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/orig"),"touch288":("path":"/get-yablogs/47421/file_1475751201967/touch288","fullPath":"https ://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/touch288","width":444,"height":246),,"touch444":("path":"/get-yablogs/ 47421/file_1475751201967/touch444","fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/touch444","width":444,"height":246),"touch900 ":("height":246,"path":"/get-yablogs/47421/file_1475751201967/touch900","width":444,"fullPath":"https://avatars.mds.yandex.net/get -yablogs/47421/file_1475751201967/touch900"),"w1000":("height":246,"path":"/get-yablogs/47421/file_1475751201967/w1000","width":444,"fullPath":" https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/w1000"),"w260h260":("height":246,"path":"/get-yablogs/47421/file_1475751201967/w260h260 ","width":260,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/w260h260"),"w260h360":("height":246,"path ":"/get-yablogs/47421/file_1475751201967/w260h360","width":260,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/w260h360"), "w288":("height":156,"path":"/get-yablogs/47421/file_1475751201967/w288","width":282,"fullPath":"https://avatars.mds.yandex.net /get-yablogs/47421/file_1475751201967/w288"),"w288h160":("height":160,"path":"/get-yablogs/47421/file_1475751201967/w288h160","width":288,"fullPath" :"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/w288h160"),"w300":("height":162,"path":"/get-yablogs/47421/file_1475751201967 /w300","width":292,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/w300"),"w444":("height":246, "path":"/get-yablogs/47421/file_1475751201967/w444","width":444,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/w444" ),"w900":("height":246,"path":"/get-yablogs/47421/file_1475751201967/w900","width":444,"fullPath":"https://avatars.mds.yandex 
.net/get-yablogs/47421/file_1475751201967/w900"),"major620":("path":"/get-yablogs/47421/file_1475751201967/major620","fullPath":"https://avatars.mds. yandex.net/get-yablogs/47421/file_1475751201967/major620","width":444,"height":150)),,"tags":[("displayName":"Yandex technologies","slug":"tekhnologii -yandeksa","url":"/blog/company? ?tag=tekhnologii-yandeksa"),("displayName":"how does it work?","slug":"kak-eto-rabotaet","url":"/blog/company??tag=kak-eto- rabotaet")],"isModerator":false,"isTypography":false,"metaDescription":"","metaKeywords":"","relatedTitle":"","isAutoRelated":false,"commentsEnabled":true, "url":"/blog/company/kak-eto-rabotaet-sintez-rechi","urlTemplate":"/blog/company/%slug%","fullBlogUrl":"https://yandex.ru/blog /company","addCommentUrl":"/blog/createComment/company/kak-eto-rabotaet-sintez-rechi","updateCommentUrl":"/blog/updateComment/company/kak-eto-rabotaet-sintez-rechi", "addCommentWithCaptcha":"/blog/createWithCaptcha/company/kak-eto-rabotaet-sintez-rechi","changeCaptchaUrl":"/blog/api/captcha/new","putImageUrl":"/blog/image/put" ,"urlBlog":"/blog/company","urlEditPost":"/blog/57f4dd21ccb9760017cf4ccf/edit","urlSlug":"/blog/post/generateSlug","urlPublishPost":"/blog/57f4dd21ccb9760017cf4ccf/publish" ,"urlUnpublishPost":"/blog/57f4dd21ccb9760017cf4ccf/unpublish","urlRemovePost":"/blog/57f4dd21ccb9760017cf4ccf/removePost","urlDraft":"/blog/company/kak-eto-rabotaet-sintez-rechi/draft", "urlDraftTemplate":"/blog/company/%slug%/draft","urlRemoveDraft":"/blog/57f4dd21ccb9760017cf4ccf/removeDraft","urlTagSuggest":"/blog/api/suggest/company","urlAfterDelete":" /blog/company","isAuthor":false,"subscribeUrl":"/blog/api/subscribe/57f4dd21ccb9760017cf4ccf","unsubscribeUrl":"/blog/api/unsubscribe/57f4dd21ccb9760017cf4ccf","urlEditPostPage":"/blog/ company/57f4dd21ccb9760017cf4ccf/edit","urlForTranslate":"/blog/post/translate","urlRelateIssue":"/blog/post/updateIssue","urlUpdateTranslate":"/blog/post/updateTranslate","urlLoadTranslate": "/blog/post/loadTranslate","urlTranslationStatus":"/blog/company/kak-eto-rabotaet-sintez-rechi/translationInfo","urlRelatedArticles":"/blog/api/relatedArticles/company/kak-eto- rabotaet-sintez-rechi","author":("id":"24151397","uid":("value":"24151397","lite":false,"hosted":false),,"aliases": ("13":"chistyakova"),"login":"amarantta","display_name":("name":"Sveta Chistyakova","avatar":("default":"24700/24151397-15660497"," empty":false)),,"address":" [email protected] ","defaultAvatar":"24700/24151397-15660497","imageSrc":"https://avatars.mds.yandex.net/get-yapic/24700/24151397-15660497/islands-middle","isYandexStaff": true),"originalModificationDate":"2018-04-03T15:56:07.719Z","socialImage":("h32":("height":32,"path":"/get-yablogs/47421/file_1475751201967/ h32","width":58,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/h32"),"major1000":("height":246," path":"/get-yablogs/47421/file_1475751201967/major1000","width":444,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/major1000") ,"major288":("height":156,"path":"/get-yablogs/47421/file_1475751201967/major288","width":287,"fullPath":"https://avatars.mds.yandex. 
net/get-yablogs/47421/file_1475751201967/major288"),"major300":("path":"/get-yablogs/47421/file_1475751201967/major300","fullPath":"https://avatars.mds.yandex .net/get-yablogs/47421/file_1475751201967/major300","width":300,"height":150),,"major444":("path":"/get-yablogs/47421/file_1475751201967/major444"," fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/major444","width":444,"height":246),,"major900":("path":" /get-yablogs/47421/file_1475751201967/major900","fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/major900","width":444,"height": 246),"minor288":("path":"/get-yablogs/47421/file_1475751201967/minor288","fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/ minor288","width":288,"height":160),,"orig":("height":246,"path":"/get-yablogs/47421/file_1475751201967/orig","width":444, "fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/orig"),"touch288":("path":"/get-yablogs/47421/file_1475751201967/touch288" ,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/touch288","width":444,"height":246),,"touch444":("path" :"/get-yablogs/47421/file_1475751201967/touch444","fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/touch444","width":444,"height ":246),,"touch900":("height":246,"path":"/get-yablogs/47421/file_1475751201967/touch900","width":444,"fullPath":"https://avatars. mds.yandex.net/get-yablogs/47421/file_1475751201967/touch900"),"w1000":("height":246,"path":"/get-yablogs/47421/file_1475751201967/w1000","width": 444,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/w1000"),"w260h260":("height":246,"path":"/get- yablogs/47421/file_1475751201967/w260h260","width":260,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/w260h260"),"w260h360":(" height":246,"path":"/get-yablogs/47421/file_1475751201967/w260h360","width":260,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421 /file_1475751201967/w260h360"),"w288":("height":156,"path":"/get-yablogs/47421/file_1475751201967/w288","width":282,"fullPath":"https:// avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/w288"),"w288h160":("height":160,"path":"/get-yablogs/47421/file_1475751201967/w288h160","width ":288,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/w288h160"),"w300":("height":162,"path":"/ get-yablogs/47421/file_1475751201967/w300","width":292,"fullPath":"https://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/w300"),"w444": ("height":246,"path":"/get-yablogs/47421/file_1475751201967/w444","width":444,"fullPath":"https://avatars.mds.yandex.net/get-yablogs /47421/file_1475751201967/w444"),"w900":("height":246,"path":"/get-yablogs/47421/file_1475751201967/w900","width":444,"fullPath":"https: //avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/w900"),"major620":("path":"/get-yablogs/47421/file_1475751201967/major620","fullPath":"https ://avatars.mds.yandex.net/get-yablogs/47421/file_1475751201967/major620","width":444,"height":150)))))">