Modern methods of automatic speech recognition. Comparative analysis of open source speech recognition systems

July 15, 2009 at 10:16 pm

Speech recognition. Part 1. Classification of speech recognition systems

  • Artificial intelligence
Epigraph
In Russia, the field of speech recognition systems is indeed rather poorly developed. Google announced a system for recording and recognizing telephone conversations long ago... Unfortunately, I have not yet heard of systems of comparable scale and recognition quality for the Russian language.

But you should not think that everyone abroad discovered everything long ago and that we will never catch up with them. While looking for material for this series, I had to dig through a pile of foreign literature and dissertations. Moreover, these articles and dissertations were written by such wonderful American scientists as Huang Xuedong, Hisayoshi Kojima, DongSuk Yuk and others. It is clear who is propping up this branch of American science, isn't it? ;0)

In Russia, I know of only one capable company that has managed to bring domestic speech recognition systems to the commercial level: the Center for Speech Technologies. But perhaps after this series of articles it will occur to someone that it is possible and necessary to start developing such systems. Moreover, in terms of algorithms and mathematical apparatus we have hardly fallen behind at all.

Classification of speech recognition systems

Today, the term "speech recognition" covers an entire field of scientific and engineering activity. In general, every speech recognition task boils down to extracting human speech from the input audio stream, classifying it and responding to it appropriately. This could be executing an action on a person's command, spotting a particular marker word in a large archive of telephone conversations, or voice text input.

Signs of classification of speech recognition systems
Each such system has certain tasks that it is designed to solve and a set of approaches used to solve them. Let us consider the main features by which human speech recognition systems can be classified and how each feature affects the operation of the system.
  • Dictionary size. Obviously, the larger the dictionary built into the recognition system, the greater the word error rate. For example, a dictionary of 10 digits can be recognized almost without error, while the error rate for a dictionary of 100,000 words can reach 45%. On the other hand, even recognition with a small dictionary can produce a large number of errors if the words in that dictionary are very similar to one another.
  • Speaker-dependence or speaker-independence of the system. By definition, a speaker-dependent system is designed to be used by a single user, while a speaker-independent system is designed to work with any speaker. Speaker independence is a difficult goal to achieve, since when training the system, it is adjusted to the parameters of the speaker on whose example it is being trained. The recognition error rate of such systems is usually 3-5 times higher than the error rate of speaker-dependent systems.
  • Isolated or continuous speech. If each word in the speech is separated from the next by a stretch of silence, the speech is said to be isolated. Continuous speech is naturally spoken sentences. Recognizing continuous speech is much harder because the boundaries of individual words are not clearly defined and their pronunciation is heavily distorted by the blurring of spoken sounds.
  • Purpose. The purpose of the system determines the level of abstraction at which spoken speech recognition takes place. In a command system (for example, voice dialing on a cell phone), a word or phrase is most likely recognized as a single speech element. A text dictation system requires greater recognition accuracy and, when interpreting a spoken phrase, will most likely rely not only on what was said at the given moment but also on how it relates to what was said before. The system must also have a built-in set of grammar rules that the spoken, recognized text must satisfy. The stricter these rules, the simpler the recognition system is to implement, but the more limited the set of sentences it can recognize.
Differences between speech recognition methods
When creating a speech recognition system, you need to choose what level of abstraction is adequate for the task, what parameters of the sound wave will be used for recognition and methods for recognizing these parameters. Let's consider the main differences in the structure and process of operation of various speech recognition systems.
  • By type of structural unit. When analyzing speech, individual words or parts of spoken words, such as phonemes, di- or triphones, or allophones, can be chosen as the basic unit. Depending on which structural unit is chosen, the structure, versatility and complexity of the dictionary of recognizable elements changes.
  • By the features extracted. The raw sequence of sound-pressure samples is excessively redundant for recognition systems and contains a great deal of information that is unnecessary for recognition, or even harmful. Thus, to represent the speech signal it is necessary to extract from it parameters that adequately represent the signal for recognition purposes.
  • By mechanism of operation. Modern systems use a variety of approaches to the mechanism by which recognition works. In the probabilistic-network approach, the speech signal is divided into parts (frames, or by phonetic characteristics), after which a probabilistic estimate is made of which element of the recognition dictionary this part, and/or the whole input signal, belongs to. The approach based on solving the inverse problem of sound synthesis determines, from the input signal, the nature of the movement of the articulators of the vocal tract and, using a special dictionary, identifies the pronounced phonemes.

UPD: Moved to “Artificial Intelligence”. If there is interest, I will continue to publish there.

Commercial speech recognition programs appeared in the early nineties. They are typically used by people who, due to a hand injury, are unable to type large amounts of text. These programs (for example, Dragon NaturallySpeaking, VoiceNavigator) convert the user's voice into text, thus relieving their hands. The conversion reliability of such programs is not very high, but it has gradually improved over the years.

The increased computing power of mobile devices has made it possible to create speech recognition programs for them. Among such programs, it is worth noting Microsoft Voice Command, which allows you to work with many applications by voice. For example, you can play music in your player or create a new document.

Intelligent speech solutions that automatically synthesize and recognize human speech are the next step in the development of interactive voice systems (IVR). The use of an interactive phone application is currently not a fashion trend, but a vital necessity. Reducing the workload of contact center operators and secretaries, reducing labor costs and increasing the productivity of service systems are just some of the benefits that prove the feasibility of such solutions.

Progress, however, does not stand still, and recently automatic speech recognition and synthesis systems have increasingly begun to be used in interactive telephone applications. In this case, communication with the voice portal becomes more natural, since selection in it can be made not only using tone dialing, but also using voice commands. At the same time, recognition systems are independent of speakers, that is, they recognize the voice of any person.

The next step in speech recognition technology can be considered the development of so-called silent speech interfaces (SSI). These speech processing systems are based on capturing and processing speech signals at an early stage of articulation. This stage in the development of speech recognition is driven by two significant shortcomings of modern recognition systems: excessive sensitivity to noise and the need for clear, distinct speech when addressing the recognition system. The SSI approach uses new sensors that are unaffected by noise to complement the processed acoustic signals.

Today, there are five main areas of use of speech recognition systems:

Voice control is a way of interacting with and controlling a device using voice commands. Voice control systems are ineffective for entering text, but are convenient for entering short commands.

Types of systems

Today, there are two types of speech recognition systems: those operating "on the client" and those operating on the "client-server" principle. With client-server technology, a speech command is entered on the user's device and transmitted via the Internet to a remote server, where it is processed and returned to the device in the form of a command (Google Voice, Vlingo, etc.); thanks to the large number of users of the server, the recognition system obtains a large training base. The first option relies on different mathematical algorithms and is rare (Speereo Software): the command is entered on the user's device and processed there. The advantage of processing "on the client" is mobility and independence from the availability of communication and from remote equipment. Thus, a system running "on the client" seems more reliable, but is sometimes limited by the power of the user's device.


INTRODUCTION

Human speech has been studied for a long time. In the middle of the twentieth century, the problem of automatic speech recognition by computers arose. Over half a century, scientists have managed to accumulate a huge amount of knowledge about the subject of research. It became clear that speech recognition is a very difficult task.

The core technique of many speech recognition systems is a statistical method called Hidden Markov Modeling (HMM). Such systems are being developed in many research centers and are capable of good word recognition; the probability of correct word recognition reaches 80-90%.

The areas of application of automatic speech recognition systems are very diverse. For example, since the early nineties, several American and Canadian companies, commissioned by the US Department of Defense, have been developing recognition systems designed for intercepting telephone conversations. More recently, recognition systems have been used in computer courses for learning foreign languages and in systems for preparing text documents. Promising areas are the development of assistance systems for people with disabilities and improvement of the human-machine interface.

Factors hindering the widespread implementation of automatic speech recognition systems are:

The complexity of implementation in small-sized mobile equipment due to high computational costs and their significant unevenness, as well as the need to store a large dictionary (a set of models of recognizable speech units) in memory;

Significant deterioration in quality parameters under interference conditions.

This paper presents the basic principles of constructing speech recognition systems, pre-processing the source signal, and constructing acoustic and language models, and considers a modern approach to the noise immunity of recognition systems. Methods for assessing the quality of recognition systems are also described.

Attention is also paid to development problems, prospects for development and continuous improvement of recognition systems.

1. SPEECH RECOGNITION SYSTEMS

Speech recognition is the process of converting an acoustic signal, captured electrically, into a sequence of words. The recognized words may be the end result, if the purpose of the system is control, data entry or document preparation. They may also serve as the basis for subsequent linguistic processing aimed at speech understanding.

1.1 Classification and structure of speech recognition systems

Classification

Speech recognition systems are characterized by many parameters, the main ones of which are given in Table 1.1.

Table 1.1. General parameters of speech recognition systems

Parameter                     Range of variation
Connectivity                  Isolated words or continuous speech
Speaking style                Speech read from written text or spontaneous
Adjustment                    Speaker-dependent or speaker-independent
Dictionary size               Small (<20 words) to large (>20,000 words)
Language model                Finite-state or context-dependent
Perplexity                    Small (<10) to large (>100)
Signal-to-noise ratio (SNR)   Large (>30 dB) to small (<10 dB)
If the system is designed to recognize isolated words, the speaker must pause between them; if it is designed for continuous speech, no pauses are required. Spontaneous speech usually contains far more disfluencies than the speech of a person reading a written text, and is therefore harder to recognize. Some systems require speaker adjustment, where the user must say a few words or phrases to tune the system before use, while other systems do not. Recognition is generally harder when the vocabulary is large and contains many similar-sounding words.

The simplest language model can be described by a finite-state network with a certain number of states, in which the set of words allowed to follow each word is fixed. Models that approximate natural language are specified using context-dependent grammars.

A widely used indicator of the difficulty of the task solved by a recognition system is perplexity. Perplexity is defined as the number of words that can follow a given word under a given language model.

The recognition system is also characterized by such a parameter as the acceptable signal-to-noise ratio (SNR).

Speech recognition is a complex task, mainly due to the large number of sources influencing the parameters of the speech signal:

The acoustic realization of phonemes, the smallest speech units, depends strongly on the surrounding phonetic context (/t/ in the words two, true, butter); in phrases the contextual dependence becomes even stronger ("master production", "learn good manners");

Acoustic signal variations due to differences in room acoustics, microphone characteristics and placement;

The physical and emotional state of the speaker;

The speaker's age, gender, social status and dialect.

The general structure of the speech recognition system is presented in Figure 1.1.

Figure 1.1 - Structure of the speech recognition system.

The speech signal is divided into sections, and a set of parameters is calculated for each section. These parameters are used to find the best candidate word within the available acoustic, lexical and language models. Lexical models in modern systems are included in the language model as principles and methods for creating a dictionary based on the existing text base and searching in it. In the simplest systems, the language model degenerates into a lexical one.

1.2 Current level of development

The quality of a recognition system is usually assessed with the word error rate:

E = (S + I + D) / N · 100%, (1.1)

where N is the total number of words in the test set, and S, I and D are the numbers of word substitutions, insertions and deletions, respectively.
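
For illustration, here is a minimal sketch of how the error rate (1.1) can be computed in practice: the substitutions, insertions and deletions are found by aligning the recognized word sequence against the reference with a Levenshtein-style dynamic program (the function name and test phrases are made up for the example).

```python
import numpy as np

def word_error_rate(reference, hypothesis):
    """WER = (S + I + D) / N, computed via edit-distance alignment of word lists."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimal number of edits turning ref[:i] into hyp[:j]
    dp = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    dp[:, 0] = np.arange(len(ref) + 1)   # deletions only
    dp[0, :] = np.arange(len(hyp) + 1)   # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One deletion ("the") and one insertion ("off") over 4 reference words -> 0.5
print(word_error_rate("turn the light on", "turn light on off"))
```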

Since the 1990s, significant advances have been made in speech recognition technology. The error rate decreased by approximately 2 times every 2 years. The barriers of the recognition system's dependence on the speaker, continuous speech recognition, and the use of a large dictionary have been largely overcome. Several factors contributed to this:

- use of Hidden Markov Models (HMM);

- development of standard rules for compiling speech databases for training and testing (TIMIT, RM, ATIS, WSJ, etc.); they allow developers to determine, using statistical techniques, the acoustic cues important for distinguishing phonetic features. Standardized training and testing rules also make it possible to compare the performance of different systems;

- a significant increase in the performance of computing systems.

A typical task with a low level of perplexity (PP = 11) is recognizing numbers in a standard telephone channel. Here, an error rate of 0.3% has been achieved with a known length of the sequence of digits.

Medium-perplexity tasks include resource management tasks; for example, a spontaneous speech recognition system for the Air Travel Information Service (ATIS) with a dictionary of about 2000 words and PP = 15 achieves an error rate of no more than 3%.

Systems designed for text dictation have a high level of perplexity (PP ≈ 200) and a large dictionary (about 20,000 words). The error rate achieved by such systems is about 7%.

The main areas of application of recognition systems are voice dialing of a telephone number (for example, “calling home” instead of dialing a number), document preparation, information and reference systems, and foreign language teaching systems.

1.3 Prospects

Noise immunity

The performance of recognition systems degrades catastrophically as the mismatch grows, due to various kinds of interference, between the conditions under which the training speech data were recorded and the conditions of real operation. Therefore, special attention will be paid to the influence of the acoustic environment and the electrical characteristics of the transmission channel.

Portability

When modern systems are transferred to solve a new problem, the quality of their work is greatly reduced. To improve it, retraining of the system is required. Portability implies the ability to use the system to solve different problems with minimal adjustment.

Adaptation, adjustment

Even during the operation of the system to solve the same problem, external conditions may change (speakers, microphones, etc.). It is necessary to decide how to force the system to improve the quality of work during operation and adapt to new conditions.

Language models

Modern systems use statistical language models to reduce the search space and resolve acoustic model uncertainty. As the size of the vocabulary grows and other constraints relax, defining the rules and constraints imposed by the syntax of the language being recognized becomes increasingly important to creating viable systems. At the same time, purely statistical language models will include more and more syntactic and semantic rules and restrictions.

A measure of confidence in hypotheses

To order hypotheses, most recognition systems associate each hypothesis with a certain numeric weight. At present this weight is, as a rule, not a true measure of confidence in the hypothesis (i.e. of why this hypothesis is better than the others). For control tasks, methods for assessing the reliability of hypotheses need to be improved.

Words not included in the dictionary

Systems are designed for use with a specific vocabulary. However, in real life there will always be a certain percentage of words that are not included in the dictionary. There must be methods for detecting the presence of such words and processing them.

Spontaneous speech

Systems operating in real conditions always encounter various phenomena inherent in spontaneous speech: false starts, stuttering, ungrammatical constructions, etc. The development of ATIS has solved many issues in this area, but not all.

Prosody (intonation and rhythm)

Intonation and rhythmic structure of speech carry information about the meaning of spoken words. However, the question of how to integrate prosodic information into a recognition system has not yet been resolved.

Dynamics Simulation

Modern systems receive a sequence of sections of an acoustic signal and process them as static and independent of each other. However, it is known that signal sections perceived as phonemes and words require the combination of parameters extracted from the signal and their presentation in dynamics. This would reflect dynamic articulation. How to model the dynamics of a speech signal for a recognition system is an unsolved problem.

2. REPRESENTATION OF THE ORIGINAL SIGNAL

2.1 Principles of signal pre-processing

In speech recognition based on statistical methods, the original signal is sampled at a rate of 6.6 to 20 kHz and processed so as to represent it as a sequence of vectors in a feature space that models the state of the speaker's vocal tract. A section of the original signal 10-25 ms long (150-300 samples, usually highly correlated with one another) is decomposed into an orthogonal series and, for a given error, is represented by 10-20 expansion coefficients, called parameters.

These parameter vectors are used in subsequent steps to estimate the probability of a vector or sequence of vectors belonging to a phoneme or a whole word when testing the membership hypothesis.

In most systems, the processes of vector representation of the signal and of probability estimation are closely related. It is therefore conventional to say that if an operation or procedure is applied to the speech signal, it belongs to the representation stage; if it is used to test a hypothesis, it is part of the matching (likelihood computation) stage.

The purpose of the signal representation stage is to preserve all the information needed for phonetic identification of the section of the speech signal in question. At the same time, the representation should be as insensitive as possible to factors such as differences between speakers, characteristics of communication channels, and the speaker's emotional state. The representation should also be as compact as possible.

The representations used in modern systems reflect the properties of the speech signal due to the shape of the vocal tract rather than the excitation signal (the fundamental tone generated by the larynx and vocal cords). The representations capture only whether the vocal cords vibrate or not, i.e. whether the sound is voiced.

The representations used are almost always derived from the short-term energy spectrum, i.e. the power spectral density of the signal

|S(e^jω)|² = | Σ_{l=1..n} x_l · e^(-jωl) |²,

where x1, …, xl, …, xn is the sequence of samples in the segment and S(e^jω) are the spectral coefficients. The use of the energy spectrum is justified because the ear is insensitive to the phase of the acoustic signal.

In addition, a logarithmic representation of the energy spectrum is almost always used. This reduces the excessively large changes of the parameters under significant fluctuations of signal amplitude, and transforms multiplicative acoustic effects and interference from the equipment into additive interference. The disadvantage of the logarithmic representation is that the logarithm of zero is undefined. This requires flooring the amplitude scale at some non-zero value and limiting the signal itself from below, to avoid excessive sensitivity to low-energy spectral components, which are mostly noise.

Figure 2.1 - Representation of speech signal for recognition

Before the spectrum is calculated, the signal usually undergoes preliminary filtering, which increases the gain with frequency at a slope of 6 dB/octave to compensate for the attenuation introduced by the electrical path. The original signal is then divided into successive, overlapping sections, typically 25 ms long, which are weighted by a bell-shaped window function to reduce the signal amplitude at the edges of each section. The power spectral density is then calculated.
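
The pre-processing steps described above can be sketched roughly as follows. The frame and step lengths, the 0.97 pre-emphasis coefficient and the Hamming window are common choices rather than values prescribed by the text, and all names are illustrative (the sketch assumes the signal is at least one frame long).

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25, step_ms=10, preemph=0.97):
    """Pre-emphasize, split into overlapping frames and apply a bell-shaped window."""
    # First-order pre-emphasis: boosts high frequencies by roughly 6 dB/octave
    x = np.append(x[0], x[1:] - preemph * x[:-1])
    frame_len = int(fs * frame_ms / 1000)
    step = int(fs * step_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // step)
    window = np.hamming(frame_len)           # reduces amplitude at the frame edges
    return np.stack([x[i * step: i * step + frame_len] * window
                     for i in range(n_frames)])

def power_spectrum(frames, nfft=512):
    """Power spectral density of each frame (the phase is discarded)."""
    return np.abs(np.fft.rfft(frames, nfft)) ** 2

fs = 16000
signal = np.random.randn(fs)                 # one second of dummy audio
frames = frame_signal(signal, fs)
spec = power_spectrum(frames)
print(frames.shape, spec.shape)              # (98, 400) and (98, 257) for these settings
```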

The resulting energy spectrum contains an undesirable harmonic component at the fundamental frequency. This component can be reduced by grouping adjacent spectral components into about 20 bands before taking the logarithm of power. These bands are often made progressively wider, starting from about 1 kHz. A set of digital filters can also be used; the results are similar.
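
A sketch of this band-grouping step follows. Mel spacing is used here as one common way of making the bands progressively wider; the text does not mandate a particular spacing, and the band count and floor value are illustrative.

```python
import numpy as np

def log_band_energies(power_spec, fs, n_bands=20, nfft=512):
    """Group FFT bins into ~20 progressively wider bands and take the logarithm."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_bands + 2))
    bins = np.floor((nfft + 1) * edges_hz / fs).astype(int)
    fbank = np.zeros((n_bands, nfft // 2 + 1))
    for b in range(n_bands):                 # triangular weighting within each band
        left, centre, right = bins[b], bins[b + 1], bins[b + 2]
        fbank[b, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[b, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    energies = power_spec @ fbank.T          # (n_frames, n_bands)
    return np.log(np.maximum(energies, 1e-10))  # floor before the logarithm, as noted above
```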

The cepstral representation of the signal further reduces the correlation between neighboring samples of the original signal. Here it is assumed that speech is the output of a linear system with slowly varying parameters, the vocal tract, excited either by a train of fundamental-tone pulses or by noise. Analysis of the speech signal then consists of calculating the parameters of the vocal tract from the measured parameters of the speech signal and tracking them over time. Since the excitation signal x(n) and the filter impulse response h(n) interact through the convolution operation, the analysis problem is treated as the problem of separating the components of a convolution. This is called the deconvolution problem. To solve it, a homomorphism of the following form is needed: C(x(n)*h(n)) = C(x(n)) + C(h(n)). This homomorphism can be implemented using the transformation:

c(n) = F^-1( ln |F(x(n))| ), (2.2)

which is called the cepstrum of the discrete signal x(n); here F and F^-1 are the direct and inverse discrete Fourier transforms, respectively.
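
Formula (2.2) translates almost directly into code; the number of retained coefficients and the floor value below are arbitrary illustrative choices.

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=13):
    """c(n) = F^-1( ln |F(x(n))| ), keeping only the first few quefrency coefficients."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.maximum(np.abs(spectrum), 1e-10))  # floor to avoid log(0)
    cepstrum = np.fft.irfft(log_mag)
    # Low quefrencies mostly reflect the vocal-tract shape, high ones the excitation
    return cepstrum[:n_coeffs]
```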

The autoregressive representation of the signal (linear prediction, LPC) is based on the same model of speech production. The autoregression coefficients are calculated from the condition of directly minimizing the correlation between nearby samples of the speech signal x(t_i).

At the initial stage of parameter computation, different developers use different models, either the energy spectrum or autoregression. In telephony, for example, autoregression is usually used, since these parameters are already computed in all modern telephone vocoders. On general-purpose computers the spectrum is usually computed, because the components used to compute it can be reused by other applications. Subsequently the cepstral coefficients Ci are calculated, as they are best suited to the recognition task. Computing the cepstrum via autoregression is computationally cheaper, which suits the limited resources of telephony. Computers have no such strict limitation, but versatility and code reuse matter, so the spectrum is preferred there. Some systems also compute the dynamics of the signal parameters dCi, both within a signal section and between adjacent sections.

Various constant external factors, such as the characteristics of a particular telephone connection, appear as a constant component (bias) of the spectrum or cepstrum. The difference (dynamic) parameters dCi are not subject to such effects. If the first-order dynamic parameters are passed through an integrator, values close to the original static parameters Ci are restored. A similar technique, applied to sequences of power spectrum coefficients before taking the logarithm, is useful for reducing interference from stationary or slowly varying additive noise.
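
A common way to compute the dynamic parameters dCi is a linear regression over a few neighbouring frames; the sketch below assumes that convention, and the window width is illustrative.

```python
import numpy as np

def delta(coeffs, width=2):
    """Dynamic parameters dC_i: regression slope of each coefficient over +-width frames.
    coeffs has shape (n_frames, n_coeffs)."""
    padded = np.pad(coeffs, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(d * d for d in range(1, width + 1))
    return sum(d * (padded[width + d: len(coeffs) + width + d] -
                    padded[width - d: len(coeffs) + width - d])
               for d in range(1, width + 1)) / denom
```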

Since the cepstral coefficients are almost uncorrelated, a computationally efficient method for obtaining reasonably good probabilistic estimates in the subsequent matching process is to calculate the Euclidean distances to the corresponding model vectors. The calculation of distances is made after suitable weighting of the coefficients (parameters). There are many weighting methods, grouped into two main classes: empirical and statistical.
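
A minimal sketch of the weighted Euclidean matching step follows; the weights are simply passed in, since the text leaves the particular empirical or statistical weighting scheme open, and all names are illustrative.

```python
import numpy as np

def nearest_model(cepstral_vector, model_vectors, weights):
    """Weighted Euclidean distance from one cepstral vector to each stored model vector.
    weights may come from an empirical (liftering) or statistical (inverse-variance) scheme."""
    diffs = model_vectors - cepstral_vector           # (n_models, n_coeffs)
    dists = np.sqrt(((weights * diffs) ** 2).sum(axis=1))
    return int(np.argmin(dists)), dists
```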

There are techniques that combine the methods listed above and make it possible to remove the correlation of the parameters almost completely; however, because of their growing computational cost they are currently used only for demonstration purposes.

2.2 Prospects

Currently, the possibility of using wavelet transformations and neural network methods at the stage of presenting the original signal is being studied, allowing nonlinear operations with the original signal or with the results of other transformations. The development of representations that more accurately reflect the acoustics of a room, as well as restore articulation from a speech signal, continues.

Modern methods of representing a signal use only the spectrum shape, without taking into account the fundamental frequency. However, it is known that even in single word recognition, pitch frequency can be a clue to lexical word identification. This applies not only to tonal languages ​​like Chinese, but also to European ones, since this frequency is related to lexical stress. In connected speech, the fundamental tone carries information about the syntactic structure of the sentence and the mood of the speaker. Research in this area will continue.

3. NOISE IMMUNITY IN SPEECH RECOGNITION

3.1 Definition of noise immunity

Noise immunity (robustness) in speech recognition is associated with the need to ensure sufficient accuracy under destabilizing factors:

When the quality of the input speech signal is low,

When there are significant differences in the acoustic, articulatory and phonetic characteristics of speech between the training and testing (work) conditions.

The sources of these factors are:

Acoustic interference in the form of additive noise,

Linear filtering effects,

Nonlinear distortions during conversion and transmission of the original signal,

Impulse interference,

Changes in speaker articulation caused by the presence of noise sources.

Modern systems designed to operate in favorable acoustic conditions have largely achieved speaker independence, compensating for some signal degradation due to noise and unknown linear filtering. However, for real-world applications, the need for improved robustness is clear. Even the best modern systems significantly degrade performance if the signal being recognized passes through a telephone channel or if the speaker speaks with an accent. Next, we consider resistance to signal distortion caused by surrounding, external sources of interference. The main approaches to robustness are dynamic adjustment of parameters, the use of microphone arrays, and signal processing taking into account psychological models of perception.

3.2 Dynamic adjustment of parameters

Typically, models for adapting systems to changing environmental conditions assume that the sources of speech quality degradation are additive noise with an unknown power spectral density distribution or a combination of additive noise and linear filtering. To compensate for these interferences, the system can dynamically adjust the acoustic parameters, both calculated from the recognized input signal and the acoustic models of utterances stored by the system. There are three main approaches to dynamically adjusting parameters:

Using optimal estimation to obtain new parameter values ​​under testing conditions,

Application of compensation based on empirical comparison of speech signal in training and testing conditions,

High-pass filtering of parameter values.

Optimal parameter estimation

Two main approaches to optimal estimation are used.

The first is based on a formal statistical model that characterizes the difference between the speech used to train the system and the speech used to test it. The values of the model parameters are estimated using test speech samples recorded in different environments, after which either the calculated parameters of the input signal or the acoustic models of speech units stored in the system are modified. Experiments show that this approach significantly reduces the number of errors when recognizing a speech signal with additive noise. However, this approach is unable to seriously counteract the deterioration of speech quality in real conditions.

A second popular approach is to use knowledge of noise to force phonetic models to characterize speech with noise. Knowledge is derived from existing interference patterns and used to adjust the parameters of phonetic models (changes in means and variances) calculated from speech without interference. This approach is implemented in a technique called parallel model combination. It gives good results for additive, multiplicative interference and for real speech signals. However, currently too high computational costs prevent its use in recognition systems.

Empirical comparison of parameters

The parameters extracted from speech without interference are compared with the parameters of the same speech recorded with interference. In this approach, the combined effect of various interferences is considered as additive violations of signal parameters. When comparing parameters, correction vectors are calculated, which are then used to correct either the parameter vectors of the input recognized signal or the parameter vectors of acoustic models stored in the recognition system.

Recognition accuracy is improved if correction vectors are assumed to depend on: the signal-to-noise ratio, the location in parameter space within a given signal-to-noise ratio, or the expected correspondence of phonemes.

This general approach can be extended to cases where the test environment is unknown a priori, by forming an ensemble of correction vectors for many different test environmental conditions. The correction vectors are then sequentially applied to the speech models, starting with the presumably most likely vector, until the best match to the vector obtained from the input signal is found.

If the conditions for calculating correction vectors are close to the actual operating conditions of the system, the quality of its operation is quite high. The disadvantage is the need to use stereo recording to create a database of acoustic models.

Applying High Pass Filters

The use of high-frequency or bandpass filtering when calculating cepstral coefficients allows one to significantly increase the noise immunity of the system at a minimum cost. This method is implemented in the RASTA and CMN algorithms. These algorithms are now used in almost all systems where noise immunity is required.
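
CMN itself is simple enough to show in a few lines; this sketch assumes a matrix of per-frame cepstral vectors for one utterance (RASTA, which band-pass filters the same trajectories, is omitted here).

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """CMN: subtract the per-utterance mean of each cepstral coefficient.
    A constant channel (telephone line, microphone) shows up as an additive bias in the
    cepstral domain, so removing the mean acts as a cheap high-pass filter over time."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```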

3.3 Using microphone arrays

Additional improvement in recognition accuracy at low signal-to-noise ratios can be achieved using a microphone array. Such an array can, in principle, realize a directional sensitivity pattern with a maximum toward the speaker and minima toward interference sources, similar to a phased antenna array in radio communications. By changing the phasing of individual elements using adders and delay lines, the directivity pattern can be fine-tuned as operating conditions change. At the same time, algorithms are used to compensate for the spectral coloring introduced by the array itself. Experiments with a microphone array in an office environment showed a reduction in the error rate to 61% for interference in the form of an additive noise source.
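
A rough sketch of the delay-and-sum idea follows, under a far-field assumption and with integer-sample delays; the microphone positions, sampling rate and speed of sound are illustrative inputs, not values taken from the text.

```python
import numpy as np

def delay_and_sum(mic_signals, mic_positions, source_direction, fs, c=343.0):
    """Align the channels of a microphone array toward the speaker and average them.
    mic_signals: (n_mics, n_samples); mic_positions: (n_mics, 3) in metres;
    source_direction: unit vector pointing from the array toward the speaker."""
    # Relative arrival times: microphones closer to the speaker receive the wavefront earlier
    arrival = -(mic_positions @ source_direction) / c
    arrival -= arrival.min()
    n_mics, n_samples = mic_signals.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        shift = int(round(arrival[m] * fs))   # integer-sample approximation of the delay
        out[: n_samples - shift] += mic_signals[m, shift:]
    return out / n_mics
```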

Although the array is effective against interference in the form of additive, independent noise, its performance degrades significantly in the presence of many reflective surfaces, when the interference is a slightly delayed and attenuated copy of the useful signal.

More advanced systems use algorithms based on cross-correlation to compensate for signal delay interference. These algorithms are capable of amplifying the acoustic field in certain directions. However, they only marginally improve system performance compared to simple delay and sum algorithms.

3.4 Psychologically based signal processing

Processing the original speech signal taking into account psychological models of perception simulates various aspects of human speech perception. Such processing systems typically include a set of bandpass filters that simulate the frequency sensitivity of human hearing, followed by nonlinear signal processing devices within and between channels.

Recent evaluations of recognition systems show that perceptual simulation models provide better recognition accuracy than traditional cepstrum, both under noisy conditions and across differences in training and testing conditions. However, these models are inferior in terms of quality to algorithms for dynamic adjustment of parameters; in addition, dynamic adjustment is less expensive.

It is possible that the failure of simulation models is associated with the use of Hidden Markov models for classification, which turn out to be poorly adapted to work with the resulting parameters. A number of researchers also believe that the optimal set of parameters calculated using these models and characterizing the speech signal as accurately as possible has not yet been found. Therefore, this area continues to attract close attention from researchers.

3.5 Outlook

Despite its obvious importance, robustness in speech recognition has only recently attracted the attention of researchers. Significant success has been achieved only for conditions of fairly “friendly” interference, such as additive noise or linear filtering. The independence of systems from the speaker now extends only to native speakers. For people speaking with an accent, recognition accuracy is significantly lower, even when adjusted to the speaker.

Speech on the phone

Telephone speech recognition is difficult because each telephone channel has its own signal-to-noise ratio and frequency response. In addition, speech distortion can be caused by short-term interference or nonlinearities. Phone line applications must be able to adapt to different channels with little channel data.

High noise environment

Even when using various noise compensation techniques, recognition accuracy drops significantly at a signal-to-noise ratio below 15 dB, while a person is able to hear speech perfectly at a much lower ratio.

Crosstalk

The influence of other conversations, for example in the same room or interference on an adjacent telephone channel, is a much more difficult problem than broadband noise interference. So far, efforts to use information that distinguishes recognized speech from interfering speech have not led to significant results.

Quick adaptation to accent in speech

In today's fast-paced society, serious language applications must be able to understand speakers without an accent as well as those with an accent.

Development of principles for creating speech databases

Progress in noise-resistant recognition will also depend on the development of principles for creating speech databases and directly on the creation of such databases. To do this, it is necessary to collect, process and structure many samples of distortions and interference characteristic of practical problems.

4. ACOUSTIC MODELS

4.1 Place of the acoustic model in the system

Modern speech recognition systems are implemented mainly as software products that generate hypotheses about spoken sequences of words based on the input signal. The algorithms used in such systems are based on statistical methods.

The vector y_t of acoustic parameters is computed from the input signal every 10-30 ms. The sequences of these vectors are treated as observation sequences generated by phonetic models. On this basis, the probability p(y_1^T | W) of observing the vector sequence y_1^T when the word sequence W is pronounced is computed, in other words the probability that model W generates the sequence y_1^T. Given a sequence y_1^T, the search can be carried out using the rule

W* = argmax_W p(W) · p(y_1^T | W),

i.e. find the most likely sequence of words that generated y_1^T. This search procedure finds the word sequence with the maximum posterior probability. The probability p(y_1^T | W) is computed by the acoustic model, and p(W) by the language model.

For systems with a large dictionary, the search consists of two stages. In the first, by calculating approximate probabilities in real time using simplified models, a lattice of the n best word sequences is generated. At the second stage, more accurate probabilities are calculated with a limited number of hypotheses. Some systems generate a probable sequence of words in one step.

4.2 Acoustic models based on Markov chains

Acoustic models are elementary probabilistic models of basic linguistic units (i.e. phonemes) and are used to represent the next level units - words.

The sequence of acoustic parameters obtained from a spoken phrase is treated as the realization of a set of processes described using Hidden Markov Models (HMMs). An HMM is a combination of two random processes:

Hidden Markov chain responsible for changes over time,

Sets of observable stationary processes responsible for spectral changes.

In practice, HMMs have proven able to cope with the main sources of ambiguity in the speech signal, such as variation in phoneme pronunciation, while allowing the creation of systems with vocabularies of tens of thousands of words.

HMM structure

The model is defined as a pair of random processes (X, Y). Process X is a first-order Markov chain whose realizations are not directly observable. Realizations of the process Y take values in the space of acoustic parameters, are observed directly, and their distributions depend on the realizations of the process X.

HMM is characterized by two formal assumptions. The first concerns the Markov chain and states that the next state of the chain is determined only by the current state and does not depend on the previous trajectory. The second states that the current distribution of process Y, from which the observed value of the acoustic parameter is taken, depends only on the current state of the Markov chain (process X), and not on the previous trajectories of processes X and Y.

Appendix 1 provides a mathematical definition of the model, an example of generating an observed sequence, and calculation formulas.

To re-estimate the model parameters during training, the Baum-Welch algorithm is used, based on probability re-estimation using the Bayes formula.

HMMs can be classified according to the elements of matrix B, which by their nature are distribution functions.

If the distribution functions are defined on a finite space, the model is discrete. In this case the observed realization is a vector of values from a finite alphabet of M elements. For each element of the vector Q, selected from the set V, a non-zero discrete density {w(k), k = 1, …, M} is defined, forming the distribution. This definition assumes independence of the elements of the set V.

If the distributions are defined as probability densities on a continuous space, then the model will be continuous. In this case, requirements are imposed on the distribution functions in order to limit the number of estimated parameters to acceptable limits. The most popular approach is to use a linear combination of densities g from the family of G standard distributions with a simple parametric form. Typically, g is used as a multivariate normal distribution, characterized by a vector of mathematical expectation and a covariance matrix. The number of standard distributions involved in linear combination to form the resulting distribution is usually limited by computational capabilities and the amount of training data available.

Tuning distribution parameters during training of a continuous model requires a large number of training samples. If they are insufficient, they resort to using a pseudo-continuous model, in which a standard set of basic densities is used to form a linear combination. Linear combinations differ from each other only in their weighting coefficients. The general approach is to associate each input vector coordinate with its own distinct set of base densities.

4.3 Word modeling

Phonetic decomposition

A word is usually represented by a network of phonemes. Each path in the network represents a variant pronunciation of a word.

The same phoneme, pronounced in different contexts, may have different acoustic parameters, and therefore be modeled by different distributions. Allophones are patterns that represent a phoneme in different contexts. Deciding how many allophones will represent a particular phoneme depends on many factors, the main one being the amount of training data to tune the parameters of the acoustic model.

There are several varieties of the allophone model. One of them is polyphones. In principle, the pronunciation of a phoneme is different in all words where it occurs, therefore requiring different allophones. With a large vocabulary, it is almost impossible to train such a model due to the lack of training data. Therefore, the representation of allophones is used at several levels of detail: word, syllable, triphone, diphone, context-independent phoneme. Probability distributions of allophones at different levels of detail can be obtained by combining distributions of more detailed levels of representations. The loss of features is compensated by an improvement in the estimation of the statistical parameters of the model during its training due to an increase in the ratio of the volume of training data to the number of estimated model parameters.

Another variation is to cluster allophones into a certain number of possible classes of contexts. The class search is carried out automatically using a classification and regression tree (CART). This is a binary tree, at the root there is a phoneme, with each node associated a question about the context like: “Is the previous phoneme a nasal consonant?” For every possible answer (yes, no) there is a branch to another node. The leaves of the tree are allophones. There are CART growth and pruning algorithms that automatically associate questions from a manually created pool with nodes.

Each allophone in recognition systems is modeled using HMM. In general, all models can be built using distributions drawn from a single, shared pool or up to several thousand clusters called senones.

Models of higher-level units, such as words, can also be constructed by concatenating base models using connecting transitions and distributions. Such building blocks are called fenones and multones.

Another approach to modeling words is to use a codebook, a set of reference feature vectors. For each input vector of signal parameters, the closest reference vector in the codebook is found; each reference vector has its own index. The codebook uses a standard set of basic densities, and words are represented by sequences of these indices. Each index sequence is then modeled using an HMM.

Determining word boundaries and probabilities

In general, the speech signal and its representations do not provide clear indications of the boundaries between words, so word boundary detection is part of the hypothesis process carried out as a search. During this process, word models are compared with the sequence of acoustic parameters. In a probabilistic framework, comparing an acoustic sequence with a model means calculating the probability that the given sequence was generated by the given model, i.e. computing p(y_1^T | W). This is a key component of the recognition process.

For a given time sequence 1, 2, …, t, t+1, …, T-1, T:

The probability d_t(i) that by time t the sequence o_1, o_2, …, o_t has been observed and the model is in state S_i (the forward algorithm) is computed for all 1 ≤ i ≤ N, 1 ≤ j ≤ N, t = 1, 2, …, T-1:

at t = 1: d_1(i) = π_i · b_i(o_1); (4.2)

for t > 1: d_t(j) = [ Σ_{i=1..N} d_{t-1}(i) · a_ij ] · b_j(o_t). (4.3)

The probability f_t(i) of observing the sequence o_{t+1}, o_{t+2}, …, o_T from moment t+1 to T, given that at moment t the model is in state S_i (the backward algorithm), is computed for all 1 ≤ i ≤ N, 1 ≤ j ≤ N, t = T-1, T-2, …, 1:

at t = T: f_T(i) = 1; (4.4)

at t < T: f_t(i) = Σ_{j=1..N} a_ij · b_j(o_{t+1}) · f_{t+1}(j). (4.5)

The total probability that the model traverses some trajectory in T steps (the probability of the match between the sequence and the model) can be calculated in three ways:

P(O|λ) = Σ_{i=1..N} d_T(i); (4.6)

P(O|λ) = Σ_{i=1..N} π_i · b_i(o_1) · f_1(i); (4.7)

P(O|λ) = Σ_{i=1..N} d_t(i) · f_t(i), for any t. (4.8)

An example of probability calculation is given in Appendix 2.

For calculations, models are used in the form of a linear sequence of states with a beginning and an end. Transitions are only possible in place and from beginning to end without jumping over states. Before calculating the correspondence, the initial sequence of parameter vectors is divided into segments equal in length to the given model.
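
Formulas (4.2)-(4.8) can be sketched directly for a discrete HMM; the toy two-state model below is invented purely to show that (4.6) and (4.8) give the same value.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward pass d_t(i), formulas (4.2)-(4.3), and P(O | lambda) by (4.6).
    pi: (N,) initial probabilities; A: (N, N) transitions a_ij;
    B: (N, M) emission probabilities b_i(k); obs: sequence of symbol indices."""
    T, N = len(obs), len(pi)
    d = np.zeros((T, N))
    d[0] = pi * B[:, obs[0]]                      # (4.2)
    for t in range(1, T):
        d[t] = (d[t - 1] @ A) * B[:, obs[t]]      # (4.3)
    return d, d[-1].sum()                         # (4.6)

def backward(A, B, obs):
    """Backward pass f_t(i), formulas (4.4)-(4.5)."""
    T, N = len(obs), A.shape[0]
    f = np.zeros((T, N))
    f[-1] = 1.0                                   # (4.4)
    for t in range(T - 2, -1, -1):
        f[t] = A @ (B[:, obs[t + 1]] * f[t + 1])  # (4.5)
    return f

# Toy two-state model over a three-symbol alphabet (all numbers are illustrative)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 1, 2]
d, p_obs = forward(pi, A, B, obs)
f = backward(A, B, obs)
# (4.8): sum over i of d_t(i) * f_t(i) equals P(O | lambda) for any t
print(p_obs, (d[1] * f[1]).sum())
```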

4.4 Outlook

Significant advances in acoustic modeling achieved in recent years have made it possible to realize good recognition quality when using a large dictionary in real time, while consuming an acceptable amount of resources. However, there are a number of aspects that require improvement. First of all, this concerns adaptation to different speakers and different acoustic environments, including in the presence of interference. There are also difficulties in processing stutters, false starts, words missing from the dictionary, and other features inherent in spontaneous speech.

The main directions of modern research are acoustic noise immunity, improvement of the systems of acoustic parameters and models, work with large lexicons, support for multiple contexts and multiple languages, and development of methods for automatic training of systems.

5. LANGUAGE MODELS

5.1 Place of the language model in the system

Speech recognition systems convert the acoustic signal into an orthographic representation of the spoken utterance. The recognizer builds hypotheses using a finite dictionary. For simplicity, it is assumed that a word is uniquely identified by its pronunciation.

Significant progress in the recognition problem came with the use of a statistical model of the joint distribution p(W, O) of a sequence of spoken words W and the corresponding acoustic sequence O. This approach was first used by IBM under the name "source-channel model". It evaluates how well a candidate word sequence from the dictionary matches the observed acoustic evidence O using the posterior distribution p(W|O).

To minimize the error, the system chooses the dictionary word sequence that maximizes this posterior distribution:

W* = argmax_W p(W|O) = argmax_W [ p(O|W) · p(W) / p(O) ],

where p(W) is the probability of the word sequence W, p(O|W) is the probability of observing the acoustic sequence O when the word sequence W is pronounced, and p(O) is the total probability of observing the sequence O over all available acoustic models. p(O|W) = p(y_1^T|W) = P(O|λ) is computed at the acoustic modeling stage using HMMs and is called the channel model. p(O) is assumed to be equal to 1. The prior probability p(W) is computed using the language model (LM).

A similar recognition model is used to recognize printed and handwritten texts.

5.2 Trigram-based language model

For a given sequence of words W = (w1, …, wn), its probability can be represented as

p(W) = Π_{i=1..n} p(wi | w0, w1, …, w(i-1)) = Π_{i=1..n} p(wi | hi),

where w0 is defined suitably to provide the initial conditions. The probability of each successive word wi depends on the history hi of words already spoken. With this definition, the complexity of the model grows exponentially with the length of the spoken word sequence. To simplify the model and make it usable in practice, it is assumed that only certain aspects of the history affect the probability of the next word. One way to achieve this is to introduce a mapping μ(·) that partitions the space of histories into K equivalence classes. The following model can then be applied:

p(wi | hi) ≈ p(wi | μ(hi)).

The greatest success over the last 20 years has been achieved with simple n-gram models. Most often trigrams are used, in which only the two preceding words determine the probability of the next word. In this case the probability of a word sequence looks like this:

p(W) ≈ Π_{i=1..n} p(wi | w(i-2), w(i-1)).

To estimate the prior probabilities p(W) of the LM, a large amount of training text is needed. During estimation, relative frequencies are computed:

f3(w3 | w1, w2) = c123 / c12,

where c123 is the number of occurrences of the word sequence (w1, w2, w3) and c12 is the number of occurrences of the sequence (w1, w2). For a dictionary of size V there are V^3 possible trigrams; for a dictionary of 20 thousand words that is 8 trillion. Obviously, many of these trigrams never occur in the training sequences, so for them f3(w3 | w1, w2) = 0. To keep the corresponding probabilities from being zero, linear interpolation of trigram, bigram and word frequencies is used, together with a uniform distribution over the dictionary:

p(w3 | w1, w2) = λ3 · f3(w3 | w1, w2) + λ2 · f2(w3 | w2) + λ1 · f1(w3) + λ0 / V.

f1() and f2() are estimated by counting the corresponding words and bigrams. The interpolation coefficients λ are estimated by maximizing the probability of held-out data that did not participate in the computation of the n-gram frequencies. The maximization uses the forward-backward algorithm (formulas (4.2)-(4.5)).

In general, more than one vector λ can be used. It is also advisable to place greater confidence in trigram frequencies estimated from a larger number of training sequences. To do this, the weighting coefficients λ are made dependent on the counts of the bigrams and words, b(c12, c2), that make up the history of the word in question. This method is called deleted interpolation. Other smoothing schemes are also used. When modeling a language with trigrams, the volume of training text typically ranges from 1 million to 500 million words, with corresponding dictionary sizes from 1 thousand to 267 thousand words.
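
Here is a minimal sketch of an interpolated trigram model in the spirit of the formulas above; the interpolation weights are fixed illustrative values rather than being estimated by deleted interpolation as the text describes, and all names and example sentences are invented.

```python
from collections import Counter

class InterpolatedTrigramLM:
    """Trigram model with linear interpolation of trigram, bigram and unigram
    frequencies plus a uniform floor over the vocabulary."""
    def __init__(self, lambdas=(0.5, 0.3, 0.15, 0.05)):
        self.l3, self.l2, self.l1, self.l0 = lambdas
        self.tri, self.hist2 = Counter(), Counter()   # c123 and c12
        self.bi, self.hist1 = Counter(), Counter()    # c23 and c2
        self.uni, self.total = Counter(), 0

    def train(self, sentences):
        for words in sentences:
            w1, w2 = "<s>", "<s>"
            for w in words:
                self.tri[(w1, w2, w)] += 1; self.hist2[(w1, w2)] += 1
                self.bi[(w2, w)] += 1;      self.hist1[w2] += 1
                self.uni[w] += 1;           self.total += 1
                w1, w2 = w2, w

    def prob(self, w, w1, w2):
        """Interpolated p(w | w1, w2); V is the observed vocabulary size."""
        V = max(len(self.uni), 1)
        f3 = self.tri[(w1, w2, w)] / self.hist2[(w1, w2)] if self.hist2[(w1, w2)] else 0.0
        f2 = self.bi[(w2, w)] / self.hist1[w2] if self.hist1[w2] else 0.0
        f1 = self.uni[w] / self.total if self.total else 0.0
        return self.l3 * f3 + self.l2 * f2 + self.l1 * f1 + self.l0 / V

lm = InterpolatedTrigramLM()
lm.train([["call", "home", "now"], ["call", "the", "office"]])
print(lm.prob("home", "<s>", "call"))
```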

5.3 Complexity (perplexity)

Recognition systems can be compared using the error rate, and this metric is also the best for evaluating language models. However, there is a less expensive way to evaluate LMs. It uses a quantity that characterizes the amount of information: entropy. The idea is to compute the entropy of new text that was not used when building the model. The entropy computed directly from the text is compared with the entropy computed from the LM; the LM whose entropy is closest to that of the text is the best.

Let us denote by p(x) the true probability distribution over text segments x consisting of k words. Define the per-word entropy of the text as

H = - (1/k) Σ_x p(x) · log2 p(x),

where the sum runs over all word sequences x of length k. If the words in the text are equiprobable and the dictionary size is V, then H = log2 V; for other distributions H ≤ log2 V. An LM can be used to assign a probability to a text segment. The per-word log-probability for the LM is

LP = - (1/k) Σ_{i=1..k} log2 pΘ(wi | hi),

where pΘ(wi | hi) are the probabilities determined by the given LM. In the limit, LP is not lower than the entropy of the text. Obviously, the goal of comparing different LMs is to find the one whose log-probability LP is closest to the entropy computed from the text.

Perplexity characterizes the level of the LM log-probability and is defined as 2^LP. Roughly speaking, it is the average number of words from which the next word must be chosen during recognition. Perplexity depends on the speech domain being used. Perplexity values for several speech domains are given in Table 5.1.
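
Given any language model exposing p(w | w1, w2), perplexity can be computed as sketched below; the uniform toy model in the example is there only to check that a 1000-word uniform vocabulary yields PP = 1000.

```python
import math

def perplexity(prob, sentences):
    """PP = 2^LP, where LP is the average negative log2-probability per word.
    `prob(w, w1, w2)` can be any language model, e.g. the trigram sketch above."""
    log_sum, n_words = 0.0, 0
    for words in sentences:
        w1, w2 = "<s>", "<s>"
        for w in words:
            log_sum += -math.log2(max(prob(w, w1, w2), 1e-12))  # guard against zero
            n_words += 1
            w1, w2 = w2, w
    return 2.0 ** (log_sum / n_words)

print(perplexity(lambda w, w1, w2: 1 / 1000, [["any", "three", "words"]]))  # ~1000
```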


Table 5.1. Perplexity of speech domains

5.4 Dictionary size

The error rate cannot be lower than the percentage of spoken words that are not included in the dictionary. Therefore, the main part of building an LM is developing a dictionary that covers as fully as possible the texts the system is likely to recognize. This remains a largely manual task.

When creating a dictionary, texts are first selected that characterize the task with which the system will work. Then the texts are divided into words using automation tools. Next, each word is associated with a set of its pronunciation options, including possible future options. All obtained pronunciation options are used to compose trigrams.

Table 5.2 shows the percentage of coverage of new texts in English by the recognition system when using a dictionary of a fixed size. In languages ​​with a large number of word forms and dependencies in word formation (German, French), a much larger dictionary is required for the same degree of coverage.

A more rational approach involves compiling a personalized dictionary for each user of the recognition system in addition to the fixed dictionary. Table 5.2 shows the growth in the coverage of new words by such a dynamically customizable system with an initial, fixed dictionary volume of 20 thousand words. The data is compared with a system using a static dictionary of the same size when recognizing text of the represented length.

Table 5.2. Quality of recognition of new texts

5.5 Improved language models

There are many improvements to the trigram-based LM. The main ones are mentioned below.

Class Models

Instead of words in a language model, you can use a set of word classes. Classes can overlap because a word can belong to different classes. Classes can be based on parts of speech, morphological analysis of a word, and can be determined automatically based on statistical relationships. The general class model looks like this:

where ci are classes. If the classes do not intersect, then:
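$$p(w_i \mid h_i) = p\bigl(w_i \mid c(w_i)\bigr)\, p\bigl(c(w_i) \mid c(w_{i-1}), c(w_{i-2})\bigr),$$

where c(w) denotes the single class to which the word w belongs.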

The perplexity of such a model is higher than that based on trigrams, but it decreases when combining models of these two types.

Dynamic models

Here the history spans the entire document. This is done to detect frequently occurring words (for example, in this text the word “model” occurs frequently). Using a cache of such words makes the LM more dynamic and reduces search time.
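A typical way to exploit the cache (a common formulation, not necessarily the exact one intended here) is to interpolate document-level word frequencies with the base trigram model:

$$p(w_i \mid h_i) = \lambda\, p_{cache}(w_i) + (1-\lambda)\, p_{trigram}(w_i \mid w_{i-1}, w_{i-2}),$$

where p_cache is estimated from the part of the current document processed so far and λ is an interpolation weight.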

Combination models

Another approach is to divide the entire speech database into several clusters. To model a new text, a linear combination of trigram models from different clusters is used:
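In standard form, with interpolation weights λ_j that sum to one:

$$p(w_i \mid h_i) = \sum_{j} \lambda_j\, p_j(w_i \mid h_i),$$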

where pj(·) is estimated on the j-th text cluster.

Structural models

In these models, instead of conditioning the probability of a word only on its immediate history, sentence parsing is used. Such parsing establishes connections between distant words, which it has recently been proposed to capture with long-distance (distant) bigrams.

5.6 Prospects

The main areas where efforts are currently focused are:

Dictionary selection

How to define the dictionary of a new speech domain and, in practice, how to personalize the dictionary for a user so as to maximize text coverage. The problem is most acute for languages with a large number of word forms and for Eastern languages, where the very notion of a word is not clearly defined.

Speech domain adaptation

This is the task of building an effective LM for domains for which little machine-readable text is available, as well as of determining the topic of a conversation, which would make it possible to apply a specific thematic model during recognition.

Using language structure

The current level of quality assessment does not yet make it possible to improve recognition by exploiting the structure of the language. Developing a language model based on linguistic structure may be the key to progress in language modeling. Current advances based on purely probabilistic models reflect the infancy of language modeling; progress here is associated with greater structuring of the data.

CONCLUSION

This paper discusses the basic principles of constructing speech recognition systems at the present stage of development, their classification, and the problems they solve. A modern approach to noise immunity of systems is considered.

The structure of such a system, the main tasks solved by its components, the principles of preprocessing the source signal, and the construction of acoustic and language models are presented.

Belenko M.V. 1, Balakshin P.V. 2

1 student, ITMO University; 2 Candidate of Technical Sciences, assistant, ITMO University

COMPARATIVE ANALYSIS OF OPEN SOURCE SPEECH RECOGNITION SYSTEMS

Abstract

The article provides a comparative analysis of the most common open source automatic speech recognition systems. During the comparison, many criteria were used, including system structures, programming languages ​​used for implementation, the availability of detailed documentation, supported recognition languages, and restrictions imposed by the license. Experiments were also conducted on several speech corpora to determine the speed and accuracy of recognition. As a result, for each of the systems considered, recommendations for use were developed with an additional indication of the scope of activity.

Keywords: speech recognition, metric, Word Recognition Rate (WRR), Word Error Rate (WER), Speed ​​Factor (SF), open source


Speech recognition systems (Automatic Speech Recognition systems) are mainly used to simulate communication between a person and a machine, for example for voice control of programs. Speech signal recognition is currently used in a wide range of systems, from smartphone applications to Smart Home systems, and the many research and development centers around the world are further evidence of the relevance of this field. However, the vast majority of systems in operation are proprietary products, i.e. the user or potential developer has no access to their source code. This hinders the integration of speech recognition systems into open source projects. There is also no centralized source of data describing the strengths and weaknesses of open source speech recognition systems. As a result, choosing the optimal speech recognition system for a given task becomes a problem in itself.

As part of this work, six open source systems were considered: CMU Sphinx, HTK, iAtros, Julius, Kaldi and RWTH ASR. The selection is based on the frequency of mention in recent research publications, on active development in recent years, and on popularity among individual software developers. The selected systems were compared on such indicators as recognition accuracy and speed, ease of use, and internal structure.

In terms of accuracy, the systems were compared using the most common metrics, Word Recognition Rate (WRR) and Word Error Rate (WER), which are calculated using the following formulas:
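In their standard form (consistent with the notation explained below and with the fact that WER and WRR in Table 1 sum to 100 %):

$$\mathrm{WER} = \frac{S + I + D}{T}\cdot 100\,\%, \qquad \mathrm{WRR} = \frac{T - S - I - D}{T}\cdot 100\,\% = 100\,\% - \mathrm{WER},$$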

where S is the number of word substitutions, I is the number of word insertions, and D is the number of word deletions needed to turn the recognized phrase into the original phrase, and T is the number of words in the original phrase; both metrics are expressed as percentages. Recognition speed was compared using the Real Time Factor, the ratio of recognition time to the duration of the recognized signal, also known as the Speed Factor (SF). This indicator is calculated by the formula:
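In standard form:

$$\mathrm{SF} = \frac{T_{rec}}{T},$$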

where T_rec is the signal recognition time and T is its duration; SF is measured in fractions of real time.
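As an illustration (a minimal sketch, not the authors' code; the function name is illustrative), WER can be computed with the usual Levenshtein alignment between the reference transcript and the recognizer output:

```python
# Minimal sketch: WER via word-level edit distance between reference and hypothesis.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal number of substitutions, insertions and deletions needed
    # to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # ~33.3 %
```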

All systems were trained on the WSJ1 (Wall Street Journal 1) speech corpus, which contains approximately 160 hours of training data and 10 hours of test data consisting of excerpts from the Wall Street Journal newspaper. The corpus includes English recordings of speakers of both genders.

After conducting the experiment and processing the results, the following table was obtained (Table 1).

Table 1 – Comparison results for accuracy and speed

| System | WER, % | WRR, % | SF |
|---|---|---|---|
| HTK | 19.8 | 80.2 | 1.4 |
| CMU Sphinx (pocketsphinx/sphinx4) | 21.4 / 22.7 | 78.6 / 77.3 | 0.5 / 1 |
| Kaldi | 6.5 | 93.5 | 0.6 |
| Julius | 23.1 | 76.9 | 1.3 |
| iAtros | 16.1 | 83.9 | 2.1 |
| RWTH ASR | 15.5 | 84.5 | 3.8 |

The accuracy and correctness of the study are supported by the fact that the obtained results are close to the results reported when testing these systems on other speech corpora, such as Verbmobil 1, Quaero and EPPS.

The criteria for comparing structures were the language of the system implementation, the algorithms used in recognition, the formats of input and output data, and the internal structure of the software implementation of the system itself.

The speech recognition process in general can be represented in the following stages:

  1. Extracting acoustic features from the input signal.
  2. Acoustic modeling.
  3. Language modeling.
  4. Decoding.

The approaches, algorithms and data structures used by the speech recognition systems under consideration at each of the listed stages are presented in the tables (Tables 2, 3).

Table 2 – Results of comparison of algorithms

| System | Feature extraction | Acoustic modeling | Language modeling | Recognition |
|---|---|---|---|---|
| HTK | MFCC | HMM | N-gram | Viterbi algorithm |
| CMU Sphinx | MFCC, PLP | HMM | N-gram, FST | Viterbi algorithm, Bushderby algorithm |
| Kaldi | MFCC, PLP | HMM, GMM, SGMM, DNN | FST (an N-gram-to-FST converter is provided) | Two-pass forward-backward algorithm |
| Julius | MFCC, PLP | HMM | N-gram, rule-based | Viterbi algorithm |
| iAtros | MFCC | HMM, GMM | N-gram, FST | Viterbi algorithm |
| RWTH ASR | MFCC, PLP, voicedness | HMM, GMM | N-gram, WFST | Viterbi algorithm |

Table 3 – System implementation languages and their structure

| System | Language | Structure |
|---|---|---|
| HTK | C | Modular, in the form of utilities |
| CMU Sphinx (pocketsphinx/sphinx4) | C / Java | Modular |
| Kaldi | C++ | Modular |
| Julius | C | Modular |
| iAtros | C | Modular |
| RWTH ASR | C++ | Modular |

From the point of view of ease of use, such indicators as documentation detail, support for various software and hardware execution environments, licensing restrictions, support for multiple natural recognition languages, and interface characteristics were considered. The results are presented in the following tables (Tables 4, 5, 6, 7, 8).

Table 4 – Availability of documentation

Table 5 - Support for various operating systems

| System | Supported OS |
|---|---|
| HTK | Linux, Solaris, HPUX, IRIX, Mac OS, FreeBSD, Windows |
| CMU Sphinx (pocketsphinx/sphinx4) | Linux, Mac OS, Windows, Android |
| Kaldi | Linux, Windows, FreeBSD |
| Julius | Linux, Windows, FreeBSD, Mac OS |
| iAtros | Linux |
| RWTH ASR | Linux, Mac OS |

Table 6 - System interfaces

Table 7 – Supported recognition languages

Table 8 - Licenses

| System | License |
|---|---|
| HTK | HTK |
| CMU Sphinx (pocketsphinx/sphinx4) | BSD |
| Kaldi | Apache |
| Julius | BSD-like |
| iAtros | GPLv3 |
| RWTH ASR | RWTH ASR |

Having analyzed the results obtained above, it is possible to characterize each of the systems under consideration and develop recommendations for their use.

Kaldi. This system shows the best recognition accuracy of all the systems considered (WER = 6.5 %) and the second-best recognition speed (SF = 0.6). In terms of the algorithms and data structures provided for speech recognition it is also the leader, offering the largest number of modern approaches used in the field, such as neural networks and Gaussian mixture models at the acoustic modeling stage and finite-state transducers at the language modeling stage. It also provides many algorithms for reducing the dimensionality of the acoustic feature vectors and, accordingly, for increasing system performance. Kaldi is written in C++, which has a positive effect on its speed, and has a modular structure, which makes it easy to refactor the system, add new functionality and fix existing errors. In terms of usability Kaldi is also among the leaders. It provides detailed documentation, but the documentation is aimed at readers already experienced in speech recognition, which may hinder newcomers to the field. The system is cross-platform, i.e. it runs on most modern operating systems. Kaldi provides only a console interface, which complicates integration into third-party applications. By default it supports only English and is distributed under the permissive Apache license, i.e. it can be integrated into a commercial product without disclosing its code. This system is well suited to research activity: it offers good recognition accuracy and acceptable speed, implements many modern speech recognition methods, has many ready-made recipes that make it easy to use, and has comprehensive documentation.

CMU Sphinx. This speech recognition system shows mediocre recognition accuracy (WER ≈ 22 %) and the best recognition speed of all those reviewed (SF = 0.5). It should be noted that the highest speed is achieved with the pocketsphinx decoder written in C; the sphinx4 decoder shows only average speed (SF = 1). Structurally this system also uses many modern approaches to speech recognition, including a modified Viterbi algorithm, but fewer than Kaldi; in particular, at the acoustic modeling stage it works only with hidden Markov models. CMU Sphinx includes two decoders: pocketsphinx, implemented in C, and sphinx4, implemented in Java. This allows the system to be used on multiple platforms, including the Android operating system, and also facilitates integration into projects written in Java. The system has a modular structure, which makes it easy to introduce changes and fix errors quickly. In terms of ease of use CMU Sphinx is ahead of Kaldi, since in addition to the console interface it provides an API, which significantly simplifies integration into third-party applications. Its detailed documentation, unlike Kaldi's, is aimed at the novice developer, which greatly simplifies getting acquainted with the system. Another strong point is the number of languages supported by default, i.e. for which language and acoustic models are freely available: besides English they include Russian, Kazakh and a number of others. CMU Sphinx is distributed under the BSD license, which allows integration into commercial projects. This system can be used in commercial projects: it has most of the advantages of Kaldi, although with somewhat worse recognition accuracy, and provides an API on which third-party applications can be built.

HTK. In terms of accuracy and speed this system shows average results among those reviewed (WER = 19.8 %, SF = 1.4). HTK provides only the classic algorithms and data structures of speech recognition, which is explained by the fact that its previous version was released in 2009. A new version of HTK was released at the end of December 2015 but was not considered in this study. The system is implemented in C, which benefits its speed, since C is a low-level programming language. Structurally it is a set of command-line utilities, and it also provides an API known as ATK. In terms of ease of use HTK, along with Julius, is the leading system among those reviewed. Its documentation, the HTK Book, describes not only HTK itself but also the general principles of building speech recognition systems. By default the system supports only English. It is distributed under its own HTK license, which provides access to the system's source code but restricts its redistribution. HTK can be recommended for educational activities in the field of speech recognition: it implements most of the classic approaches to the speech recognition problem, has very detailed documentation that also covers the basic principles of speech recognition in general, and has many tutorials and recipes.

Julius. This system shows the worst accuracy (WER = 23.1 %) and an average recognition speed (SF = 1.3). The acoustic and language modeling stages are carried out with the utilities included in HTK, but decoding is performed by its own decoder, which, like most of the systems discussed, uses the Viterbi algorithm. The system is implemented in C and has a modular structure. It provides a console interface and an API for integration into third-party applications. The documentation, as with HTK, takes the form of a book, the Julius Book. By default Julius supports English and Japanese. It is distributed under a BSD-like license. Julius can also be recommended for educational activities, since it has all the advantages of HTK and additionally offers recognition of a less common language, Japanese.

iAtros. This system shows good recognition accuracy (WER = 16.1 %) and a mediocre speed (SF = 2.1). It is quite limited in the algorithms and data structures it offers for speech recognition, although it does allow Gaussian mixture models to be used as the states of a hidden Markov model at the acoustic modeling stage. The system is implemented in C and has a modular structure. Besides speech recognition it also contains a text recognition module; this is of no great importance for the present study, but it is a distinctive feature that cannot be ignored. In terms of ease of use iAtros is inferior to all the systems examined: it has no documentation, provides no API for embedding in third-party applications, supports only English and Spanish by default, and is not cross-platform at all, running only under operating systems of the Linux family. It is distributed under the GPLv3 license, which does not allow the system to be integrated into commercial projects without disclosing their source code, which makes it unsuitable for commercial activity. iAtros can be used successfully where, in addition to speech recognition, image recognition is also required, since the system provides this capability.

RWTH ASR. In terms of recognition accuracy RWTH ASR shows a good result (WER = 15.5 %), but in terms of speed it is the worst of the systems considered (SF = 3.8). Like iAtros, it can use Gaussian mixture models at the acoustic modeling stage. A distinctive feature is the possibility of using a voicedness feature when extracting the acoustic characteristics of the input signal. The system can also use a weighted finite-state transducer as the language model at the language modeling stage. It is implemented in C++ and has a modular architecture. In terms of ease of use it is second to last: its documentation describes only the installation process, which is clearly not enough to start working with the system. It provides only a console interface and by default supports only English. The system is not sufficiently cross-platform, since it cannot run under the widespread Windows operating system. It is distributed under the RWTH ASR license, under which the code is provided for non-commercial use only, making the system unsuitable for integration into commercial projects. RWTH ASR can be used for problems where recognition accuracy matters but recognition time does not; because of the restrictions imposed by the license, it is completely unsuitable for any commercial activity.

References

  1. CMU Sphinx Wiki [Electronic resource]. – URL: http://cmusphinx.sourceforge.net/wiki/ (access date: 01/09/2017)
  2. Gaida C. Comparing open-source speech recognition toolkits [Electronic resource]. / C. Gaida et al. // Technical Report of the Project OASIS. – URL: http://suendermann.com/su/pdf/oasis2014.pdf (access date: 02/12/2017)
  3. El Moubtahij H. Using features of local densities, statistics and HMM toolkit (HTK) for offline Arabic handwritten text recognition / H. El Moubtahij, A. Halli, K. Satori // Journal of Electrical Systems and Information Technology – 2016. – V 3. No. 3. – P. 99-110.
  4. Jha M. Improved unsupervised speech recognition system using MLLR speaker adaptation and confidence measurement / M. Jha et al. // V Jornadas en Tecnologıas del Habla (VJTH’2008) – 2008. – P. 255-258.
  5. Kaldi [Electronic resource]. – URL: http://kaldi-asr.org/doc (access date: 12/19/2016)
  6. Luján-Mares M. iATROS: A SPEECH AND HANDWRITING RECOGNITION SYSTEM / M. Luján-Mares, V. Tamarit, V. Alabau et al. // V Journadas en Technologia del Habla - 2008. - P. 75-58.
  7. El Amrania M.Y. Building CMU Sphinx language model for the Holy Quran using simplified Arabic phonemes / M.Y. El Amrania, M.M. Hafizur Rahmanb, M.R. Wahiddinb, A. Shahb // Egyptian Informatics Journal – 2016. – V. 17. No. 3. – P. 305–314.
  8. Ogata K. Analysis of articulatory timing based on a superposition model for VCV sequences / K. Ogata, K. Nakashima // Proceedings of IEEE International Conference on Systems, Man and Cybernetics - 2014. - January ed. – P. 3720-3725.
  9. Sundermeyer The rwth 2010 quaero asr evaluation system for english, French, and German / M. Sundermeyer et al. // Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP) – 2011. – P. 2212-2215.
  10. Alimuradov A.K. ADAPTIVE METHOD OF INCREASING THE EFFECTIVENESS OF VOICE CONTROL / A.K. Alimuradov, P.P. Churakov // Proceedings of the International Scientific and Technical Conference “Advanced Information Technologies” – 2016. – P. 196-200.
  11. Bakalenko V.S. Intellectualization of program code input/output using speech technologies: dis. ... Master of Engineering and Technology. – DonNTU, Donetsk, 2016.
  12. Balakshin P.V. Algorithmic and software speech recognition tools based on hidden Markov models for telephone customer support services: dis. ... Cand. Tech. Sci.: 13/05/11: defended 12/10/2015: approved 06/08/2016 / Balakshin Pavel Valerievich. – St. Petersburg: ITMO University, 2014. – 127 p.
  13. Balakshin P.V. DENSITY FUNCTION OF SMM STATE DURATION. ADVANTAGES AND DISADVANTAGES / P.V. Balakshin // Modern problems of science and education. – 2011. – No. 1. – P. 36-39. URL: http://www.science-education.ru/ru/article/view?id=4574 (access date: 11/13/2016).
  14. Belenko M.V. COMPARATIVE ANALYSIS OF OPEN CODE SPEECH RECOGNITION SYSTEMS / M.V. Belenko // Collection of works of the V All-Russian Congress of Young Scientists. T. 2. – St. Petersburg: ITMO University, 2016. – P. 45-49.
  15. Gusev M.N. Speech recognition system: basic models and algorithms / M.N. Gusev, V.M. Degtyarev. – St. Petersburg: Znak, 2013. – 128 p.
  16. Karpov A.A. Multimodal assistive systems for intelligent living space / A.A. Karpov, L. Akarun, A.L. Ronzhin // Proceedings of SPIIRAN. – 2011. – T. 19. – No. 0. – P. 48-64.
  17. Karpov A.A. Methodology for assessing the performance of automatic speech recognition systems / A.A. Karpov, I.S. Kipyatkova // News of the Higher educational institutions. Instrumentation. – 2012. – T. 55. – No. 11. – pp. 38-43.
  18. Tampel I.B. Automatic speech recognition – the main stages over 50 years / I.B. Tampel // Scientific and Technical Journal of Information Technologies, Mechanics and Optics. – 2015. – V. 15. – No. 6. – P. 957–968.


When we listen to someone speak, our inner ear analyzes the frequency spectrum of the sound and the brain perceives the word. Some computers can simulate this process using a spectrum analyzer.

Sound signals enter the analyzer through a microphone, and their spectral characteristics are analyzed. The computer then compares the received signals with a programmed list of phonemes, or acoustic building blocks. Short-term signals are compared with standard word patterns and related to the rules of language and syntax.

This process helps the computer identify spoken words. If the program is sophisticated enough, it can even determine from context whether the word "fruit" or "raft" was spoken. But whether a computer can truly understand speech the way humans do remains a hotly debated topic to this day. A computer can be programmed to respond to certain combinations of words, but will this replace real understanding? Some experts in the field of artificial intelligence believe that within a few decades a computer will be able to hold a relevant, casual conversation with a person. Nevertheless, many experts are convinced that a computer will always be limited to pre-programmed answers.

Voice recognition

Sounds spoken for more than a few seconds are broken up into shorter time segments. The computer then analyzes the frequency components of each segment.

Acoustic analysis

The sound spectrograph represents the spectrum of sound in visible form. In one method of analysis, the normal chain of human voice sounds is broken down into segments that are color-coded to indicate the strength and frequency of their components. Three-dimensional graphs, like the one above, are another way of visualizing such information.

Decision-making

Based on the results of the analysis, the computer decides which word was spoken. It compares the recorded analysis with a list of possible candidates, then applies lexical and syntactic rules to determine whether a particular sound matches a particular word.

Standard speech patterns

The smallest units of speech are defined in terms of the frequency spectrum. Standard speech patterns indicate which unit is present in a given word.

The sound spectrograph (above) performs acoustic analysis of the sounds in spoken words. Here the vowel sound (top left) is compared to the vowel spectrum (bottom).

Sound waves cause the eardrum to vibrate. This vibration is transmitted to several small bones and converted into electrical signals that travel to the brain.