What does computational linguistics study?

Since 2012, the Institute of Linguistics of the Russian State University for the Humanities has been training master's students in the Computational Linguistics program (field of study: Fundamental and Applied Linguistics). The program is designed to prepare professional linguists who have mastered both the foundations of linguistics and modern methods of research, expert analysis and engineering work, and who can participate effectively in the development of innovative computer language technologies.

Developers of large research and commercial systems for automatic text processing take part in teaching, which keeps the master's training connected to the mainstream of modern computational linguistics. Special attention is paid to students' participation in Russian and international conferences.

Among the teachers are the authors of basic textbooks on linguistic specialties, world-class specialists, project managers of large automatic language processing systems: Ya.G. Testelets, I.M. Boguslavsky, V.I. Belikov, V.I. Podlesskaya, V.P. Selegey, L.L. Iomdin, A.S. Starostin, S.A. Sharov, as well as employees of companies that are world leaders in the field of computational linguistics: IBM (Watson system), Yandex, ABBYY (Lingvo, FineReader, Compreno systems).

The training in this program is built on a project approach. Master's students are involved in computational linguistics research both at the Russian State University for the Humanities and at companies that develop automatic text processing software (ABBYY, IBM, etc.), which is a clear advantage both for the students themselves and for their prospective employers. In particular, the program admits students on targeted places, whose training is paid for by their future employers.

Entrance test: "Formal models and methods of modern linguistics." The exact exam time is posted on the website of the Master's Department of the Russian State University for the Humanities.

The program is headed by Vladimir Pavlovich Selegey, head of the Educational and Scientific Center for Computational Linguistics and Director of Linguistic Research at ABBYY, and by Vera Isaakovna Podlesskaya, Doctor of Philology, Professor.

Program for the entrance exam and interview in the discipline “Formal models and methods of modern linguistics.”

Comments on the program

  • Any question in the program may be accompanied by tasks on describing specific linguistic phenomena related to the topic of the question: constructing structures, describing constraints, and outlining possible algorithms for construction and/or identification.
  • Questions marked with an asterisk are optional (they appear as question 3 on the exam tickets). Knowledge of the relevant material is a major advantage for candidates but is not required.
  • In addition to theoretical questions, each exam ticket includes a short fragment of a specialized (linguistic) text in English for translation and discussion. Applicants are required to demonstrate satisfactory command of English scientific terminology and skill in analyzing scientific texts. As an example of a text that should not cause serious difficulties for the applicant, below is a fragment of the article https://en.wikipedia.org/wiki/Anaphora_(linguistics):

In linguistics, anaphora (/əˈnæfərə/) is the use of an expression whose interpretation depends upon another expression in context (its antecedent or postcedent). In a narrower sense, anaphora is the use of an expression that depends specifically upon an antecedent expression and thus is contrasted with cataphora, which is the use of an expression that depends upon a postcedent expression. The anaphoric (referring) term is called an anaphor. For example, in the sentence Sally arrived, but nobody saw her, the pronoun her is an anaphor, referring back to the antecedent Sally. In the sentence Before her arrival, nobody saw Sally, the pronoun her refers forward to the postcedent Sally, so her is now a cataphor (and an anaphor in the broader, but not the narrower, sense). Usually, an anaphoric expression is a proform or some other kind of deictic (contextually-dependent) expression. Both anaphora and cataphora are species of endophora, referring to something mentioned elsewhere in a dialog or text.

Anaphora is an important concept for different reasons and on different levels: first, anaphora indicates how discourse is constructed and maintained; second, anaphora binds different syntactical elements together at the level of the sentence; third, anaphora presents a challenge to natural language processing in computational linguistics, since the identification of the reference can be difficult; and fourth, anaphora tells some things about how language is understood and processed, which is relevant to fields of linguistics interested in cognitive psychology.

THEORETICAL QUESTIONS

GENERAL QUESTIONS OF LINGUISTICS

  • Object of linguistics. Language and speech. Synchrony and diachrony.
  • Levels of language. Formal models of language levels.
  • Syntagmatics and paradigmatics. The concept of distribution.
  • Foundations of interlingual comparisons: typological, genealogical and areal linguistics.
  • *Mathematical linguistics: object and research methods

PHONETICS

  • Subject of phonetics. Articulatory and acoustic phonetics.
  • Segmental and suprasegmental phonetics. Prosody and intonation.
  • Basic concepts of phonology. Typology of phonological systems and their phonetic implementations.
  • *Computer tools and methods of phonetic research
  • *Speech analysis and synthesis.

MORPHOLOGY

  • Subject of morphology. Morphs, morphemes, allomorphs.
  • Inflection and word formation.
  • Grammatical meanings and ways to implement them. Grammatical categories and grammemes. Morphological and syntactic grammatical meanings.
  • The concepts of word form, stem, lemma and paradigm.
  • Parts of speech; basic approaches to identifying parts of speech.
  • *Formal models for describing inflection and word formation.
  • *Morphology in automatic language processing tasks: spell checking, lemmatization, POS-tagging

SYNTAX

  • Subject of syntax. Ways of expressing syntactic relations.
  • Ways of representing the syntactic structure of a sentence. Advantages and disadvantages of dependency trees and constituent structures.
  • Ways to describe linear order. Non-projectivity and discontinuous constituents. The concept of transformation; transformations associated with linear order.
  • The relationship between syntax and semantics: valencies, government patterns, actants and circonstants.
  • Diathesis and voice. Actant derivation.
  • Communicative organization of utterance. Theme and rheme, given and new, contrast.
  • *Main syntactic theories: Meaning-Text Theory, generativism, functional grammar, HPSG
  • *Mathematical models of syntax: the Chomsky classification of formal languages, recognition algorithms and their complexity.

SEMANTICS

  • Subject of semantics. Naive and scientific linguistic pictures of the world. Sapir–Whorf hypothesis.
  • Meaning in language and speech: meaning and referent. Type of reference (denotative status).
  • Lexical semantics. Ways to describe the semantics of a word.
  • Grammatical semantics. Main categories using the example of the Russian language.
  • Semantics of the sentence. Propositional component. Deixis and anaphora. Quantifiers and connectives. Modality.
  • Hierarchy and systematicity of lexical meanings. Polysemy and homonymy. Semantic structure of a polysemous word. The concepts of invariant and prototype.
  • Paradigmatic and syntagmatic relations in vocabulary. Lexical functions.
  • Explication (lexicographic definition). The language of explications. Moscow Semantic School.
  • Semantics and logic. The truth value of the statement.
  • Theory of speech acts. The utterance and its illocutionary force. Performatives. Classification of speech acts.
  • Phraseology: inventory and methods of describing phraseological units.
  • *Models and methods of formal semantics.
  • *Models of semantics in modern computational linguistics.
  • *Distributional and operational semantics.
  • *Basic ideas of construction grammar.

TYPOLOGY

  • Traditional typological classifications of languages.
  • Typology of grammatical categories of noun and verb.
  • Typology of the simple sentence. The main types of constructions: accusative, ergative, active.
  • Typology of word order and Greenberg correlations. Left- and right-branching languages.

LEXICOGRAPHY

  • Vocabulary as an inventory of culture; social variation of vocabulary, lexical usage, norm, codification.
  • Typology of dictionaries (on Russian material). Reflection of vocabulary in dictionaries of various types.
  • Bilingual lexicography involving the Russian language.
  • Descriptive and prescriptive lexicography. Professional linguistic dictionaries.
  • Specifics of the main Russian explanatory dictionaries. Structure of a dictionary entry. Definitions and encyclopedic information.
  • Vocabulary and grammar. The idea of an integral model of language in the Moscow Semantic School.
  • *Methodology of a lexicographer.
  • *Corpus methods in lexicography.

LINGUISTICS OF TEXT AND DISCOURSE

  • The concept of text and discourse.
  • Mechanisms of inter-sentence cohesion. The main types of linguistic means that implement them.
  • The sentence as a unit of language and as an element of text.
  • Supra-phrasal units: principles of their formation and delimitation, basic properties.
  • Main categories of text classification (genre, style, register, subject area, etc.)
  • *Methods for automatic genre classification.

SOCIOLINGUISTICS

  • The subject and boundaries of sociolinguistics and its interdisciplinary nature. Basic concepts of sociology and demography. Levels of language structure and sociolinguistics. Basic concepts and directions of sociolinguistics.
  • Language contacts. Bilingualism and diglossia. Divergent and convergent processes in the history of language.
  • Social differentiation of language. Forms of existence of language. Literary language: usage-norm-codification. Functional areas of language.
  • Language socialization. The hierarchical nature of social and linguistic identity. An individual's linguistic behavior and his communicative repertoire.
  • Methods of sociolinguistic research.

COMPUTATIONAL LINGUISTICS

  • Tasks and methods of computational linguistics.
  • Corpus linguistics. Main characteristics of a corpus.
  • Knowledge representation. Basic ideas of the theory of frames by M. Minsky. FrameNet system.
  • Thesauruses and ontologies. WordNet.
  • Fundamentals of statistical text analysis. Frequency dictionaries. Collocation analysis.
  • *The concept of machine learning.

LITERATURE

Textbooks (basic level)

Baranov A.N. Introduction to Applied Linguistics. M.: Editorial URSS, 2001.

Baranov A.N., Dobrovolsky D.O. Basics of phraseology (short course). Study guide. 2nd ed. M.: Flinta, 2014.

Belikov V.I., Krysin L.P. Sociolinguistics. M.: Russian State University for the Humanities, 2001.

Burlak S.A., Starostin S.A. Comparative historical linguistics. M.: Academy, 2005.

Vakhtin N.B., Golovko E.V. Sociolinguistics and sociology of language. St. Petersburg, 2004.

Knyazev S.V., Pozharitskaya S.K. Modern Russian literary language: Phonetics, graphics, orthography, orthoepy. 2nd ed. M., 2010.

Kobozeva I.M. Linguistic semantics. M.: Editorial URSS. 2004.

Kodzasov S.V., Krivnova O.F. General phonetics. M.: RSUH, 2001.

Krongauz M.A. Semantics. M.: RSUH. 2001.

Krongauz M.A. Semantics: Tasks, assignments, texts. M.: Academy, 2006.

Maslov Yu.S. Introduction to linguistics. 6th ed., stereotyped. M.: Academy; Faculty of Philology, St. Petersburg State University.

Plungyan V.A. General morphology: Introduction to the subject. Ed. 2nd. M.: Editorial URSS, 2003.

Testelets Ya.G. Introduction to general syntax. M., 2001.

Shaikevich A.Ya. Introduction to linguistics. M.: Academy. 2005.

Scholarly and reference works

Apresyan Yu.D. Selected works, Volume I. Lexical semantics. 2nd ed., revised and expanded. M.: School "Languages of Russian Culture", 1995.

Apresyan Yu.D. Selected works, Volume II. Integral description of language and systemic lexicography. M.: School "Languages of Russian Culture", 1995.

Apresyan Yu.D. (ed.) New explanatory dictionary of synonyms of the Russian language. Moscow; Vienna: "Languages of Russian Culture", Wiener Slavistischer Almanach, Sonderband 60, 2004.

Apresyan Yu.D. (ed.) Linguistic picture of the world and systemic lexicography. M.: "Languages of Slavic Cultures", 2006. Preface and ch. 1, pp. 26-74.

Bulygina T.V., Shmelev A.D. Linguistic conceptualization of the world (based on Russian grammar). M.: School "Languages of Russian Culture", 1997.

Weinreich U. Language contacts. Kyiv, 1983.

Wierzbicka A. Semantic universals and description of languages. M.: School "Languages of Russian Culture", 1999.

Galperin I.R. Text as an object of linguistic research. 6th ed. M.: LKI, 2008 ("Linguistic heritage of the 20th century").

Zaliznyak A.A. "Russian nominal inflection" with an appendix of selected works on the modern Russian language and general linguistics. M.: Languages of Slavic Culture, 2002.

Zaliznyak A.A., Paducheva E.V. Toward a typology of the relative clause // Semiotics and computer science, vol. 35. M., 1997, pp. 59-107.

Ivanov Vyach. Vs. Linguistics of the third millennium: Questions for the future. M., 2004. Pp. 89-100 (11. Language situation in the world and forecast for the near future).

Kibrik A.E. Essays on general and applied issues of linguistics. M.: Moscow State University Publishing House, 1992.

Kibrik A.E. Language constants and variables. St. Petersburg: Aletheya, 2003.

Labov W. On the mechanism of linguistic change // New in linguistics. Issue 7. M., 1975. Pp. 320-335.

Lyons J. Linguistic semantics: Introduction. M.: Languages ​​of Slavic culture. 2003.

Lyons John. Language and linguistics. Introductory course. M: URSS, 2004

Lakoff G. Women, fire and dangerous things: What the categories of language tell us about thinking. M.: Languages of Slavic Culture, 2004.

Lakoff G., Johnson M. Metaphors we live by. Transl. from English. 2nd ed. M.: URSS, 2008.

Linguistic encyclopedic dictionary / Ed. V.N. Yartseva. M.: Scientific publishing house "Great Russian Encyclopedia", 2002.

Melchuk I.A. Course in general morphology. Vols. I-IV. Moscow; Vienna: "Languages of Slavic Culture", Wiener Slavistischer Almanach, Sonderband 38/1-38/4, 1997-2001.

Melchuk I.A. An essay on the theory of "Meaning ↔ Text" linguistic models. M.: School "Languages of Russian Culture", 1999.

Fedorova L.L. Semiotics. M., 2004.

Filippov K.A. Linguistics of text: Course of lectures. 2nd ed., revised and expanded. St. Petersburg University Publishing House, 2007.

Haspelmath, M., et al. (eds.). World Atlas of Language Structures. Oxford, 2005.

Dryer, M.S. and Haspelmath, M.(eds.) The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology, 2013. (http://wals.info)

Croft W. Typology and Universals. Cambridge: Cambridge University Press, 2003.

Shopen, T. (ed.). Language Typology and Syntactic Description. 2nd edition. Cambridge, 2007.

Belikov V.I. On dictionaries "containing the norms of the modern Russian literary language when it is used as the state language of the Russian Federation". 2010 // Gramota.Ru portal (http://gramota.ru/biblio/research/slovari-norm)

Computational linguistics and intellectual technologies: Proceedings of the annual International Conference "Dialogue". Vols. 1-11. M.: Nauka; RSUH Publishing House, 2002-2012. (Articles on computational linguistics, http://www.dialog-21.ru).

National Corpus of the Russian Language: 2006-2008. New results and prospects / Ed.-in-chief V.A. Plungyan. St. Petersburg: Nestor-History, 2009.

New in foreign linguistics. Vol. XXIV, Computational linguistics / Comp. B. Yu. Gorodetsky. M.: Progress, 1989.

Shimchuk E. G. Russian lexicography: Textbook. M.: Academy, 2009.

National Corpus of the Russian Language: 2003-2005. Collected articles. M.: Indrik, 2005.

For contacts:

Educational and Scientific Center for Computational Linguistics of the Institute of Linguistics of the Russian State University for the Humanities

The Faculty of Philology of the Higher School of Economics is launching a new master's program in computational linguistics. It welcomes applicants with a basic education in either the humanities or mathematics, and anyone interested in solving problems in one of the most promising branches of science. Its director, Anastasia Bonch-Osmolovskaya, told Theories and Practices what computational linguistics is, why robots will not replace humans, and what will be taught in the HSE master's program in computational linguistics.

This program is almost the only one of its kind in Russia. Where did you study?

I studied at Moscow State University, in the Department of Theoretical and Applied Linguistics of the Faculty of Philology. I didn't get there right away: first I entered the Russian department, but then I became seriously interested in linguistics, and I was drawn by the atmosphere that the department retains to this day. The most important thing there is the good contact between teachers and students and their mutual interest.

When I had children and needed to earn a living, I went into commercial linguistics. In 2005 it was not very clear what this field actually was. I worked in different linguistic companies: I started at a small company attached to the site Public.ru, a kind of media library, where I began working on linguistic technologies. Then I worked for a year at Rosnanotech, where there was an idea to create an analytical portal on which data would be structured automatically. After that I headed the linguistic department at Avicomp, which is serious production work in computational linguistics and semantic technologies. At the same time, I taught a course on computational linguistics at Moscow State University and tried to make it more modern.

Two resources for a linguist: the first is a site created by linguists for scientific and applied research on the Russian language. It is a model of the Russian language, represented by a huge array of texts of different genres and periods. The texts carry linguistic markup, which lets you obtain information about the frequency of particular linguistic phenomena. The second, WordNet, is a huge lexical database of English; its main idea is to connect not words but their meanings into one large network. WordNet can be downloaded and used in your own projects.

What does computational linguistics do?

This is the most interdisciplinary field. The most important thing here is to understand what is going on in the electronic world and who will help you do specific things.

We are surrounded by a very large amount of digital information, and there are many business projects whose success depends on processing it; these projects may be in marketing, politics, economics or anything else. It is very important to be able to handle this information effectively. What matters is not only the speed of processing but also how easily you can filter out the noise, get the data you need and build a complete picture from it.

Previously, computational linguistics was associated with some global ideas. For example, people thought that machine translation would replace human translation and that robots would work instead of people. Now that seems like a utopia, and machine translation is used in search engines for quick searches in a language you do not know. That is, linguistics now rarely deals with abstract problems; mostly it deals with small things that can be built into a large product to make money.

One of the big tasks of modern linguistics is the semantic web, where search works not just by matching words but by meaning, and all sites are in one way or another semantically annotated. This can be useful, for example, for the police or medical reports that are written every day. Analyzing internal connections yields a lot of necessary information, while reading and counting it manually is incredibly time-consuming.

In a nutshell: we have a thousand texts, and we need to sort them into groups, represent each text as a structure and obtain a table that we can then work with. This is called unstructured information processing. On the other hand, computational linguistics also deals, for example, with generating artificial texts. There is a company that has come up with a mechanism for generating texts on topics that are boring for a person to write about: changes in real estate prices, weather forecasts, reports on football matches. Commissioning such texts from a person is much more expensive, and the computer-generated texts on these topics are written in coherent human language.

In Russia, Yandex is actively involved in developments in the field of searching unstructured information, and Kaspersky Lab hires research groups that study machine learning. Is anyone on the market trying to come up with something new in computational linguistics?

**Books on computational linguistics:**

Daniel Jurafsky, Speech and Language Processing

Christopher Manning, Prabhakar Raghavan, Heinrich Schuetze, "Introduction to Information Retrieval"

Yakov Testelets, “Introduction to General Syntax”

Most linguistic developments are the property of large companies; almost nothing can be found in the public domain. This slows down the development of the industry; we do not have a free linguistic market or packaged solutions.

In addition, there is a lack of comprehensive information resources. There is such a project as the National Corpus of the Russian Language. It is one of the best national corpora in the world; it is developing rapidly and opens up incredible opportunities for scientific and applied research. The difference is about the same as in biology before and after DNA research.

But many resources do not exist for Russian. For instance, there is no analogue of such a wonderful English-language resource as FrameNet, a conceptual network in which all possible connections of a particular word with other words are formally represented. Take the word "fly": who can fly, where to, which prepositions the word is used with, which words it combines with, and so on. This resource helps to connect language with real life, that is, to track how a specific word behaves at the level of morphology and syntax. It is very useful.

The Avicomp company is currently developing a plugin for searching articles with similar content. That is, if you are interested in an article, you can quickly look at the history of the plot: when the topic arose, what was written and when was the peak of interest in this problem. For example, with the help of this plugin it will be possible, starting from an article devoted to events in Syria, to very quickly see how events there have developed over the past year.

How will the learning process in the master's program be structured?

Education at HSE is organized in separate modules, just like in Western universities. Students will be divided into small teams, mini-startups - that is, we should get several finished projects. We want to get real products, which we will then open to people and leave in the public domain.

In addition to the students' immediate project supervisors, we want to find them mentors from among their potential employers, from Yandex, for example, who will also join this game and give the students advice.

I hope that people from a variety of fields will come to the master's program: programmers, linguists, sociologists, marketers. We will have several adaptation courses in linguistics, mathematics and programming. Then we will have two serious courses in linguistics tied to the most relevant linguistic theories; we want our graduates to be able to read and understand modern linguistic articles. It is the same with mathematics: we will have a course called "Mathematical Foundations of Computational Linguistics", which will outline the branches of mathematics on which modern computational linguistics is based.

To enroll in the master's program, you need to pass an entrance examination in language and pass a portfolio competition.

In addition to the main courses, there will be a line of elective subjects. We have planned several cycles. Two of them focus on more in-depth study of individual topics, such as machine translation and corpus linguistics, and one, on the contrary, covers related areas such as social networks, machine learning or Digital Humanities, a course that we hope will be taught in English.

COURSE WORK

in the discipline "Informatics"

on the topic: “Computational linguistics”


INTRODUCTION

2. Modern interfaces for computational linguistics

CONCLUSION

LITERATURE


Introduction

Automated information technologies play an important role in the life of modern society, and their importance grows continuously over time. But information technology is developing very unevenly: while the current level of computer technology and means of communication is astonishing, successes in the semantic processing of information are much more modest. These successes depend, first of all, on progress in studying the processes of human thinking and of verbal communication between people, and on the ability to model these processes on a computer.

When it comes to creating promising information technologies, the problems of automatically processing textual information presented in natural languages come to the fore. This is because a person's thinking is closely connected with his language. Moreover, natural language is a tool for thinking. It is also a universal means of communication between people: a means of perceiving, accumulating, storing, processing and transmitting information. The problems of using natural language in automatic information processing systems are studied by the science of computational linguistics. This science arose relatively recently, at the turn of the 1950s and 1960s. Over the past half century, significant scientific and practical results have been obtained in the field: systems for machine translation of texts from one natural language to another, systems for automated information retrieval in texts, systems for automatic analysis and synthesis of oral speech, and many others have been created. This work is devoted to constructing an optimal computer interface, using computational linguistics, for conducting linguistic research.


1. The place and role of computational linguistics in linguistic research

In the modern world, computational linguistics is increasingly being used to conduct various linguistic studies.

Computational linguistics is a field of knowledge concerned with solving problems of automatic processing of information presented in natural language. Its central scientific problems are the problem of modeling the process of understanding the meaning of texts (the transition from a text to a formalized representation of its meaning) and the problem of speech synthesis (the transition from a formalized representation of meaning to texts in natural language). These problems arise in a number of applied tasks, in particular: automatic detection and correction of errors when texts are entered into a computer, automatic analysis and synthesis of oral speech, automatic translation of texts from one language to another, communication with a computer in natural language, automatic classification and indexing of text documents, their automatic abstracting, and searching for documents in full-text databases.

Linguistic tools created and used in computational linguistics can be divided into two parts: declarative and procedural. The declarative part includes dictionaries of units of language and speech, texts and various types of grammatical tables, the procedural part includes means of manipulating units of language and speech, texts and grammar tables. Computer interface refers to the procedural part of computational linguistics.

Success in solving applied problems of computer linguistics depends, first of all, on the completeness and accuracy of the representation of declarative means in computer memory and on the quality of procedural means. To date, the required level of solving these problems has not yet been achieved, although work in the field of computational linguistics is being carried out in all developed countries of the world (Russia, USA, England, France, Germany, Japan, etc.).

Nevertheless, serious scientific and practical achievements in computational linguistics can be noted. Thus, in a number of countries (Russia, USA, Japan, etc.) experimental and industrial systems for machine translation of texts from one language to another have been built, as well as a number of experimental systems for communicating with computers in natural language; work is underway to create terminological data banks, thesauruses, and bilingual and multilingual machine dictionaries (Russia, USA, Germany, France, etc.); systems for automatic analysis and synthesis of oral speech are being built (Russia, USA, Japan, etc.); and research is underway on constructing natural language models.

An important methodological problem of applied computational linguistics is the correct assessment of the necessary relationship between the declarative and procedural components of automatic text information processing systems. What should be preferred: powerful computational procedures based on relatively small vocabulary systems with rich grammatical and semantic information, or a powerful declarative component with relatively simple computer interfaces? Most scientists believe that the second way is preferable. It will lead to the achievement of practical goals faster, since there will be fewer dead ends and difficult obstacles to overcome, and here it will be possible to use computers on a larger scale to automate research and development.

The need to mobilize efforts, first of all, on the development of the declarative component of automatic text information processing systems is confirmed by half a century of experience in the development of computer linguistics. After all, here, despite the undeniable successes of this science, the passion for algorithmic procedures has not brought the expected success. There was even some disappointment in the capabilities of procedural means.

In light of the above, a promising path for the development of computational linguistics is one in which the main efforts are aimed at creating powerful dictionaries of language and speech units, studying their semantic-syntactic structure, and creating basic procedures for the morphological, semantic-syntactic and conceptual analysis and synthesis of texts. This will make it possible to solve a wide range of applied problems in the future.

Computer linguistics faces, first of all, the tasks of linguistic support for the processes of collecting, accumulating, processing and retrieving information. The most important of them are:

1. Automation of the compilation and linguistic processing of machine dictionaries;

2. Automation of the processes of detecting and correcting errors when entering texts into a computer;

3. Automatic indexing of documents and information requests;

4. Automatic classification and abstracting of documents;

5. Linguistic support for information retrieval processes in monolingual and multilingual databases;

6. Machine translation of texts from one natural language to another;

7. Construction of linguistic processors that ensure user communication with automated intelligent information systems (in particular, expert systems) in natural language, or in a language close to natural;

8. Extracting factual information from informal texts.

Let us dwell in detail on the problems most relevant to the topic of research.

In the practical work of information centers, there is a need to solve the problem of automated detection and correction of errors in texts as they are entered into a computer. This complex task can be conditionally divided into three tasks: orthographic, syntactic and semantic control of texts. The first can be solved using a morphological analysis procedure that relies on a fairly powerful reference machine dictionary of word stems. During spelling control, the words of the text undergo morphological analysis. If their stems are identified with stems in the reference dictionary, they are considered correct; if not, they are presented to a person for review together with a microcontext. The person detects and corrects the distorted words, and the corresponding software system enters these corrections into the text.
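The spelling-control procedure described above can be sketched roughly as follows. This is only an illustration: the stem dictionary, the list of endings and the function names are assumptions made for the example, and real morphological analysis is far richer than stripping endings.

```python
import re

# Hypothetical reference dictionary of word stems (illustrative only).
STEM_DICTIONARY = {"comput", "linguist", "text", "dictionar", "process", "and"}
# Hypothetical list of endings; the empty string means "no ending".
ENDINGS = ["ation", "ing", "ies", "ics", "ers", "er", "ed", "es", "y", "s", ""]


def find_stem(word):
    """Return the dictionary stem of `word`, or None if no stem is identified."""
    lowered = word.lower()
    for ending in ENDINGS:
        if ending and not lowered.endswith(ending):
            continue
        stem = lowered[: len(lowered) - len(ending)] if ending else lowered
        if stem in STEM_DICTIONARY:
            return stem
    return None


def suspicious_words(text, window=2):
    """Yield (word, microcontext) for each word whose stem is not in the dictionary."""
    words = re.findall(r"[A-Za-z]+", text)
    for i, word in enumerate(words):
        if find_stem(word) is None:
            # The microcontext is the word plus `window` neighbours on each side.
            context = " ".join(words[max(0, i - window): i + window + 1])
            yield word, context


for word, context in suspicious_words("Computers process dictionaries and texst"):
    print(f"check {word!r} in: {context}")
```

The flagged words and their microcontexts would then be shown to a person, who makes the actual correction.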

The task of syntactic control of texts in order to detect errors in them is much more difficult than the task of spelling control. Firstly, it includes spelling control as an obligatory component, and secondly, the problem of parsing unformalized texts has not yet been fully resolved. However, partial syntactic control of texts is quite possible. There are two ways to proceed: either compile fairly representative machine dictionaries of reference syntactic structures and compare the syntactic structures of the analyzed text with them, or develop a complex system of rules for checking the grammatical consistency of text elements. The first path seems to us more promising, although it does not, of course, exclude the use of elements of the second. The syntactic structure of texts should be described in terms of grammatical classes of words (more precisely, in the form of sequences of sets of grammatical information for words).

The task of semantic control of texts in order to detect semantic errors in them belongs to the class of artificial intelligence tasks. It can be solved in full only on the basis of modeling the processes of human thinking. In this case, it will apparently be necessary to create powerful encyclopedic knowledge bases and software tools for knowledge manipulation. Nevertheless, for limited subject areas and for formalized information, this task is quite solvable. It should be posed and solved as a problem of semantic-syntactic control of texts.
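A toy sketch of the first path (comparison with a dictionary of reference syntactic structures) might look like this. It simplifies the description above by assigning a single grammatical class to each word instead of a set of grammatical information; the lexicon, the class labels and the reference structures are all invented for the example.

```python
# Hypothetical lexicon mapping words to grammatical classes (illustrative only).
WORD_CLASSES = {
    "dry": "ADJ", "wood": "NOUN", "burns": "VERB", "easily": "ADV", "the": "DET",
}
# Hypothetical machine dictionary of reference syntactic structures.
REFERENCE_STRUCTURES = {
    ("ADJ", "NOUN", "VERB", "ADV"),
    ("DET", "NOUN", "VERB"),
}


def class_sequence(sentence):
    """Describe a sentence as a sequence of grammatical classes of its words."""
    return tuple(WORD_CLASSES.get(word, "UNKNOWN") for word in sentence.lower().split())


def is_known_structure(sentence):
    """True if the sentence's class sequence matches a reference structure."""
    return class_sequence(sentence) in REFERENCE_STRUCTURES


print(is_known_structure("Dry wood burns easily"))
print(is_known_structure("Wood dry easily burns"))
```

Sentences whose class sequence is not found among the reference structures would be candidates for review, not proven errors.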

The problem of automating the indexing of documents and queries is traditional for automated text information retrieval systems. At first, indexing was understood as the process of assigning classification indices to documents and queries that reflected their thematic content. Subsequently, this concept was transformed and the term “indexing” began to refer to the process of translating descriptions of documents and queries from natural language into formalized language, in particular, into the language of “search images”. Search images of documents began, as a rule, to be drawn up in the form of lists of keywords and phrases reflecting their thematic content, and search images of queries - in the form of logical structures in which keywords and phrases were connected to each other by logical and syntactic operators.

It is convenient to index documents automatically from the texts of their abstracts (if any), since abstracts reflect the main content of documents in concentrated form. Indexing can be carried out with or without thesaurus control. In the first case, the title of the document and its abstract are searched for key words and phrases from the reference machine dictionary, and only those found in the dictionary are included in the document search image. In the second case, key words and phrases are extracted from the text and included in the document search image regardless of whether they belong to any reference dictionary. A third option was also implemented, in which the document search image included, along with terms from the machine thesaurus, terms extracted from the title and the first sentence of the document abstract. Experiments have shown that document search images compiled automatically from titles and abstracts provide greater search completeness than those compiled manually. This is explained by the fact that an automatic indexing system reflects the various aspects of document content more fully than a manual indexing system.
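The three indexing options can be illustrated with a small sketch. The thesaurus, the stop-word list and the function names are assumptions made for the example; in particular, "key words" are approximated here by all non-stop words, which is much cruder than a real extraction procedure.

```python
import re

# Hypothetical machine thesaurus of approved key terms (illustrative only).
INDEX_THESAURUS = {"machine translation", "information retrieval", "indexing"}
STOP_WORDS = {"the", "of", "and", "for", "in", "a", "is", "on", "to", "it"}


def normalize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def index_with_thesaurus(title, abstract):
    """Option 1: keep only thesaurus terms found in the title or abstract."""
    joined = " " + " ".join(normalize(title + " " + abstract)) + " "
    return sorted(term for term in INDEX_THESAURUS if f" {term} " in joined)


def index_without_thesaurus(title, abstract):
    """Option 2: keep every non-stop word, regardless of any dictionary."""
    return sorted({w for w in normalize(title + " " + abstract) if w not in STOP_WORDS})


def index_combined(title, abstract):
    """Option 3: thesaurus terms plus free terms from the title and first abstract sentence."""
    first_sentence = abstract.split(".")[0]
    free_terms = index_without_thesaurus(title, first_sentence)
    return sorted(set(index_with_thesaurus(title, abstract)) | set(free_terms))


title = "Indexing for machine translation"
abstract = "Automatic indexing of abstracts. It supports information retrieval."
print(index_with_thesaurus(title, abstract))
print(index_combined(title, abstract))
```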

Automatic indexing of queries poses approximately the same problems as automatic indexing of documents. Here, too, keywords and phrases have to be extracted from the text, and the words in the query text have to be normalized. Logical connectives between keywords and phrases, as well as contextual operators, can be inserted manually or by an automated procedure. An important element of automatic query indexing is the expansion of its keywords and phrases with their synonyms and hyponyms (sometimes also hyperonyms and other terms associated with the original query terms). This can be done automatically or interactively using a machine thesaurus.
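A minimal sketch of thesaurus-based query expansion is given below. The thesaurus entries are invented, and the choice to join original terms with AND and their expansions with OR is only one plausible way of inserting logical connectives, not something prescribed by the text.

```python
# Hypothetical machine thesaurus: term -> related terms by relation type.
QUERY_THESAURUS = {
    "dictionary": {"synonyms": ["lexicon"], "hyponyms": ["thesaurus", "glossary"]},
    "translation": {"synonyms": [], "hyponyms": ["machine translation"]},
}


def expand_term(term, relations=("synonyms", "hyponyms")):
    """Return the term followed by its related terms, without duplicates."""
    entry = QUERY_THESAURUS.get(term, {})
    expanded = [term]
    for relation in relations:
        for related in entry.get(relation, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


def build_query(terms):
    """Join expansions with OR inside a group and groups with AND."""
    groups = []
    for term in terms:
        quoted = " OR ".join(f'"{t}"' for t in expand_term(term))
        groups.append(f"({quoted})")
    return " AND ".join(groups)


print(build_query(["dictionary", "corpus"]))
```

Terms that are absent from the thesaurus, such as "corpus" in this example, pass through unchanged.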

We have already partially considered the problem of automating the search for documentary information in connection with the task of automatic indexing. The most promising here is to search for documents using their full texts, since the use of all kinds of substitutes for this purpose (bibliographic descriptions, search images of documents and the texts of their abstracts) leads to loss of information during the search. The greatest losses occur when bibliographic descriptions are used as substitutes for primary documents, and the smallest losses occur when abstracts are used.

Important characteristics of the quality of information retrieval are its completeness (recall) and accuracy (precision). Completeness can be ensured by taking maximum account of the paradigmatic connections between units of language and speech (words and phrases), and accuracy by taking into account their syntagmatic connections. There is an opinion that the completeness and accuracy of a search are inversely related: measures that improve one of these characteristics worsen the other. But this is true only for a fixed search logic. If this logic is improved, both characteristics can be improved simultaneously.
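Completeness and accuracy correspond to what is usually called recall and precision: recall is the share of relevant documents that were retrieved, and precision is the share of retrieved documents that are relevant. A short sketch with invented document identifiers:

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved and a set of relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search result: four documents returned, three are actually relevant.
recall, precision = recall_and_precision(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Here two of the three relevant documents were found, and two of the four returned documents are relevant.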

It is advisable to build the process of searching for information in full-text databases as an interactive dialogue between the user and the information retrieval system (IRS), in which the user sequentially views text fragments (paragraphs) that satisfy the logical conditions of the request and selects those that are of interest. The final search results may be presented as the full texts of documents or as any fragments of them.

As can be seen from the previous discussion, automatic information search has to overcome the language barrier that arises between the user and the information system because the same meaning can be expressed in texts in a variety of forms. This barrier becomes even more significant if the search has to be carried out in multilingual databases. A radical solution here could be machine translation of document texts from one language to another. This can be done either in advance, before loading documents into a search engine, or during the search itself. In the latter case, the user's request must be translated into the language of the document array in which the search is being conducted, and the search results must be translated into the language of the request. Search engines of this kind already operate on the Internet. VINITI RAS also built a Cyrillic Browser system, which allows searching Russian-language texts using queries in English, with the search results also returned in the user's language.

An important and promising task of computer linguistics is the construction of linguistic processors that ensure user communication with intelligent automated information systems (in particular, expert systems) in natural language or in a language close to natural. Since in modern intelligent systems information is stored in a formalized form, linguistic processors, acting as intermediaries between a person and a computer, must solve the following main tasks: 1) the task of transitioning from the texts of input information requests and messages in natural language to representing their meaning in a formalized language (when entering information into a computer); 2) the task of transition from a formalized representation of the meaning of output messages to its representation in natural language (when issuing information to a person). The first task must be solved by morphological, syntactic and conceptual analysis of input queries and messages, the second - by conceptual, syntactic and morphological synthesis of output messages.

Conceptual analysis of information requests and messages consists of identifying their conceptual structure (the boundaries of the names of concepts and the relationships between concepts in the text) and translating this structure into a formalized language. It is carried out after morphological and syntactic analysis of requests and messages. The conceptual synthesis of messages consists of the transition from the representation of the elements of their structure in a formalized language to a verbal representation. After this, the messages are given the necessary syntactic and morphological form.

For machine translation of texts from one natural language to another, it is necessary to have dictionaries of translation correspondences between the names of concepts. Knowledge about such translation correspondences was accumulated by many generations of people and recorded in special publications: bilingual or multilingual dictionaries. For specialists who have some knowledge of foreign languages, these dictionaries served as valuable aids in translating texts.

In traditional bilingual and multilingual general-purpose dictionaries, translation equivalents were indicated primarily for individual words, and for phrases - much less often. Indication of translation equivalents for phrases was more typical for special terminological dictionaries. Therefore, when translating sections of texts containing polysemantic words, students often encountered difficulties.

Below are translation correspondences between several pairs of English and Russian phrases on “school” topics.

1) The bat looks like a mouse with wings – Bat looks like a mouse with wings.

2) Children like to play in the sand on the beach - Children love to play in the sand on the seashore.

3) A drop of rain fell on my hand - A drop of rain fell on my hand.

4) Dry wood burns easily - dry wood burns well.

5) He pretended not to hear me - He pretended not to hear me.

The English phrases here are not idiomatic expressions. However, their translation into Russian can be considered a simple word-by-word translation only with some stretch, since almost all the words in them are ambiguous. Therefore, only the achievements of computational linguistics can help students here.

Plan:

1. What is computational linguistics?

2. Object and subject of computational linguistics

4. Problems of computational linguistics

5. Research methods for computational linguistics

6. History and reasons for the emergence of computational linguistics

7. Basic terms of computational linguistics

8. Scientists working on the problem of computational linguistics

9. Associations and conferences on computational linguistics

10. Literature used.


Computational linguistics – an independent area of applied linguistics focused on using computers to solve problems involving the use of natural language. (Shilikhina K.M.)


Computational linguistics – one of the areas of applied linguistics; it studies the linguistic foundations of computer science and all aspects of the connection between language and thinking, modeling language and thinking in a computer environment using computer programs. Its interests lie in the areas of: 1) optimization of communication based on linguistic knowledge; 2) creation of a natural language interface and typologies of language understanding for human-machine communication; 3) creation and modeling of computer information systems. (Sosnina E.P.)


Object of computational linguistics – analysis of language in its natural state, as it is used by people in various communication situations, and of how the features of language can be formulated.


Tasks of computational linguistics:


Computational linguistics research methods:

1. Modeling method – used for an object of study that is not accessible to direct observation. According to the definition of the mathematician C. Shannon, a model is a representation of an object in some form that is different from the form of its real existence.

2. Knowledge representation theory method – implies ways of representing knowledge that are oriented towards automatic processing by modern computers.

3. Programming language theory method – programming language theory is a field of computer science concerned with the design, analysis, characterization, classification and study of programming languages and their individual characteristics.


Reasons for the emergence of computational linguistics

1. The emergence of computers

2. The problem of communicating with computers of untrained users


1. Dictionary search system developed at Birkbeck College in London in 1948.

2. Warren Weaver Memorandum

3. The beginning of the introduction of the first computers in the field of machine translation

4. Georgetown Project in 1954


60s-70s of the twentieth century

1. ALPAC (Automatic Language Processing Advisory Committee)

2. A new stage in the development of computer technologies and their active use in linguistic tasks

3. The creation of a new generation of computers and programming languages

4. Increasing interest in machine translation


Late 80s – early 90s of the twentieth century

  • The emergence and active development of the Internet

  • Rapid growth in the volume of text information in electronic form

  • The need for automatic processing of texts in natural language


Modern commercial systems

1. Products of PROMT and ABBYY (Lingvo)

2. Machine translation technologies

3. Translation Memory technologies

  • Reviving texts

  • Communication models

  • Computer lexicography

  • Machine translation

  • Corpus of texts


Natural language text analysis

3 levels of text structure:
  • Surface syntactic structure

  • Deep syntactic structure

  • Semantic level


The problem of synthesis is the inverse of the problem of analysis

Bringing text to life

1. Exchange of texts through visual images on the display screen

2. Two modalities of human thinking: symbolic and visual.


Communication models

1. Imitation of the communication process

2. Creation of an effective dialogue model


Hypertext – a special way of organizing and presenting text, in which several texts or fragments of text can be interconnected by various types of links.


Differences between hypertext and traditional text

Hypertext

1. Processing of spoken language

2. Processing of written text


Spoken speech processing

1. automatic speech synthesis

A) The development of text-to-speech synthesizers. A synthesizer includes 2 blocks: a linguistic text processing block and an acoustic synthesis block.

2. automatic speech recognition


1) text recognition

2) text analysis

3) text synthesis


IRS (information retrieval systems) – software systems for storing, searching and delivering information of interest.

Zakharov V.P. believes that an IRS is an ordered set of documents and information technologies designed for storing and retrieving information - texts or data.


3 types of IRS

  • Manual – a search in the library

  • Mechanized – technical means that ensure the selection of the necessary documents

  • Automatic – searching for information using computers


Computer lexicography

Computer lexicography – one of the important areas of applied linguistics; it deals with the theory and practice of compiling dictionaries.

There are 2 directions in lexicography:
  • Traditional lexicography compiles traditional dictionaries

  • Machine lexicography deals with automation of dictionary preparation and solves problems of developing electronic dictionaries


Tasks of computer lexicography

  • Automatically obtaining various dictionaries from text

  • Creation of dictionaries that are electronic versions of traditional dictionaries or complex electronic linguistic dictionaries for traditional dictionary work, for example LINGVO

  • Development of theoretical and practical aspects of compiling special computer dictionaries, for example for information retrieval, machine translation


Machine translation

Machine translation – converting text in one natural language into another natural language using a computer.

Types of machine translation
  • FAMT (Fully Automated Machine Translation) – fully automatic translation

  • HAMT (Human Aided Machine Translation) – machine translation with human participation

  • MAHT (Machine Aided Human Translation) – translation carried out by a person using auxiliary software and linguistic tools.


  • 2) professional MT – higher quality translation followed by human editing

  • 3) interactive MT – translation in special support systems, carried out in dialogue mode with a computer system. The quality of MT depends on customization options, resources, and the type of texts.

Corpus of texts

Corpus of texts – a certain collection of texts based on a logical concept, a logical idea that unites these texts.

Language corpus – a large, electronically presented, unified, structured, labeled, philologically competent array of language data designed to solve specific linguistic problems.


Representativeness is the most important property of a corpus


The purpose of a language corpus is to show the functioning of linguistic units in their natural contextual environment



Based on the corpus, you can obtain the following data:

1. about the frequency of grammatical categories

2. about frequency changes

3. about changes in contexts in different periods of time

5. about the co-occurrence of lexical units

6. about the features of their compatibility
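Two of the kinds of data listed above, frequency and co-occurrence of lexical units, can be obtained with a few lines of code. The sketch below uses a toy two-sentence corpus and a simple window-based notion of co-occurrence; the tokenizer, the window size and the function names are assumptions made for the illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(corpus):
    """Count how often each word form occurs in the corpus."""
    counts = Counter()
    for text in corpus:
        counts.update(tokenize(text))
    return counts


def cooccurrences(corpus, window=2):
    """Count unordered pairs of distinct words at most `window` tokens apart."""
    pairs = Counter()
    for text in corpus:
        tokens = tokenize(text)
        for i, left in enumerate(tokens):
            for right in tokens[i + 1: i + 1 + window]:
                if left != right:
                    pairs[tuple(sorted((left, right)))] += 1
    return pairs


corpus = ["Dry wood burns easily", "Dry wood burns well"]
print(word_frequencies(corpus).most_common(3))
print(cooccurrences(corpus)[("dry", "wood")])
```

A real corpus would also carry markup (for example, grammatical tags and dates), which is what makes frequency data for grammatical categories and changes over time available.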


Brown Corpus


The logical idea behind a corpus of texts is embodied in: rules for organizing texts into a corpus; algorithms and programs for analyzing a corpus of texts; the associated ideology and methodology.

A national corpus represents a given language at a certain stage (or stages) of its existence and in all the diversity of genres, styles, territorial and social variants, etc.

Basic terms of computational linguistics

  • Programming languages (PL) are a class of artificial languages designed for processing information using a computer. Any programming language is a strict (formal) sign system with the help of which computer programs are written. According to various estimates, there are currently between a thousand and ten thousand different programming languages.

  • Computer science is the science of the patterns of recording, storing, processing, transmitting and using information using technical means.



Information retrieval is the process of finding documents (texts, records, etc.) that correspond to the received request.

Information retrieval system (IRS) is an ordered set of documents (arrays of documents) and information technologies designed for storing and retrieving information - texts (documents) or data (facts).

Machine lexicography (Computational Lexicography) deals with the automation of the preparation of dictionaries and solves the problems of developing electronic dictionaries.

Machine translation is the transformation by a computer of a text in one natural language into a content-equivalent text in another natural language.

Hypertext is a technology for organizing information and specially structured text, divided into separate blocks, having a non-linear presentation, for effective presentation of information in computer environments.


  • Frame – a structure for representing declarative knowledge about a typified, thematically unified situation, i.e. a structure of data about a stereotypical situation.

  • Scenario – a sequence of several episodes in time; it is also a representation of a stereotypical situation or stereotypical behavior, except that the elements of a scenario are steps of an algorithm or instructions.

  • Plan – a representation of knowledge about the possible actions that are necessary to achieve a certain goal.
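One possible way to picture these three knowledge structures in code is shown below. The class layout, slot names and example values are purely illustrative assumptions; the definitions above do not prescribe any particular data structure.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Declarative knowledge about a stereotypical situation: named slots with fillers."""
    name: str
    slots: dict = field(default_factory=dict)

    def fill(self, slot, value):
        """Put a value into an existing slot; unknown slots are rejected."""
        if slot not in self.slots:
            raise KeyError(f"frame {self.name!r} has no slot {slot!r}")
        self.slots[slot] = value

    def unfilled(self):
        """Return the names of slots that still have no filler."""
        return [slot for slot, value in self.slots.items() if value is None]


@dataclass
class Scenario:
    """A stereotypical situation as an ordered sequence of steps."""
    name: str
    steps: list = field(default_factory=list)


@dataclass
class Plan:
    """Possible actions needed to achieve a goal."""
    goal: str
    actions: list = field(default_factory=list)


# Hypothetical example: a visit to a library.
visit = Frame("library_visit", {"reader": None, "book": None})
visit.fill("reader", "student")
print(visit.unfilled())

borrowing = Scenario("borrow_book", ["find book", "show card", "take book"])
plan = Plan(goal="read the book", actions=["go to library", "borrow book"])
```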



Scientists in the field of computational linguistics:

  • Soviet and Russian scientists: Alexey Lyapunov, Igor Melchuk, Olga Kulagina, Yu.D. Apresyan, N.N. Leontyeva, Yu.S. Martemyanov, Z.M. Shalyapina, Igor Boguslavsky, A.S. Narinyani, A.E. Kibrik, A.N. Baranov.

  • Western scientists: Yorick Wilks, Gregory Grefenstette, Greville Corbett, John Carroll, Diana McCarthy, Lluís Màrquez, Dan Moldovan, Joakim Nivre, Victor Raskin, Eduard Hovy.


Associations and conferences in computational linguistics:
1. "Dialogue" – the main Russian conference on computational linguistics with international participation.

The priority of Dialogue is computer modelling of the Russian language. The working languages of the conference are Russian and English. To attract foreign reviewers, the bulk of applied work is submitted in English.

Main directions of the conference:
  • Linguistic semantics and semantic analysis

  • Formal language models and their applications

  • Theoretical and computer lexicography

  • Methods for evaluation of text analysis and machine translation systems

  • Corpus linguistics. Creation, application, evaluation of corpora

  • Internet as a linguistic resource. Linguistic technologies on the Internet

  • Ontologies. Knowledge extraction from texts

  • Computer analysis of documents: abstracting, classification, search

  • Automatic sentiment analysis of texts

  • Machine translation

  • Models of communication. Communication, dialogue and speech act

  • Speech analysis and synthesis



2. Association for Computational Linguistics (ACL) is an international scientific and professional society of people working on problems involving natural language and computing. The annual meeting is held every summer in locations where significant computational linguistics research is being carried out. It was founded in 1962 and originally named the Association for Machine Translation and Computational Linguistics (AMTCL). In 1968 it became the ACL.
  • The ACL has European (EACL) and North American (NAACL) chapters.

  • The ACL journal, Computational Linguistics, is the premier forum for research in computational linguistics and natural language processing. Since 1988 the journal has been published for the ACL by MIT Press.
  • The ACL book series, Research in Natural Language Processing, is published by Cambridge University Press.

  • Every year ACL and its chapters organize international conferences in different countries.

ACL 2014 was held in Baltimore, USA.

References:

1. Marchuk Yu.N. Computer linguistics: textbook. - M.: AST: East-West, 2007. - 317 p.

2. Shilikhina K.M. Fundamentals of applied linguistics: textbook for specialty 021800 (031301) - Theoretical and applied linguistics. - Voronezh, 2006.

3. Boyarsky K.K. Introduction to computational linguistics: textbook. - St. Petersburg: NRU ITMO, 2013. - 72 p.

4. Shchipitsina L.Yu. Information technologies in linguistics: textbook. - M.: FLINTA: Science, 2013. - 128 p.

5. Sosnina E.P. Introduction to applied linguistics: textbook. - 2nd ed., revised and expanded. - Ulyanovsk: Ulyanovsk State Technical University, 2012. - 110 p.

6. Baranov A.N. Introduction to applied linguistics: textbook. - M.: Editorial URSS, 2001. - 360 p.

7. Applied linguistics: textbook / L.V. Bondarko, L.A. Verbitskaya, G.Ya. Martynenko and others; ed. A.S. Gerd. - St. Petersburg: St. Petersburg University Publishing House, 1996. - 528 p.

8. Shemyakin Yu.I. Beginnings of computer linguistics: textbook. - M.: Publishing house MGOU, JSC "Rosvuznauka", 1992.

Computer linguists develop algorithms for recognizing text and spoken speech, synthesizing artificial speech, creating semantic translation systems, and developing artificial intelligence itself (in the classical sense of the word, as a replacement for human intelligence, it is unlikely ever to appear, but various expert systems based on data analysis will).

Speech recognition algorithms will be increasingly used in everyday life: smart homes and electronic devices will not have remote controls and buttons, and a voice interface will be used instead. This technology is being refined, but there are still many challenges: it is difficult for a computer to recognize human speech because different people speak very differently. Therefore, as a rule, recognition systems work well either when they are trained for one speaker and are already adjusted to that speaker's pronunciation, or when the number of phrases that the system can recognize is limited (as, for example, in voice commands for a TV).

Specialists in creating semantic translation programs still have a lot of work ahead: at the moment, good algorithms have been developed only for translation into and from English. There are many problems here. Different languages are structured differently semantically, and this shows even at the level of constructing phrases; not all meanings of one language can be conveyed using the semantic apparatus of another. In addition, the program must distinguish homonyms, correctly recognize parts of speech, and select the meaning of a polysemantic word that fits the context.

Synthesis of artificial speech (for example, for home robots) is also painstaking work. It is difficult to make artificially created speech sound natural to the human ear, because there are millions of nuances that we do not pay attention to, but without which everything is no longer "the same": false starts, pauses, hitches, etc. The speech flow is continuous and at the same time discrete: we speak without pausing between words, yet it is not difficult for us to understand where one word ends and another begins, whereas for a machine this is a big problem.

The largest direction in computational linguistics is related to Big Data. There are huge corpora of texts, such as news feeds, from which certain information has to be extracted: for example, to pick out news stories or to tailor RSS feeds to the tastes of a particular user. Such technologies already exist and will continue to develop, because computing power is growing rapidly. Linguistic analysis of texts is also used to ensure Internet security and to search for information needed by intelligence services.

Where to study to become a computer linguist? In our country, unfortunately, the specialties related to classical linguistics and those related to programming, statistics and data analysis are quite separate. And in order to become a digital linguist, you need to understand both. Foreign universities have higher education programs in computer linguistics, but for now the best option for us is to get a basic linguistic education and then master the basics of IT. It is good that there are now many different online courses; unfortunately, this was not the case during my student years. I studied at the Faculty of Applied Linguistics at Moscow State Linguistic University, where we had courses on artificial intelligence and speech recognition, but still not in sufficient volume. IT companies are now actively trying to interact with institutions. My colleagues from Kaspersky Lab and I also try to participate in the educational process: we give lectures, hold student conferences, and give grants to graduate students. But so far the initiative comes more from employers than from universities.