How the machine does translation
This is an intro to how rule-based machine translation systems work. The examples given are from Apertium.
First step: Select the source and target languages of the translation
In this case: Serbian > English
Second step: Write the phrase or sentence to be translated:
"Velika reka teče kroz grad"
Third step: Source language analysis
The machine then identifies the "grammatical features" of each word. There are two types of grammatical features:
1.) Parts of speech (e.g. adjective, noun, verb, pronoun, etc.)
2.) Sub-categories (e.g. gender, case, number (singular or plural), conjugation, person, etc.)
The combination of these "grammatical features" and the lemma of the word is called a "lexical unit". Each surface form of a lemma has its own lexical unit. For example, "velika" can be either a feminine, singular, nominative adjective or a feminine, plural, nominative adjective. Similarly, in English, "house" is analysed as house, "noun, singular", and "houses" as house, "noun, plural".
The output of the first stage looks like this:
^Velika/veliki<adj><f><nom><sg>/veliki<adj><f><acc><pl>$ ^reka/reka<n><f><nom><sg>/reka<n><f><gen><pl>$ ^teče/teći<vblex><pres><p3><sg>$ ^kroz/kroz<pr>$ ^grad/grad<n><m><acc><sg>/grad<n><m><nom><sg>$
Lexical units are delimited by '^' and '$'. Within a unit, the surface form comes first, and the possible analyses that follow are separated by '/'.
A breakdown of the abbreviations is as follows: adj = adjective, n = noun, pr = preposition, vblex = verb, pres = present tense, p3 = third person, f = feminine, m = masculine, acc = accusative, nom = nominative, gen = genitive, sg = singular, pl = plural.
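To make the format concrete, here is a minimal Python sketch that splits an analysed stream into surface forms and their candidate analyses. The function name parse_stream is made up for this page and is not part of Apertium, and the sketch ignores escaping and other details of the real stream format.

```python
import re

# One lexical unit is everything between '^' and '$'.
UNIT_RE = re.compile(r"\^(.*?)\$")
# One tag is everything between '<' and '>'.
TAG_RE = re.compile(r"<([^>]+)>")


def parse_stream(stream):
    """Return a list of (surface_form, analyses) pairs.

    Each analysis is a (lemma, tags) pair,
    e.g. ("reka", ["n", "f", "nom", "sg"]).
    """
    units = []
    for body in UNIT_RE.findall(stream):
        # The surface form comes first; the analyses follow, separated by '/'.
        surface, *raw_analyses = body.split("/")
        analyses = []
        for raw in raw_analyses:
            lemma = raw.split("<", 1)[0]
            analyses.append((lemma, TAG_RE.findall(raw)))
        units.append((surface, analyses))
    return units


stream = "^Velika/veliki<adj><f><nom><sg>/veliki<adj><f><acc><pl>$ ^kroz/kroz<pr>$"
for surface, analyses in parse_stream(stream):
    print(surface, analyses)
```

An ambiguous word such as "Velika" comes back with two analyses, while "kroz" has only one.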
Fourth step: Choosing an analysis
A statistical method is used to select the most likely analysis from those available. Linguistic rules can be used to augment this. The output of this stage is a series of lexical units, each being the most probable analysis in context:
^veliki<adj><f><nom><sg>$ ^reka<n><f><nom><sg>$ ^teći<vblex><pres><p3><sg>$ ^kroz<pr>$ ^grad<n><m><acc><sg>$
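As a rough illustration of choosing an analysis in context, the Python sketch below walks the sentence left to right and keeps, for each word, the analysis that scores best after the previously chosen one. The scores in SCORES are made up for this example, and the greedy pass is a simplified stand-in; a real tagger would learn its probabilities from a corpus and consider whole tag sequences.

```python
CASES = {"nom", "acc", "gen"}


def coarse(tags):
    """Reduce a tag list to 'pos' or 'pos.case', e.g. 'n.acc' or 'pr'."""
    case = next((t for t in tags if t in CASES), None)
    return tags[0] if case is None else f"{tags[0]}.{case}"


# Made-up transition scores, for illustration only (not learned from data).
SCORES = {
    ("START", "adj.nom"): 0.6,
    ("START", "adj.acc"): 0.2,
    ("adj.nom", "n.nom"): 0.8,
    ("adj.nom", "n.gen"): 0.1,
    ("pr", "n.acc"): 0.7,
    ("pr", "n.nom"): 0.05,
}
DEFAULT = 0.3  # score for any transition not listed above


def choose(sentence):
    """Pick one (lemma, tags) analysis per word, greedily, left to right."""
    prev = "START"
    chosen = []
    for analyses in sentence:
        best = max(analyses, key=lambda a: SCORES.get((prev, coarse(a[1])), DEFAULT))
        chosen.append(best)
        prev = coarse(best[1])
    return chosen


# The candidate analyses from the third step, one list per word.
sentence = [
    [("veliki", ["adj", "f", "nom", "sg"]), ("veliki", ["adj", "f", "acc", "pl"])],
    [("reka", ["n", "f", "nom", "sg"]), ("reka", ["n", "f", "gen", "pl"])],
    [("teći", ["vblex", "pres", "p3", "sg"])],
    [("kroz", ["pr"])],
    [("grad", ["n", "m", "acc", "sg"]), ("grad", ["n", "m", "nom", "sg"])],
]
for lemma, tags in choose(sentence):
    print(lemma, tags)
```

With these scores, "grad" is resolved to the accusative because it follows the preposition "kroz".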
Fifth step: Performing lexical and grammatical transfer
The machine then looks up the 'lemma' of each word in the bilingual dictionary. The 'lemma' of a word is its so-called 'citation form', the form you would look up in a dictionary. For example, the lemma of 'went' is 'go'. In the case of our Serbian sentence, the lemma of 'velika' is 'veliki'.
Next, the machine checks its glossary for possible translations of each lemma. Possible translations of "veliki", for example, are 'big', 'large', and 'great'. In the Apertium system, the most frequent or most general translation is taken. Other systems use statistical or rule-based methods of word-sense disambiguation.
At the same time as performing lexical transfer, the system also allows for word re-ordering within a sentence. In this case no re-ordering is necessary.
The output of this stage is:
^big<adj>$ ^river<n><sg>$ ^pass<vblex><pres><p3><sg>$ ^through<pr>$ ^city<n><sg>$
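The lexical part of this step can be sketched in Python as a dictionary lookup that takes the first listed translation and keeps only the tags the target language needs. The names BIDIX and KEEP are hypothetical; the entries come from the example on this page, and the tags kept per part of speech are an assumption inferred from the output above. Word re-ordering is left out.

```python
# Toy bilingual dictionary: lemma -> translations, most general first.
BIDIX = {
    "veliki": ["big", "large", "great"],
    "reka": ["river"],
    "teći": ["pass"],
    "kroz": ["through"],
    "grad": ["city"],
}

# Tags kept on the English side, per part of speech (assumed from the example).
KEEP = {
    "adj": set(),
    "n": {"sg", "pl"},
    "vblex": {"pres", "p3", "sg", "pl"},
    "pr": set(),
}


def transfer(unit):
    """Translate one (lemma, tags) unit, taking the first dictionary entry."""
    lemma, tags = unit
    pos = tags[0]
    target = BIDIX[lemma][0]
    kept = [t for t in tags[1:] if t in KEEP[pos]]
    return target, [pos] + kept


def to_stream(units):
    """Write (lemma, tags) units back out in the '^...$' notation."""
    return " ".join(
        "^" + lemma + "".join(f"<{t}>" for t in tags) + "$" for lemma, tags in units
    )


source = [
    ("veliki", ["adj", "f", "nom", "sg"]),
    ("reka", ["n", "f", "nom", "sg"]),
    ("teći", ["vblex", "pres", "p3", "sg"]),
    ("kroz", ["pr"]),
    ("grad", ["n", "m", "acc", "sg"]),
]
print(to_stream([transfer(u) for u in source]))
```

Gender and case are dropped because English nouns and adjectives do not inflect for them, while number and the verb tags are carried over.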
Sixth step: Target language generation
From this series of translated lexical units, the target language surface forms are generated, resulting in:
- Big river passes through city.
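Generation can be sketched in Python as turning each translated lexical unit back into a surface form. The spelling rules in add_s are a toy assumption that only covers regular English plurals and third-person verbs; a real generator would look the forms up in a morphological dictionary for the target language.

```python
def add_s(word):
    """Attach a regular English -s ending (toy rules, regular words only)."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and word[-2:-1] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def inflect(lemma, tags):
    """Return the surface form for one (lemma, tags) lexical unit."""
    pos = tags[0]
    if pos == "n" and "pl" in tags:
        return add_s(lemma)
    if pos == "vblex" and {"pres", "p3", "sg"} <= set(tags):
        return add_s(lemma)
    return lemma


def generate(units):
    """Join the surface forms into a capitalised sentence with a full stop."""
    sentence = " ".join(inflect(lemma, tags) for lemma, tags in units)
    return sentence[0].upper() + sentence[1:] + "."


units = [
    ("big", ["adj"]),
    ("river", ["n", "sg"]),
    ("pass", ["vblex", "pres", "p3", "sg"]),
    ("through", ["pr"]),
    ("city", ["n", "sg"]),
]
print(generate(units))
```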
Seventh step: Post-editing
The problem here is that Serbo-Croatian has no articles, so a human needs to post-edit the sentence to add them. For example:
- The big river passes through the city
Flow chart
Using Translation Memory
After going through the process of how machines typically translate using grammar, we started to talk about the use of translation memory by machine translators. Translation memory refers to vast databases of previously translated phrases and sentences. When the machine translator comes across one of these phrases or sentences, it will pull out the entire translation from the translation memory database.
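A translation memory lookup can be sketched in Python as an exact-match table from source sentences to stored translations. The class name TranslationMemory and the normalisation rules are assumptions made for this example; real translation memory tools typically also support approximate ("fuzzy") matching.

```python
def normalize(sentence):
    """Lower-case, collapse whitespace and drop final punctuation."""
    return " ".join(sentence.lower().split()).rstrip(".!?")


class TranslationMemory:
    """Exact-match store of previously translated sentences."""

    def __init__(self):
        self._entries = {}

    def add(self, source, target):
        self._entries[normalize(source)] = target

    def lookup(self, source):
        """Return the stored translation, or None if the sentence is unknown."""
        return self._entries.get(normalize(source))


tm = TranslationMemory()
tm.add("Velika reka teče kroz grad", "The big river passes through the city")

print(tm.lookup("velika reka teče kroz grad."))
print(tm.lookup("Mala reka"))
```

When the lookup returns None, the sentence would go through the step-by-step translation described above instead.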
Fran brought up the difficulty of building translation memory databases due to licensing issues. Google Translate uses a database of translated UN documents because they are released into the public domain.
However, translation memory tends to focus on formal and technical sentences due to the nature of documents that are translated into many languages.
David brought up that there is vast potential to build translation memory banks using the translations of Global Voices because they are direct translations of informal speech. However, again, not all of the content that Global Voices translates is licensed under Creative Commons or in the public domain.
