Machine translation
Where is machine translation at?
Most people in our group have used machine translation services such as Google Translate and Babelfish quite often; in general, they don't work very well.
When the languages are very different, the output of machine translation is not good enough for publishing. It will usually let you understand the "five Ws and H" (who, what, why, when, where and how), but that's about it. The two quality levels are generally termed "translation for assimilation" (good enough to understand) and "translation for dissemination" (good enough to publish).
For languages that are quite different (English and Chinese, for example), you'll find errors such as words in the wrong places and wrong words being used.
For close languages (Spanish and Catalan, for example), high accuracy (e.g. 95%) is possible, which makes the overall translation process faster: after the machine translation, it does not take long to correct the remaining errors and make the text more readable.
How does machine translation work?
Statistical machine translation
Two large bodies of text, one in each language, are aligned side by side, and words and phrases are mapped between the two texts.
To get good translations this way, you need at least 1 million aligned strings.
Example of this software: Moses
Statistical translation is good because it's language independent, but it sometimes loses data (e.g. dates), and the systems are also quite slow. Moses, for example, can translate up to 13 words per second.
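To make the idea of mapping words between aligned texts more concrete, here is a minimal sketch in Python. It is not how Moses works internally; it only counts which words appear together in aligned strings and picks the strongest match using a Dice score. The tiny English/Spanish corpus and all function names are made up for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy corpus: (English string, Spanish string) pairs.
# A real system needs far more data (around 1 million aligned strings).
ALIGNED_PAIRS = [
    ("the house", "la casa"),
    ("the green house", "la casa verde"),
    ("the table", "la mesa"),
    ("the book", "el libro"),
    ("the green book", "el libro verde"),
    ("the car", "el coche"),
]


def count_cooccurrences(pairs):
    """Count how often each word occurs, and how often word pairs share a string pair."""
    src_counts, tgt_counts, joint_counts = Counter(), Counter(), Counter()
    for src_text, tgt_text in pairs:
        src_words = set(src_text.split())
        tgt_words = set(tgt_text.split())
        src_counts.update(src_words)
        tgt_counts.update(tgt_words)
        joint_counts.update(product(src_words, tgt_words))
    return src_counts, tgt_counts, joint_counts


def dice_score(joint, src_total, tgt_total):
    """Dice coefficient: 1.0 means the two words always appear together."""
    return 2 * joint / (src_total + tgt_total)


def best_mapping(pairs):
    """Map each source word to the target word with the highest Dice score."""
    src_counts, tgt_counts, joint_counts = count_cooccurrences(pairs)
    mapping = {}
    for src_word, src_total in src_counts.items():
        mapping[src_word] = max(
            sorted(tgt_counts),  # sorted so that ties are broken predictably
            key=lambda tgt_word: dice_score(
                joint_counts[(src_word, tgt_word)], src_total, tgt_counts[tgt_word]
            ),
        )
    return mapping


if __name__ == "__main__":
    for src_word, tgt_word in sorted(best_mapping(ALIGNED_PAIRS).items()):
        print(src_word, "->", tgt_word)
```

Content words such as 'house' and 'green' are mapped reliably even in this tiny corpus, while a word like 'the' stays ambiguous between 'el' and 'la'. That ambiguity is one reason real systems need so much aligned text.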
Rule-based machine translation
In rule-based machine translation, the text is first analysed and tagged; for example, 'went' is tagged as the past tense of 'to go'.
The next step is 'lexical transfer', followed by chunking, where the lexical information is grouped into phrases, e.g. 'the man' = noun phrase.
The next step is to move the chunks of text around to form sentences.
It's all about mapping words and relating them to each other.
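The steps above (analysis and tagging, lexical transfer, chunking, reordering) can be sketched as a toy pipeline in Python. This is only an illustration under stated assumptions, not Apertium's actual implementation or API: the three small dictionaries, the tag names and the single reordering rule (adjectives move after the noun, as in Spanish) are all hypothetical.

```python
# Hypothetical toy data: none of this comes from a real rule-based system.
ANALYSIS = {  # surface form -> (lemma, tag)
    "the": ("the", "det"),
    "tall": ("tall", "adj"),
    "man": ("man", "noun"),
    "went": ("go", "verb.past"),  # 'went' is tagged as the past tense of 'to go'
}
BILINGUAL = {"the": "el", "tall": "alto", "man": "hombre", "go": "ir"}
GENERATION = {("ir", "verb.past"): "fue"}  # (lemma, tag) -> inflected target form


def analyse(sentence):
    """Step 1: look up the lemma and tag of every word."""
    return [ANALYSIS[word] for word in sentence.lower().split()]


def lexical_transfer(tokens):
    """Step 2: replace each source lemma with its target-language lemma."""
    return [(BILINGUAL[lemma], tag) for lemma, tag in tokens]


def chunk(tokens):
    """Step 3: group determiners and adjectives with the noun that follows them."""
    chunks, pending = [], []
    for token in tokens:
        tag = token[1]
        if tag in ("det", "adj"):
            pending.append(token)
        elif tag == "noun":
            chunks.append(("noun_phrase", pending + [token]))
            pending = []
        else:
            if pending:  # determiners/adjectives that never met a noun
                chunks.append(("other", pending))
                pending = []
            chunks.append(("other", [token]))
    if pending:
        chunks.append(("other", pending))
    return chunks


def reorder(chunks):
    """Step 4: apply a structural rule; here, adjectives go after the noun."""
    reordered = []
    for kind, tokens in chunks:
        if kind == "noun_phrase":
            adjectives = [t for t in tokens if t[1] == "adj"]
            others = [t for t in tokens if t[1] != "adj"]
            tokens = others + adjectives
        reordered.append((kind, tokens))
    return reordered


def generate(chunks):
    """Step 5: turn lemmas and tags back into surface words."""
    words = []
    for _, tokens in chunks:
        for lemma, tag in tokens:
            words.append(GENERATION.get((lemma, tag), lemma))
    return " ".join(words)


def translate(sentence):
    return generate(reorder(chunk(lexical_transfer(analyse(sentence)))))


if __name__ == "__main__":
    print(translate("the tall man went"))
```

Every stage is a dictionary lookup or a fixed rule, with no statistics involved. The price is that someone has to write the dictionaries and rules for each language pair.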
Example of this type of system: Apertium, which can translate up to 3000 words per second.
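To get a feel for what the two speed figures mean in practice, here is a small back-of-the-envelope calculation in Python. The speeds are the "up to" figures quoted in these notes (13 words per second for Moses, 3000 for Apertium), so the results are best cases; the 100,000-word document size is just an assumed example.

```python
MOSES_WORDS_PER_SECOND = 13       # "up to" figure quoted in these notes
APERTIUM_WORDS_PER_SECOND = 3000  # "up to" figure quoted in these notes
DOCUMENT_WORDS = 100_000          # assumed document size, for illustration only


def translation_time_seconds(word_count, words_per_second):
    """Best-case time to translate word_count words at a constant speed."""
    if words_per_second <= 0:
        raise ValueError("words_per_second must be positive")
    return word_count / words_per_second


if __name__ == "__main__":
    moses = translation_time_seconds(DOCUMENT_WORDS, MOSES_WORDS_PER_SECOND)
    apertium = translation_time_seconds(DOCUMENT_WORDS, APERTIUM_WORDS_PER_SECOND)
    print(f"Moses:    about {moses / 3600:.1f} hours")
    print(f"Apertium: about {apertium:.0f} seconds")
    print(f"Ratio:    about {moses / apertium:.0f}x")
```

At these speeds the same document takes roughly two hours with one system and about half a minute with the other.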
Who is funding translation?
Machine translation is funded by many governments and organisations.
The problem, though, is that governments often fund the development of this software and then make it proprietary. Instead, they should be developing a corpus of work that is freely licensed or in the public domain (PD).
Importantly, if people know of PD content, they should announce that it is available for translation.
Open/free licensing is really necessary for distributing and further disseminating translations, as well as for avoiding litigation. Another issue is that rule-based systems can make use of dictionaries, but these are copyrighted.
What are the future trends of MT? Which approach will progress more?
Basically, we will need a combination of both rule-based and statistical approaches in the future.
Can we foresee great advances, especially in OSS?
Between distant languages: for translation for dissemination, no; for translation for assimilation, yes.
