Researchers in NLP, and more specifically in IR, have made extensive use of precision (P) and recall (R) to evaluate their systems. Widely used definitions for P and R are as follows:
P = relevant items system got correct / total number of items system produced or generated

R = relevant items system got correct / total number of relevant items (which the system should have produced)

Now, if we think about evaluating an MT system, the items are the translations and so precision is straightforward to calculate. P = number of correct translations produced by the system / total number of translations produced by the system.

But how do we calculate recall? The numerator is the same as for P (number of correct translations generated by the system), but how does one determine the number of relevant translations? This is almost a philosophical question, since there is not just one translation (or set of translations) that is correct given an SL sentence. Unless there is a fixed set of reference translations (often used by the MT community to evaluate systems with automatic metrics such as BLEU and METEOR), there is no way to know a priori the number of possible translations given an SL sentence.

And this is indeed how most people use recall to evaluate MT systems, taking a set of references as their absolute truth of what is possible and relevant for any sentence. Melamed et al. (2003) define both P and R as conditioned on a set of references X given for a particular test set, and so if Y is the set of translation candidates generated by the system, they define:

Precision(Y|X) = |X ∩ Y| / |Y|

Recall(Y|X) = |X ∩ Y| / |X|
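
To make the set-based definitions concrete, here is a minimal Python sketch. It assumes translations are compared as exact strings, which is my simplification (real metrics match at the word or n-gram level). The function name and the sample sentences are made up for illustration.

```python
def precision_recall(references, candidates):
    """Set-based precision and recall of candidates Y against references X."""
    x = set(references)   # X: what the references say is correct
    y = set(candidates)   # Y: what the system produced
    overlap = len(x & y)  # |X ∩ Y|
    precision = overlap / len(y) if y else 0.0
    recall = overlap / len(x) if x else 0.0
    return precision, recall


# Hypothetical reference set and system output for one source sentence.
X = {"the cat sleeps", "the cat is sleeping", "the cat is asleep"}
Y = {"the cat sleeps", "cat the sleep"}

p, r = precision_recall(X, Y)
print(f"precision={p:.2f} recall={r:.2f}")
```

Note how recall depends entirely on the size of X: adding or removing a reference changes the score without the system output changing at all, which is exactly the source of my unease.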

Multiple References: “One of the main sources of variance in MT evaluation measures is the multiple ways to express any given concept in natural language. A candidate translation can be perfectly correct but very different from a given reference translation. One approach to reducing this source of variance, and thereby improving reliability of MT evaluation, is to use multiple references (Thompson, 1991).”

I can see this is a practical way to solve the problem, but I guess I have my four years of translation training to blame for my resistance to simply accepting this interpretation of the recall measure.


by John Lee and Stephanie Seneff @ Spoken Language Systems, MIT CSAIL
Interspeech – ICSLP (Pittsburgh) 17-21 September

Taken from Interspeech website:

Session Wed3A3O: Technologies for Specific Populations: Learners and Challenged
It’s a poster.
A computer conversational system can potentially help a foreign-language student improve his/her fluency through practice dialogues. One of its potential roles could be to correct ungrammatical sentences. This paper describes our research on a sentence-level, generation-based approach to grammar correction: first, a word lattice of candidate corrections is generated from an ill-formed input. A traditional n-gram language model is used to produce a small set of N-best candidates, which are then reranked by parsing using a stochastic context-free grammar. We evaluate this approach in a flight domain with simulated ill-formed sentences. We discuss its potential applications in a few related tasks.

Notes: They take a couple of error categories relevant to Japanese speakers conversing in English (articles and prepositions, noun number, verb aspect, mode and tense) and use them for their experiments/analysis. They do not use data from real second-language learners for this paper.

First they reduce the supposedly erroneous sentence (in my case it would be incorrect MT output) to its canonical form, where articles, preps, and auxiliaries are stripped off, and nouns and verbs are reduced to their citation form. All their alternative inflections are inserted into the lattice; insertions of articles, preps and aux. are allowed at every position. Second, an n-gram and a stochastic CFG are used as LMs to score all the paths in the lattice. In their experiments, they treat the transcript as a gold standard and they find that their method can correctly reconstruct the transcript 88.7% of the time.
What’s nice about this approach is that it doesn’t need any human corrections. In a way, my thesis research can be seen as a great source of data to train systems similar to this one. A nice side effect of my research is that we obtain MT output annotated with human corrections, so in this setting one can use correction-annotated data to build systems that recover from ill-formed MT output and automatically generate correct translations for it.
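
The canonicalize, expand and rescore steps in the notes above can be sketched in a few lines of Python. This is only a toy under loud assumptions, not the authors' system: the word lists, the three-sentence corpus and all function names are invented; at most one function word may be inserted before each content word (the paper allows insertions at every position); the lattice is enumerated by brute force; and only an add-k smoothed bigram model is used, with the stochastic CFG reranking step left out.

```python
import itertools
import math
from collections import Counter

# Toy resources, all invented for this illustration.
FUNCTION_WORDS = ["a", "from", "the", "to"]  # articles and prepositions
INFLECTIONS = {                               # citation form -> surface forms
    "flight": ["flight", "flights"],
    "leave": ["leave", "leaves", "left"],
}
LEMMA = {form: lemma for lemma, forms in INFLECTIONS.items() for form in forms}
CORPUS = [
    "the flight leaves from boston",
    "the flight leaves from denver",
    "a flight to boston",
]


def canonicalize(sentence):
    """Strip function words and reduce the rest to citation form."""
    return [LEMMA.get(w, w) for w in sentence.split() if w not in FUNCTION_WORDS]


def expand(canonical):
    """Enumerate lattice paths: one optional insertion, then any inflection."""
    slots = []
    for word in canonical:
        slots.append([None] + FUNCTION_WORDS)        # optional function word
        slots.append(INFLECTIONS.get(word, [word]))  # alternative inflections
    for path in itertools.product(*slots):
        yield [w for w in path if w is not None]


def train_bigram(corpus, k=0.01):
    """Return a log-probability scorer from an add-k smoothed bigram model."""
    histories, bigrams = Counter(), Counter()
    for line in corpus:
        tokens = ["<s>"] + line.split() + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(set(histories) | {"</s>"})

    def logprob(words):
        tokens = ["<s>"] + words + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + k) / (histories[a] + k * vocab_size))
            for a, b in zip(tokens, tokens[1:])
        )

    return logprob


def correct(sentence, logprob):
    """Pick the highest-scoring path through the lattice."""
    return " ".join(max(expand(canonicalize(sentence)), key=logprob))


logprob = train_bigram(CORPUS)
print(correct("a flights leave boston", logprob))
```

The small smoothing constant matters here: with add-one smoothing on such a tiny corpus, the model would prefer shorter paths with no insertions at all, simply because they have fewer factors to multiply.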

Contribution of Nizar Habash (Columbia University) to the AMTA Hybrid MT Panel.

The Intuition: StatMT and RuleMT have complementary advantages:
Syntactic structure produces better global target linguistic structure,
Statistical phrase-based translation is more robust locally.

The Resource Challenge
Parallel corpora as models of performance vs. Dictionaries/analyzers as models of competence
“More is better” is true for both approaches

Parallel corpora are domain/genre specific
Dictionaries and parsers can be domain/genre specific

Hybrids may need more data: Annotated resources.

Federico Gaspari (F.Gaspari @ from University of Manchester, United Kingdom:

• Social impact of MT very visible on the Internet

• Only a small minority of languages supported

• Online MT has established a niche for itself

• Online MT promotes social interchange

• Users prepared to accept low-quality output

• Human translation simply not an option

Tsunami webpage to help find/identify victims, written in English and translated into many languages with online MT systems such as Google and Altavista.

Michael McCord (mcmccord @ from IBM Research:
Two social impact projects, sponsored by IBM Corporate Community Relations (CCR) and IBM Research:

1. ¡Tradúcelo Ahora!(Translate it Now): English↔Spanish MT for Latinos.
Server-based: Users need not install anything.
Web page translation. Uses enhancement of IBM product WebSphere Translation Server (WTS).
Email translation. Using any email client, and without installing any software, a user simply writes an email to anyone and copies a certain email account on our server. The email gets translated and sent to the user’s recipients and the user. Handles either Es or En source, and these can be mixed (does language ID).
Smart cross-lingual web search.
Work done by Nelson Correa and Esmé Manandise, M. McCord

To address the Hispanic Digital Divide, CCR has been working in partnership with nearly three dozen major agencies serving the Latino community since 2004.
These agencies receive grants from CCR, use the TA software, and give us feedback for improving the En-Es MT.
This year we are continuing that work, and also working with K-12 schools – doing web page translation, and translation for email between (mainly) Spanish-speaking parents and English-speaking school staff.

A study by the Tomás Rivera Policy Institute concluded that the TA project has benefited the participant organizations and their constituents in significant ways:
It simplified community outreach specialists’ efforts to conduct educational sessions on medical disorders for Spanish-speaking clients;
It enabled staff to more easily research online information about public services, jobs, clinical and legal issues, and translate the web pages for their clients;
It enriched English as a Second Language (ESL) program educational resources;
It augmented and improved Spanish literacy courses;
It made it easier for clients to find employment at popular job search web sites, helped them apply for jobs online, and write resumes and cover letters;
It provided GED and ESL students a significant new tool for conducting research, reading the news, viewing transcripts, etc., and
It provided an additional teaching resource to enhance basic computer-training courses.

2. Cooperation with Meadan on English
Chat/blog system to foster Western-Islamic dialog

CCR and other parts of IBM are cooperating with the Meadan organization (Ed Bice et al.) to build this system. IBM is contributing mainly certain technical pieces:
Arabic↔English MT (Salim Roukos’ group).
Arabic Slot Grammar parser (McCord, Cavalli-Sforza). Uses Buckwalter’s BAMA for morphology. Will be used to improve Ar→En MT and to analyze Arabic text entries directly to make them into a searchable database (ESG is also used for English entries).
Parts of the networking platform (IBM group in England).

Is MT a necessity for social justice in a multi-ethnic society?
Certainly translation is. MT should help when there aren’t enough human translators, and the MT is good enough.

Rami B. Safadi (safadi @ from Sakhr Software USA. Social Impact of Translation Via SMS:

The user sends a message to be translated by dialing a number (#2020); the MT server translates the message and sends it back.

Motivation: For Sakhr Software: Revenues per message translated + Develop a dialect preprocessor. For Mobile phone companies: Value added services to retain customers + Free service.

English to Arabic (80%)
Over 50% Mobile advertisements & subscriptions
About 25% Dictionary, expressions, terminologies and short phrases
About 20% Chatting
About 5% Notifications for Bank accounts, Credit Cards, Prepaid cards….

Arabic to English (20%)
Over 70% Chatting
30% Dictionary, expressions, terminologies and short phrases

Available in 11 countries
Over 10,000 messages per day

Win Laptops, Mp3 players & more!.. Join the Al Shamil Quiz Competition from 3 – 9 August; 5pm – 9pm at the Mall of the Emirates. (School Students only)
Sorry the transferred failed. You do not have sufficient credit.
Tell me ur coming or no i have duty 7 am

… when I took a look at Ed Bice’s slides for the AMTA Social Impact of MT Panel. Ed Bice is the founder of Meadan (ebice @, among many other things (his Pop web page).

hybrid distributed natural language translation (hdnlt) ‘web 2.0’ approach
• Language translation as a distributed service
• People/machines collaborate to provide service
• Volunteer translators as a social network
• Harness collective intelligence – value arises from small, shared
• Reputation driven – translator reputations adjusted by feedback and performance
• Abstractions ease adding devices and services

In 2004, fewer than 1% of the world’s 6800 languages benefit from a high level of computerization, including a broad range of services going from text processing to machine translation. This thesis, which focuses on the other languages – the pi-languages – aims at proposing solutions to remedy their digital underdevelopment. In the first part, intended to show the complexity of the problem, we present the diversity of languages, the technologies used, and the approaches of the various actors: linguistic populations, software publishers, the United Nations, States… A technique for measuring the degree of computerization of a language – the sigma-index – is proposed, as well as several optimization methods. The second part deals with the computerization of the Laotian language and concretely presents the results obtained for this language by applying the methods described previously. The described achievements helped improve the sigma-index of the Laotian language by approximately 4 points; this index is currently evaluated at 8.7/20. In the third part, we show that an approach by groups of languages can reduce computerization costs thanks to a modular architecture combining existing general software with specific complements. For the most language-dependent parts, complementary generic lingware tools give populations the possibility to computerize their languages by themselves. We validated this method by applying it to the syllabic segmentation of Southeast Asian languages with unsegmented writing systems, such as Burmese, Khmer, Laotian and Siamese (Thai).

From an email Christian Boitet sent to

1) On the terms tau-, mu-, pi- languages and pairs of languages

The point is to CHARACTERIZE in an EXACT and NON-DEPRECATING way languages and pairs of languages for which there is a lack of computerized resources and tools used or directly usable in NLP applications concerning them.

By the way, I forgot to include “pi-pairs” in the previous e-mail, but they do exist.

A pi-pair of languages is a pair for which NLP-related data, resources and tools are lacking. [pi=poorly informatisées?].

read more about that here:
Méthodes pour informatiser les langues et les groupes de langues «peu dotées» by Vincent Berment
Berment also uses the terms:
– tau-language (pair) = well (totally / très bien) equipped
– mu-language (pair) = medium (moyennement bien) equipped

Example: while French and Thai are reasonably “NLP-equipped” (tau-language and mu-language), the 2 pairs FT, TF are not.

Example: Spanish is a tau-language, Catalan and Galician are mu- or pi-languages, and the pairs SC and SG are tau-pairs because there are 2 quite good MT systems translating newspapers for these pairs (Comprendium, using the METAL shell, see Proc.

2) Other terms proposed and why they are not good terms for these concepts

The terms

* minority languages
* less-prevalent languages
* less(er) widely used languages
* less-dominant (non-dominant) languages
* traditionally oral/spoken/unwritten languages
* endangered languages
* indigenous languages
* neglected languages
* New Member State languages (used for the new languages of the European Union)

don’t really say anything about the degree of “equipment” as far as computer applications are concerned, and many of them are deprecating in some way.

(I agree 100% with Jeff Allen on that!)

The terms

* sparse-data languages
* low-density languages

also don’t fit:

– The idea that data is “sparse” means there ARE data, but in fragmentary and heterogeneous form.
But pi-languages often have NO usable data or resources, even for simple applications such as hyphenation: where are the “sparse data” for hyphenating Khmer?

– “Low-density” is even worse, as it can only mean that a language is spoken by a small fraction of the population in the place where it is spoken.
But what can be the reference? A country? A region? Taken to the extreme, almost any language is of high density in families where it is spoken.

About the 2 other terms proposed:

* commercially disadvantaged/inhibited/challenged languages
* low market-value languages

These terms also miss the point above. A language may suddenly acquire a high market value (see Chinese over the last 10-15 years), or lose it somewhat (e.g., Russian since 1991); this is independent of the resources and tools existing for it. The reason is that these are often developed NOT in order to build commercial products. Why were Eurodicautom, Euramis and EuroParl developed?

When NLP firms discover that Malay/Indonesian can be commercially interesting, they will find there are quite a lot of resources for them, including a modern unified terminology (istilah). But if the same happens for Tagalog (or maybe for Swahili), they will find next to nothing usable to quickly build applications for them.

Best regards,