## Precision and Recall for Machine Translation

### October 19, 2006

Researchers in NLP, and more specifically in IR, have made extensive use of precision (P) and recall (R) to evaluate their systems. Widely used definitions of P and R are as follows:

P = relevant items system got correct / total number of items system produced or generated

R = relevant items system got correct / total number of relevant items (which the system should have produced)

Now, if we think about evaluating an MT system, the items are the translations and so precision is straightforward to calculate. P = number of correct translations produced by the system / total number of translations produced by the system.

But how do we calculate recall? The numerator is the same as for P (number of correct translations generated by the system), but how does one determine the number of relevant translations? This is almost a philosophical question, since there is not just one (set of) translation(s) that is correct given a SL sentence. Unless there is a fixed set of reference translations (often used by the MT community to evaluate systems with automatic metrics such as BLEU and METEOR), there is no way to know a priori how many translations are possible for a given SL sentence.

And this is indeed how most people use recall to evaluate MT systems: they take a set of references as the absolute truth about what is possible and relevant for any sentence. Melamed et al. (2003) define both P and R as conditioned on a set of references X given for a particular test set, so if Y is the set of translation candidates generated by the system, they define:

Precision(Y|X) = |X ∩ Y| / |Y|

Recall(Y|X) = |X ∩ Y| / |X|
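To make the two definitions concrete, here is a minimal Python sketch. It assumes the items are whole translations of a single source sentence and that "correct" means an exact string match with a reference. The function name `precision_recall` and the example sentences are made up for illustration; they are not taken from Melamed et al.

```python
def precision_recall(references, candidates):
    """Return (precision, recall) for candidate set Y against reference set X."""
    x, y = set(references), set(candidates)
    overlap = len(x & y)  # |X ∩ Y|
    precision = overlap / len(y) if y else 0.0  # |X ∩ Y| / |Y|
    recall = overlap / len(x) if x else 0.0     # |X ∩ Y| / |X|
    return precision, recall


# Hypothetical data: three reference translations (X) and two system outputs (Y).
X = {
    "the cat sat on the mat",
    "the cat was sitting on the mat",
    "a cat sat on the mat",
}
Y = {
    "the cat sat on the mat",
    "the cat sat at the mat",
}

p, r = precision_recall(X, Y)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

One candidate out of two matches a reference, so precision is 1/2, while recall is 1/3 because there are three references. Adding or removing a reference changes recall without the system output changing at all, which is exactly why the choice of reference set matters so much.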

**Multiple References**: “One of the main sources of variance in MT evaluation measures is the multiple ways to express any given concept in natural language. A candidate translation can be perfectly correct but very different from a given reference translation. One approach to reducing this source of variance, and thereby improving reliability of MT evaluation, is to use multiple references (Thompson, 1991).”
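The effect described in the quote can be sketched in a few lines of Python. This sketch assumes word-level (bag-of-words) overlap and scores a candidate against each reference separately, keeping the best-matching one. That combination rule is only one simple possibility chosen for illustration, not necessarily what Melamed et al. or Thompson do, and the function names and sentences are hypothetical.

```python
from collections import Counter


def word_overlap_scores(candidate, reference):
    """Bag-of-words precision and recall of one candidate against one reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection size
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


def best_over_references(candidate, references):
    """Score against every reference and keep the pair with the highest recall."""
    scores = [word_overlap_scores(candidate, ref) for ref in references]
    return max(scores, key=lambda pr: (pr[1], pr[0]))


# Hypothetical example: a correct candidate that differs from the first reference.
candidate = "he left the room quickly"
one_reference = ["he hurried out of the room"]
two_references = one_reference + ["he left the room quickly"]

single = best_over_references(candidate, one_reference)
multi = best_over_references(candidate, two_references)
print("one reference: ", single)
print("two references:", multi)
```

With only the first reference the candidate shares three words with it and is penalized for the rest. Once a second reference is added, the same candidate matches it word for word.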

I can see that this is a practical solution, but I guess I have my four years of translation training to blame for my reluctance to simply accept this interpretation of the recall measure.