Precision and Recall of Machine Translation

October 19, 2006

I. Dan Melamed, Ryan Green and Joseph P. Turian. (2003) HLT.
Computer Science Department NYU
Contact: {lastname}@cs.nyu.edu

Abstract: ŽMachine translation can be evaluated using precision, recall, and the F-measure. These standard measures have signicantly higher correlation with human judgments than recently proposed alternatives. More importantly, the standard measures have an intuitive interpretation, which can facilitate insights into how MT systems might be improved. The relevant software is publicly available.

My Notes: they define both P and R conditioned on the set of reference translations X given for a particular test set, so if Y is the set of translation candidates generated by the system, they define:

Precision(Y|X) = |X ∩ Y| / |Y|

Recall(Y|X) = |X ∩ Y| / |X|
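
To make these definitions concrete, here is a minimal sketch of my own (not code from the paper), treating the candidate Y and the reference X as bags of items, e.g. word tokens:

    from collections import Counter

    def precision_recall_f1(candidate_items, reference_items):
        # Treat translations as bags of items (e.g. word tokens).
        cand = Counter(candidate_items)   # Y: candidate items
        ref = Counter(reference_items)    # X: reference items
        overlap = sum((cand & ref).values())                        # |X ∩ Y|
        precision = overlap / sum(cand.values()) if cand else 0.0   # |X ∩ Y| / |Y|
        recall = overlap / sum(ref.values()) if ref else 0.0        # |X ∩ Y| / |X|
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Toy example with made-up sentences:
    candidate = "the cat sat on mat".split()
    reference = "the cat sat on the mat".split()
    print(precision_recall_f1(candidate, reference))   # (1.0, 0.833..., 0.909...)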

Multiple References: “One of the main sources of variance in MT evaluation measures is the multiple ways to express any given concept in natural language. A candidate translation can be perfectly correct but very different from a given reference translation. One approach to reducing this source of variance, and thereby improving reliability of MT evaluation, is to use multiple references (Thompson, 1991).”
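
One simple way to use multiple references, sketched below under my own assumptions (the paper may aggregate differently), is to score the candidate against each available reference and keep the best-matching one, reusing the helper above:

    def best_reference_match(candidate_items, reference_list):
        # Score the candidate against every reference and keep the one
        # giving the highest F-measure (one possible aggregation;
        # not necessarily the paper's exact scheme).
        return max((precision_recall_f1(candidate_items, ref) for ref in reference_list),
                   key=lambda prf: prf[2])

    references = ["the cat sat on the mat".split(),
                  "a cat was sitting on the mat".split()]
    print(best_reference_match("the cat sat on mat".split(), references))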

I can see this is a practical way to solve the problem, but I guess I have my four years of translation training to blame for my resistance to simply accepting this interpretation of the recall measure. See the random thoughts that led me to this paper.


One Response to “Precision and Recall of Machine Translation”

  1. Elena Temnova Says:

    It’s rather hard to judge the methods in question without seeing concrete samples, but I guess both approaches can work. The only problem is that only a human can verify the correctness of a translation, so the procedure of evaluating an MT system cannot be fully automatic (otherwise, a computer would be able to manage the task of translation by itself). The first rating is quite intelligible (the proportion of correct translations in a machine-translated text). As for the second one, I can say that an experienced developer of an MT system can tell contexts that can be processed by a computer from those that are more difficult and ambiguous. Secondly, we can compare different MT systems, and if, for example, a language construction could be correctly processed by none of them, we can mark it as “hard”, while another construction translated successfully by all the tested software can be marked as “easy”. Surely, there can be several nuances between these two ratings. Then, using a test text containing constructions of both types, it would be a simple task to evaluate the quality of any machine-translated text.

