Can a good Language Model eliminate the need for monolingual post-editing?

August 28, 2006

In theory, if we had a very good LM trained on huge amounts of data, the kinds of errors that a monolingual post-editor (GALE) can correct, namely generation and fluency errors, should already be taken care of by such an LM, right?

The problem is that even LMs trained on really large datasets face sparseness problems for high-order models. From a practical point of view, since the number of parameters of an n-gram model is O(|W|^n), where |W| is the vocabulary size, finding the resources to compute and store all these parameters becomes a hopeless task for n > 5. And even if we did (read: Google did), in actual text the majority of n-grams that one sees are bigrams or at most trigrams, and it's very rare to see higher-order n-grams.
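To make the sparseness argument a bit more concrete, here is a rough Python sketch. It is only an illustration under my own assumptions: the vocabulary size of 100,000 is made up, the two "corpora" are toy sentences, and the helper names (`ngram_parameter_count`, `seen_fraction`) are hypothetical, not taken from any LM toolkit. It shows how the |W|^n bound grows with n, and how the fraction of test n-grams already seen in training tends to shrink as n increases.

```python
def ngram_parameter_count(vocab_size: int, n: int) -> int:
    """Upper bound on the number of distinct n-grams over a vocabulary: |W|^n."""
    return vocab_size ** n


def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def seen_fraction(train_tokens, test_tokens, n):
    """Fraction of test n-grams that also occur somewhere in the training tokens."""
    seen = set(ngrams(train_tokens, n))
    test = ngrams(test_tokens, n)
    if not test:
        return 0.0
    return sum(gram in seen for gram in test) / len(test)


if __name__ == "__main__":
    vocab_size = 100_000  # assumption: an illustrative vocabulary size
    for n in range(1, 7):
        print(f"n={n}: up to {ngram_parameter_count(vocab_size, n):.1e} parameters")

    # Toy data, purely for illustration.
    train = "the cat sat on the mat and the dog sat on the rug".split()
    test = "the dog sat on the mat and the cat ran".split()
    for n in range(1, 6):
        print(f"n={n}: {seen_fraction(train, test, n):.2f} of test n-grams seen in training")
```

Even on this tiny example the short n-grams are mostly covered while the 5-grams are not, which is the same effect, in miniature, that makes high-order models hard to estimate.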

Therefore, current LMs, in spite of having smoothed and improved MT output, still generate disfluencies.

An aside: I love the terms lexical myopia and shortsightedness to describe low-order n-gram models (Beeferman et al. 1997).

