Pre-processing closed captions for machine translation

August 27, 2006

by Davide Turcato, Fred Popowich, Paul McFetridge, Devlan Nicholson, Janine Toole. NAACL-ANLP 2000 Workshop on Embedded machine translation systems – Volume 5. Seattle, Washington. pp 38-45

Abstract: We describe an approach to Machine Translation of transcribed speech, as found in closed captions. We discuss how the colloquial nature and input format peculiarities of closed captions are dealt with in a pre-processing pipeline that prepares the input for effective processing by a core MT system. In particular, we describe components for proper name recognition and input segmentation. We evaluate the contribution of such modules to the system performance. The described methods have been implemented on an MT system for translating English closed captions to Spanish and Portuguese.

My Notes: The problem here is not the usual one of splitting long sentences into shorter, more translation-friendly ones. In closed captions, sentences are often already split arbitrarily for practical reasons, which makes parsing and named entity recognition much harder.

During pre-processing, MT input undergoes the following steps: text normalization, tokenization, POS tagging, proper name recognition, and segmentation.
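To make the order of the steps concrete, here is a minimal sketch of such a pipeline in Python. It is not the authors' implementation: every function below is a toy stand-in I made up for illustration, including the lowercasing rule, the word/punctuation "tagger", the `KNOWN_NAMES` gazetteer, and the punctuation-based segmenter. The only thing taken from the paper is the sequence of stages.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (word, tag)

# Hypothetical gazetteer; a real system would use a proper name recognizer.
KNOWN_NAMES = {"seattle"}


def normalize(line: str) -> str:
    # Assumption: collapse whitespace and lowercase the caption text.
    return re.sub(r"\s+", " ", line).strip().lower()


def tokenize(line: str) -> List[str]:
    # Words (with an optional apostrophe part) or single punctuation marks.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", line)


def pos_tag(tokens: List[str]) -> List[Token]:
    # Toy tagger: only distinguishes words from punctuation.
    return [(t, "WORD" if t[0].isalnum() else "PUNCT") for t in tokens]


def mark_proper_names(tagged: List[Token]) -> List[Token]:
    # Retag any token found in the gazetteer as a proper name.
    return [(w, "NAME") if w in KNOWN_NAMES else (w, t) for w, t in tagged]


def segment(tagged: List[Token]) -> List[List[Token]]:
    # Toy segmenter: close a segment at sentence-final punctuation.
    segments: List[List[Token]] = []
    current: List[Token] = []
    for word, tag in tagged:
        current.append((word, tag))
        if tag == "PUNCT" and word in {".", "?", "!"}:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def preprocess(line: str) -> List[List[Token]]:
    # The five stages, applied in the order listed above.
    return segment(mark_proper_names(pos_tag(tokenize(normalize(line)))))


segments = preprocess("WELCOME  TO SEATTLE. HOW ARE YOU?")
```

Each stage consumes the previous stage's output, so segmentation can rely on tags and proper name marks rather than on raw text.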


Segmentation breaks a line into one or more segments, which are passed separately to subsequent modules (Ejerhed, 1996) (Beeferman et al., 1997). In translation, segmentation is applied to split a line into a sequence of translationally self-contained units (Lavie et al., 1996).

In our system, the translation units we identify are syntactic units, motivated by cross-linguistic considerations. Each unit is a constituent that can be translated independently. Its translation is insensitive to the context in which the unit occurs, and the order of the units is preserved by translation. One motivation for segmenting is that processing is faster: syntactic ambiguity is reduced, and backtracking from a module to a previous one does not involve re-processing an entire line, but only the segment that failed. A second motivation is robustness: a failure in one segment does not involve a failure in the entire line, and error recovery can be limited to a single segment. Further motivations are provided by the colloquial nature of closed captions.
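The robustness argument can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `LEXICON`, `toy_translate`, and `translate_line` are hypothetical names, the "translator" is a dictionary lookup, and the fallback simply passes the source segment through. The point is that a failure is caught per segment, and that the output keeps the segments in their original order.

```python
from typing import Callable, List

# Hypothetical phrase table standing in for a core MT system.
LEXICON = {"good morning": "buenos días", "thank you": "gracias"}


def toy_translate(segment: str) -> str:
    # Fails (raises) on any segment it does not know.
    if segment not in LEXICON:
        raise ValueError(f"cannot translate: {segment!r}")
    return LEXICON[segment]


def translate_line(
    segments: List[str],
    translate: Callable[[str], str],
    fallback: Callable[[str], str],
) -> str:
    # Translate each segment independently; recover only the one that fails.
    output = []
    for seg in segments:
        try:
            output.append(translate(seg))
        except ValueError:
            output.append(fallback(seg))
    # Segment order is preserved in the translated line.
    return " ".join(output)


result = translate_line(
    ["good morning", "uh well", "thank you"],
    toy_translate,
    fallback=lambda seg: seg,  # assumption: pass the source text through
)
```

Here the middle segment fails, but the two segments around it are still translated, which is the behaviour the paragraph above describes.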

