by Davide Turcato, Fred Popowich, Paul McFetridge, Devlan Nicholson, Janine Toole. NAACL-ANLP 2000 Workshop on Embedded machine translation systems – Volume 5. Seattle, Washington. pp 38-45

Abstract: We describe an approach to Machine Translation of transcribed speech, as found in closed captions. We discuss how the colloquial nature and input format peculiarities of closed captions are dealt with in a pre-processing pipeline that prepares the input for effective processing by a core MT system. In particular, we describe components for proper name recognition and input segmentation. We evaluate the contribution of such modules to the system performance. The described methods have been implemented on an MT system for translating English closed captions to Spanish and Portuguese.

My Notes: Segmentation usually means splitting long sentences into shorter, more translation-friendly ones. Closed captions pose a different problem: sentences are often split arbitrarily across lines for practical reasons, which makes parsing and named entity recognition much harder.

During pre-processing, MT input undergoes the following steps: text normalization, tokenization, POS tagging, proper name recognition, and segmentation.
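
To make the order of these steps concrete for myself, here is a minimal Python sketch of such a pipeline. It is not the paper's implementation: every stage is a hypothetical toy stand-in (whitespace cleanup for normalization, a regex tokenizer, a two-tag "tagger", a capitalization heuristic for proper names, and punctuation-based splitting for segmentation), and the names `Line`, `STAGES`, and `preprocess` are my own. The only thing taken from the paper is the sequence of stages.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Line:
    """One caption line plus the annotations each stage adds (hypothetical)."""
    text: str
    tokens: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    names: List[str] = field(default_factory=list)
    segments: List[List[str]] = field(default_factory=list)


def normalize(line: Line) -> Line:
    # Toy normalization: collapse runs of whitespace.
    line.text = " ".join(line.text.split())
    return line


def tokenize(line: Line) -> Line:
    # Words (keeping internal apostrophes) or single punctuation marks.
    line.tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", line.text)
    return line


def pos_tag(line: Line) -> Line:
    # Placeholder tagger: only distinguishes punctuation from words.
    line.tags = ["PUNCT" if re.fullmatch(r"[^\w\s]", t) else "WORD"
                 for t in line.tokens]
    return line


def recognize_names(line: Line) -> Line:
    # Toy heuristic: a capitalized, multi-letter token that is not
    # line-initial. It would over-generate on all-caps input.
    line.names = [t for i, t in enumerate(line.tokens)
                  if i > 0 and len(t) > 1 and t[0].isupper()]
    return line


def segment(line: Line) -> Line:
    # Toy segmentation: close a segment after each punctuation token.
    current: List[str] = []
    for tok, tag in zip(line.tokens, line.tags):
        current.append(tok)
        if tag == "PUNCT":
            line.segments.append(current)
            current = []
    if current:
        line.segments.append(current)
    return line


# Stage order follows the list of steps in the notes above.
STAGES = [normalize, tokenize, pos_tag, recognize_names, segment]


def preprocess(text: str) -> Line:
    line = Line(text)
    for stage in STAGES:
        line = stage(line)
    return line
```

Calling `preprocess("so   I told Maria, let's go.")` runs the five stages in order and returns a `Line` whose `segments` can then be handed to the translation step one at a time.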

Excerpts:

Segmentation breaks a line into one or more segments, which are passed separately to subsequent modules (Ejerhed, 1996) (Beeferman et al., 1997). In translation, segmentation is applied to split a line into a sequence of translationally self-contained units (Lavie et al., 1996).

In our system, the translation units we identify are syntactic units, motivated by crosslinguistic considerations. Each unit is a constituent that can be translated independently. Its translation is insensitive to the context in which the unit occurs, and the order of the units is preserved by translation. One motivation for segmenting is that processing is faster: syntactic ambiguity is reduced, and backtracking from a module to a previous one does not involve re-processing an entire line, but only the segment that failed. A second motivation is robustness: a failure in one segment does not involve a failure in the entire line, and error-recovery can be limited only to a segment. Further motivations are provided by the colloquial nature of closed captions.
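
The robustness argument in this excerpt can be sketched in a few lines of Python. This is my own illustration, not code from the paper: `TranslationError`, `translate_line`, the dictionary-based toy translator, and the "copy the source text" fallback are all hypothetical. It only shows that when segments are translated independently, a failure triggers recovery for that one segment while the order of the units is preserved.

```python
from typing import Callable, List


class TranslationError(Exception):
    """Hypothetical error raised when the core MT step fails on a segment."""


def translate_line(segments: List[str],
                   translate: Callable[[str], str],
                   fallback: Callable[[str], str]) -> List[str]:
    """Translate each segment on its own, keeping the original order."""
    output: List[str] = []
    for seg in segments:
        try:
            output.append(translate(seg))
        except TranslationError:
            # Error recovery is limited to the segment that failed;
            # the other segments are not re-processed.
            output.append(fallback(seg))
    return output


# Toy translator for illustration only: a tiny phrase table.
PHRASES = {"good morning": "buenos días", "thank you": "gracias"}


def toy_translate(segment: str) -> str:
    if segment not in PHRASES:
        raise TranslationError(segment)
    return PHRASES[segment]


def keep_source(segment: str) -> str:
    # Hypothetical fallback: pass the source text through unchanged.
    return segment


result = translate_line(["good morning", "xyzzy", "thank you"],
                        toy_translate, keep_source)
```

Here the middle segment fails and is passed through untranslated, while the segments before and after it are still translated.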
