Pre-processing closed captions for machine translation
August 27, 2006
by Davide Turcato, Fred Popowich, Paul McFetridge, Devlan Nicholson, Janine Toole. NAACL-ANLP 2000 Workshop on Embedded machine translation systems – Volume 5. Seattle, Washington. pp 38-45
Abstract: We describe an approach to Machine Translation of transcribed speech, as found in closed captions. We discuss how the colloquial nature and input format peculiarities of closed captions are dealt with in a pre-processing pipeline that prepares the input for effective processing by a core MT system. In particular, we describe components for proper name recognition and input segmentation. We evaluate the contribution of such modules to the system performance. The described methods have been implemented on an MT system for translating English closed captions to Spanish and Portuguese.
My Notes: Rather than long sentences being split into shorter, more translation-friendly ones, sentences in closed captions are often split arbitrarily for practical reasons, which makes parsing and named entity recognition much harder.
During pre-processing, MT input undergoes the following steps, in order: text normalization, tokenization, POS tagging, proper name recognition, and segmentation.
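To make the order of these steps concrete for myself, here is a minimal Python sketch of such a pipeline. It is not the authors' implementation: every function name, the tiny tag and name lexicons, and the rule of splitting segments at punctuation are my own assumptions for illustration. It also assumes upper-case caption input that can simply be lowercased.

```python
import re
from dataclasses import dataclass

# Hypothetical toy resources; a real system would use trained models.
TOY_TAGS = {"and": "CONJ", "went": "VERB", "stayed": "VERB", "in": "PREP"}
TOY_NAMES = {"john", "mary", "seattle"}

@dataclass
class Token:
    text: str
    tag: str = "UNK"
    is_name: bool = False

def normalize(line: str) -> str:
    """Text normalization: collapse whitespace and lowercase the caption."""
    return re.sub(r"\s+", " ", line).strip().lower()

def tokenize(text: str) -> list:
    """Tokenization: split into word tokens and single punctuation marks."""
    return [Token(t) for t in re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text)]

def pos_tag(tokens: list) -> list:
    """POS tagging: lexicon lookup, defaulting to NOUN (toy heuristic)."""
    for t in tokens:
        if not any(c.isalnum() for c in t.text):
            t.tag = "PUNCT"
        else:
            t.tag = TOY_TAGS.get(t.text, "NOUN")
    return tokens

def mark_names(tokens: list) -> list:
    """Proper name recognition: flag tokens found in the toy name list."""
    for t in tokens:
        if t.text in TOY_NAMES:
            t.is_name, t.tag = True, "NAME"
    return tokens

def segment(tokens: list) -> list:
    """Segmentation: start a new segment at each punctuation token."""
    segments, current = [], []
    for t in tokens:
        if t.tag == "PUNCT":
            if current:
                segments.append(current)
                current = []
        else:
            current.append(t)
    if current:
        segments.append(current)
    return segments

def preprocess(line: str) -> list:
    """Run the five steps in the order listed above."""
    return segment(mark_names(pos_tag(tokenize(normalize(line)))))

segments = preprocess("JOHN  WENT HOME, AND MARY STAYED IN SEATTLE.")
for seg in segments:
    print([(t.text, t.tag) for t in seg])
```

Each stage only consumes the output of the previous one, so any single stage could be swapped for a better component without touching the others.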
Excerpts:
Segmentation breaks a line into one or more segments, which are passed separately to subsequent modules (Ejerhed, 1996) (Beeferman et al., 1997). In translation, segmentation is applied to split a line into a sequence of translationally self-contained units (Lavie et al., 1996).
In our system, the translation units we identify are syntactic units, motivated by crosslinguistic considerations. Each unit is a constituent that can be translated independently. Its translation is insensitive to the context in which the unit occurs, and the order of the units is preserved by translation. One motivation for segmenting is that processing is faster: syntactic ambiguity is reduced, and backtracking from a module to a previous one does not involve re-processing an entire line, but only the segment that failed. A second motivation is robustness: a failure in one segment does not involve a failure in the entire line, and error-recovery can be limited only to a segment. Further motivations are provided by the colloquial nature of closed captions.
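The robustness argument in this excerpt can be sketched in a few lines of Python. This is only my illustration of the idea, not the paper's mechanism: the toy English-to-Spanish lexicon, the function names, and the choice of passing a failed segment through untranslated are all assumptions.

```python
from typing import Callable, List

# Hypothetical stand-in for the core MT system: a tiny phrase lexicon.
TOY_LEXICON = {"good morning": "buenos días", "thank you": "gracias"}

def toy_translate(segment: str) -> str:
    """Translate one segment; raises KeyError when the segment is unknown."""
    return TOY_LEXICON[segment]

def pass_through(segment: str) -> str:
    """Assumed recovery strategy: emit the source segment unchanged."""
    return segment

def translate_line(
    segments: List[str],
    translate: Callable[[str], str],
    recover: Callable[[str], str],
) -> str:
    """Translate each segment independently, preserving segment order.

    A failure triggers recovery for that segment only, so the other
    segments of the line still get translated.
    """
    output = []
    for seg in segments:
        try:
            output.append(translate(seg))
        except LookupError:
            output.append(recover(seg))
    return " ".join(output)

print(translate_line(["good morning", "xyzzy", "thank you"],
                     toy_translate, pass_through))
```

Because the units are assumed to be context-insensitive and order-preserving, the translated segments can simply be concatenated in their original order.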
Why We Play Games: Four Keys to More Emotion without Story
August 27, 2006
Nicole Lazzaro, President (Abstract March 8, 2004). Player Experience Research and Design for Mass Market Interactive Entertainment.
Summary
Overcoming the MT Quality Impasse
August 27, 2006
by Steve McClure, Mary Flanagan (contractor). Aug. 2003
Abstract: The difficulty of measuring the quality of automatic language translation systems (known as machine translation [MT]) has been an obstacle to widespread adoption. With systematic benchmark testing, categorization of errors, and effective dictionary customization, MT technology can yield significant cost and time savings, as well as improved consistency in translations.
“The adoption of any new technology by mainstream organizations is driven in part by how well the technology ‘works.’ The key metric for MT is the quality of the resulting translation. Not only is this a somewhat subjective measure, but its definition changes in the context of each application and user,” says Steve McClure, a research vice president in IDC’s Software Research Group. “Quality must be measured in the context of whether the user achieved its objective, not by what percentage of the translation was correct. By applying a proven process individually with each of its enterprise customers, SYSTRAN is ensuring acceptable levels of MT quality.”
My Notes: Systran also allows users to manipulate rules (Ford Motor). A nice example of giving power to the users by having them interact with and fix the translation rules themselves.
So now I can say something like this: Given that MT post-editing is not an easy task, using non-expert users of MT might sound like an unwise idea at first, but the GALE evaluation relied on non-expert (yet widely trained) users to post-edit MT output, and even Systran has opened up their system so that end users can modify, add to, and refine their lexicons and grammars.
Excerpts from article
SYSTRAN has also developed the SYSTRAN Review Manager (SRM), which helps the customer to manage the MT quality process by allowing them to change vocabulary and linguistic rules. This tool represents an important advance in MT, both technologically and philosophically. Users have never before had the power to modify linguistic rules through an intuitive, interactive process.
By opening up rule modification, SYSTRAN takes a risk, but one that will
almost certainly pay off. Engaging users in the process of improving MT is
the surest path to increased acceptance and understanding of the technology.
Machine Translation Output Is Not Easily Predictable
MT systems work with natural language – a data set that is infinitely
varying, ambiguous, and structurally complex. To translate adequately, an
MT system must encode knowledge of hundreds of syntactic patterns,
variations, and exceptions, as well as relationships among these patterns.
It must include ever-changing vocabulary and specific semantic knowledge
about the usage patterns of tens of thousands of words. It must accurately
identify the parts of speech and grammatical characteristics of words
which may, in different contexts, be nouns, verbs, or adjectives, each
having many possible translations. Translation also requires a vast store
of knowledge about the world, the intent of the communication, and the
subject matter.
A human translator prioritizes and selectively applies linguistic rules
based on this knowledge. MT software, unless explicitly coded for each
possibility, cannot. Thus, MT will never attain the overall quality of
human translation.
The primary advantages of MT over human translation are speed, cost, and
consistency. An MT system gets a great deal more
translation done than is possible manually, and MT can deliver
translations instantly for time-sensitive content. When a term is entered
in an MT dictionary, it will translate it the same way every time, unlike
human translators who may choose different translations at different
times.