In 2004, fewer than 1% of the world's 6,800 languages benefit from a high level of computerization, including a broad range of services going from text processing to machine translation. This thesis, which focuses on the other languages (the pi-languages), aims at proposing solutions to remedy their digital underdevelopment. In a first part, intended to show the complexity of the problem, we present the languages’ diversity, the technologies used, as well as the approaches of the various actors: linguistic populations, software publishers, the United Nations, States… A technique for measuring the degree of computerization of a language (the sigma-index) is proposed, as well as several optimization methods. The second part deals with the computerization of the Laotian language and concretely presents the results obtained for this language by applying the methods described previously. The described achievements contributed to improving the sigma-index of the Laotian language by approximately 4 points, this index being currently evaluated at 8.7/20. In the third part, we show that an approach by groups of languages can reduce the computerization costs thanks to the use of a modular architecture associating existing general software and specific complements. For the most language-related parts, complementary generic lingware tools give the populations the possibility to computerize their languages by themselves. We validated this method by applying it to the syllabic segmentation of Southeast Asian languages with unsegmented writing systems, such as Burmese, Khmer, Laotian and Siamese (Thai).

From an email Christian Boitet sent to mt-list@eamt.org:

1) On the terms tau-, mu-, pi- languages and pairs of languages

The point is to CHARACTERIZE in an EXACT and NON-DEPRECATING way languages and pairs of languages for which there is a lack of computerized resources and tools used or directly usable in NLP applications concerning them.

By the way, I forgot to include “pi-pairs” in the previous e-mail, but they do exist.

A pi-pair of languages is a pair for which NLP-related data, resources and tools are lacking. [pi=poorly informatisées?].

read more about that here:
Méthodes pour informatiser les langues et les groupes de langues «peu dotées» by Vincent Berment
Berment also uses the terms:
- tau-language (pair) = well (totally / très bien) equipped
- mu-language (pair) = medium (moyennement bien) equipped

Example: while French and Thai are reasonably “NLP-equipped” (tau-language and mu-language), the 2 pairs FT, TF are not.

Example: Spanish is a tau-language, Catalan and Galician are mu- or pi-languages, and the pairs SC and SG are tau-pairs because there are 2 quite good MT systems translating newspapers for these pairs (Comprendium, using the METAL shell, see Proc. EAMT-05).

2) Other terms proposed and why they are not good terms for these concepts

The terms

* minority languages
* less-prevalent languages
* less(er) widely used languages
* less-dominant (non-dominant) languages
* traditionally oral/spoken/unwritten languages
* endangered languages
* indigenous languages
* neglected languages
* New Member State languages (used for the new languages of the European Union)

don’t really say anything about the degree of “equipment” as far as computer applications are concerned, and many of them are deprecating in some way.

(I agree 100% with Jeff Allen on that!)

The terms

* sparse-data languages
* low-density languages

also don’t fit:

- The idea that data is “sparse” means there ARE data, but in fragmentary and heterogeneous form.
But pi-languages often have NO usable data or resources, even for simple applications such as hyphenation: where are the “sparse data” for hyphenating Khmer?

- “Low-density” is even worse, as it can only mean that a language is spoken by a small fraction of the population where it is spoken.
But what can be the reference? A country? A region? Taken to the extreme, almost any language is of high density in families where it is spoken.

About the 2 other terms proposed:

* commercially disadvantaged/inhibited/challenged languages
* low market-value languages

These terms also miss the point above. A language may suddenly acquire a high market value (see Chinese over the last 10-15 years), or lose it somewhat (e.g., Russian since 1991); this is independent of the resources and tools existing for it. The reason is that these are often developed NOT in order to build commercial products. Why were Eurodicautom, Euramis and EuroParl developed?

When NLP firms discover that Malay/Indonesian can be commercially interesting, they will find there are quite a lot of resources for them, including a modern unified terminology (istilah). But if the same happens for Tagalog (or maybe for Swahili), they will find next to nothing usable to quickly build applications for them.

Best regards,

Ch.Boitet

By Erick Schonfeld, Om Malik, and Michael V. Copeland

SOCIAL MEDIA

Incumbent To Watch: Yahoo!
Hoping to dominate social media, it’s gobbling up promising startups (Del.icio.us, Flickr, Webjay) and experimenting with social search (My Web 2.0) that ranks results based on shared bookmarks and tags.

MASHUPS AND FILTERS

Incumbent To Watch: Google
Already the ultimate Web filter through general search as well as blog, news, shopping, and now video search, it’s encouraging mashups of Google Maps and search results, and offers a free RSS reader.

THE NEW PHONE

For nearly a century, the phone, and voice as we know it, has existed largely in the confines of a thin copper wire. But now service providers can convert voice calls into tiny Internet packets and let them loose on fast connections, thus mimicking the traditional voice experience without spending hundreds of millions on infrastructure. All you need are powerful–but cheap–computers running specialized software. The Next Net will be the new phone, creating fertile ground for new businesses.

Incumbent To Watch: eBay (Skype)
The pioneer in the field and still the front-runner, Skype brings together free calling, IM, and video calling over the Web; eBay will use it to create deeper connections between buyers and sellers. [And I'd say Google Talk is following closely...]

THE WEBTOP

It’s been a long time, all the way back to the dawn of desktop computing in the early 1980s, since software coders have had as much fun as they’re having right now. But today, browser-based applications are where the action is. A killer app no longer requires hundreds of drones slaving away on millions of lines of code. Three or four engineers and a steady supply of Red Bull is all it takes to rapidly turn a midnight brainstorm into a website so hot it melts the servers. What has changed is the way today’s Web-based apps can run almost as seamlessly as programs used on the desktop, with embedded audio, video, and drag-and-drop ease of use.
Company: 37Signals (Chicago)
What it is: Online project management
Next Net bona fides: Its Basecamp app, elegant and inexpensive, enables the creation, sharing, and tracking of to-do lists, files, performance milestones, and other key project metrics; related app Backpack, recently released, is a powerful online organizer for individuals.
Company: Writely (Portola Valley, CA)
What it is: Online word processing
Next Net bona fides: It enables online creation of documents, opens them to collaboration by anyone anywhere, and simplifies publishing the end result on a website as a blog entry.

UNDER THE HOOD

A growing number of companies are either offering themselves as Web-based platforms on which other software and businesses can be built or developing basic tools that make some of the defining hallmarks of the Next Net possible.

Incumbent To Watch: Amazon
It’s becoming a major Web platform by opening up its software protocols and encouraging anyone to use its catalog and other data; its Alexa Web crawler, which indexes the Net, can be used as the basis for other search engines, and its Mechanical Turk site solicits humans across cyberspace to do things that computers still can’t do well, such as identify images or transcribe podcasts.

In theory, if we had a very good LM trained on huge amounts of data, the kind of errors that can be corrected by a monolingual posteditor (GALE), which are generation, fluency errors, should be already taken care of by such LM, right?

The problem is that even LMs trained on really large datasets face sparseness problems for high-order models. From a practical point of view, since the number of parameters of an n-gram model is O(|W|^n), finding the resources to compute and store all these parameters becomes a hopeless task for n > 5. And even if we did (read: Google did), in actual text the majority of n-grams that one sees are bigrams or at most trigrams, and it’s very rare to see very high-order n-grams.
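
To make the O(|W|^n) growth concrete, here is a minimal sketch that tabulates the upper bound on the number of n-gram parameters and counts how many n-grams actually recur in a toy text. The vocabulary size of 100,000 and the toy sentence are illustrative assumptions, not figures from any real system.

```python
from collections import Counter


def possible_ngrams(vocab_size: int, n: int) -> int:
    """Upper bound on the number of distinct n-grams: |W|^n."""
    return vocab_size ** n


def observed_ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams that actually occur in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


VOCAB_SIZE = 100_000  # assumed vocabulary size, for illustration only

# The parameter space grows by a factor of |W| with every extra order.
for n in range(1, 7):
    print(f"n={n}: up to {possible_ngrams(VOCAB_SIZE, n):.1e} parameters")

# Toy text: the higher the order, the fewer n-grams are seen more than once.
toy = "the cat sat on the mat and the cat sat on the rug".split()
for n in (1, 2, 3, 5):
    counts = observed_ngrams(toy, n)
    repeated = sum(1 for c in counts.values() if c > 1)
    print(f"n={n}: {len(counts)} distinct n-grams, {repeated} seen more than once")
```

Even on this tiny text the number of n-grams seen more than once shrinks as n grows, which is the sparseness problem in miniature.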

Therefore, current LMs, in spite of having smoothed and improved MT output, still generate disfluencies.

An aside: I love the terms lexical myopia and shortsightedness to describe low-order n-gram models (Beeferman et al. 1997).

Doug Beeferman, Adam Berger, John Lafferty (1997). Proceedings of the Second Conference on Empirical Methods in Natural Language Processing.

Abstract: This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the system consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data. We also propose a new probabilistically motivated error metric for use by the natural language processing and information retrieval communities, intended to supersede precision and recall for appraising segmentation algorithms. Qualitative assessment of our algorithm as well as evaluation using this new metric demonstrates the effectiveness of our approach in two very different domains, Wall Street Journal articles and the TDT Corpus, a collection of newswire articles and broadcast news transcripts.

My Notes: Partitioning is at the text document level, not at the sentence level, used to segment large collections of texts (IR).

Splitting long sentences into fluent and coherent shorter sentences is much harder to do automatically, since it would require some sort of language generation module, which could turn sentential fragments into sentences. Has anybody looked at this problem?

by Alon Lavie, Donna Gates, Noah Coccaro and Lori Levin (1996). ECAI Workshop on Dialogue Processing in Spoken Language Systems.

Abstract: JANUS is a multi-lingual speech-to-speech translation system designed to facilitate communication between two parties engaged in a spontaneous conversation in a limited domain. In this paper we describe how multi-level segmentation of single utterance turns improves translation quality and facilitates accurate translation in our system. We define the basic dialogue units that are handled by our system, and discuss the cues and methods employed by the system in segmenting the input utterance into such units. Utterance segmentation in our system is performed in a multi-level incremental fashion, partly prior and partly during analysis by the parser. The segmentation relies on a combination of acoustic, lexical, semantic and statistical knowledge sources, which are described in detail in the paper. We also discuss how our system is designed to disambiguate among alternative possible input segmentations.

My Notes: Split input into semantic dialog units (~= speech act), namely semantically coherent pieces of information that can be translated independently.

Mellebeek, Bart; Owczarzak, Karolina; Van Genabith, Josef & Way, Andy. (2006). AMTA, Boston, MA.

Original paper on TransBooster project is: B. Mellebeek, A. Khasin, J. Van Genabith, A. Way. 2005. TransBooster: Boosting the Performance of Wide-Coverage Machine Translation Systems. In Proceedings of the 10th Annual Conference of the European Association for Machine Translation. pp. 189-197, Budapest, Hungary.

Abstract: In this paper, we present a novel approach to combine the outputs of multiple MT engines into a consensus translation. In contrast to previous Multi-Engine Machine Translation (MEMT) techniques, we do not rely on word alignments of output hypotheses, but prepare the input sentence for multi-engine processing. We do this by using a recursive decomposition algorithm that produces simple chunks as input to the MT engines. A consensus translation is produced by combining the best chunk translations, selected through majority voting, a trigram language model score and a confidence score assigned to each MT engine. We report statistically significant relative improvements of up to 9% BLEU score in experiments (English->Spanish) carried out on an 800-sentence test set extracted from the Penn-II Treebank.

Summary: They describe an algorithm for splitting input sentences into syntactically meaningful chunks (according to a parser/human) and simplifying the arguments of a pivot (head of the chunk) to facilitate the machine translation process of the simplified chunks in (dynamically simplified) context.

My Notes: This work shows that splitting long input sentences into shorter ones can actually improve MT output in terms of BLEU. A game with a purpose that gets humans to do the splitting therefore becomes less relevant.

Excerpts
In contrast to previous MEMT approaches, the technique we present does not rely on word alignments of target language sentences, but is based on a recursive chunking algorithm that produces simple constituents as input to the MT engines. The outputs of these syntactically meaningful chunks are compared to each other and the highest ranked translations are used to compose the output sentence. Our approach, therefore, prepares the input sentence for multi-engine processing on the input side. It draws its strength from the simple fact that short input strings result in better translations than longer ones.

The decomposition into chunks, the tracking of the output chunks in target and the final composition of the output are based on the TransBooster architecture presented in (Mellebeek et al., 2005) [EAMT, Budapest].

Our approach presupposes the existence of some sort of syntactic analysis of the input sentence. In a first step, the input sentence is decomposed into a number of syntactically meaningful chunks as in (1).
(1) [ARG_1] [ADJ_1]. . . [ARG_L] [ADJ_l] pivot [ARG_L+1] [ADJ_l+1]. . . [ARG_L+R] [ADJ_l+r]
where pivot = the nucleus of the sentence, ARG = argument, ADJ = adjunct, {l,r} = number of ADJs to left/right of pivot, and {L,R} = number of ARGs to left/right of pivot.
In order to determine the pivot, we compute the head of the local tree by adapting the head-lexicalised grammar annotation scheme of (Magerman, 1995). In certain cases, we derive a ‘complex pivot’ consisting of the head terminal together with some of its neighbours, e.g. phrasal verbs or strings of auxiliaries. The procedure used for argument/adjunct identification is an adapted version of Hockenmaier’s algorithm for CCG (Hockenmaier, 2003).

In a next step, we replace the arguments by similar but simpler strings, which we call ‘Substitution Variables’. The purpose of Substitution Variables is: (i) to help to reduce the complexity of the original arguments, which often leads to an improved translation of the pivot; (ii) to help keep track of the location of the translation of the arguments in target.
In choosing an optimal Substitution Variable for a constituent, there exists a trade-off between accuracy and retrievability. ‘Static’ or previously defined Substitution Variables (e.g. ‘cars’ to replace the NP ‘fast and confidential deals’ as explained in section 3.5) are easy to track in target, since their translation by a specific MT engine is known in advance, but they might distort the translation of the pivot because of syntactic/semantic differences with the original constituent. ‘Dynamic’ Substitution Variables comprise the real heads of the constituent (e.g. ‘deals’ to replace the NP ‘fast and confidential deals’ as outlined in section 3.5) and guarantee a maximum similarity, but are more difficult to track in target.
Our algorithm employs Dynamic Substitution Variables first and automatically backs off to Static Substitution Variables if problems occur. By replacing the arguments by their Substitution Variables and leaving out the adjuncts in (1), we obtain the skeleton in (2)

(2) [VARG_1 ] . . . [VARG_L] pivot [VARG_L+1] . . . [VARG_L+R]
where VARG_i is the simpler string substituting ARG_i
By matching the previously established translations of the Substitution Variables VARG_i (1 <= i <= L + R) in the translation of the skeleton in (2), we are able to (i) extract the translation of the pivot and (ii) track the location of the translated arguments in target. The result of this second step on the worked example is shown in (6). Adjuncts are located in target by using a similar strategy in which adjunct Substitution Variables are added to the skeleton in (2).
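
The skeleton-and-tracking step described in these excerpts can be pictured with a small sketch. This is an illustration under stated assumptions, not the TransBooster implementation: the `Argument` class, `build_skeleton`, `track`, the pivot ‘sells’, the variable ‘The man’ and the Spanish strings are all hypothetical, and the skeleton translation is hand-written rather than produced by an MT engine. Only the pair ‘cars’ / ‘fast and confidential deals’ comes from the paper.

```python
from dataclasses import dataclass


@dataclass
class Argument:
    text: str                  # original argument (ARG_i)
    variable: str              # Substitution Variable replacing it (VARG_i)
    variable_translation: str  # translation of VARG_i, assumed known in advance


def build_skeleton(left: list[Argument], pivot: str, right: list[Argument]) -> str:
    """Replace arguments by their variables and drop adjuncts, as in (2)."""
    return " ".join([a.variable for a in left] + [pivot] + [a.variable for a in right])


def track(skeleton_translation: str, arguments: list[Argument]):
    """Locate each variable translation in the translated skeleton.

    Returns the leftover text (taken to be the pivot translation) and the
    character span where each original argument's translation must go.
    """
    spans = {}
    remainder = skeleton_translation
    for arg in arguments:
        start = skeleton_translation.find(arg.variable_translation)
        if start < 0:
            # Here a real system would back off to another Substitution Variable.
            raise LookupError(f"cannot track {arg.variable!r} in target")
        spans[arg.text] = (start, start + len(arg.variable_translation))
        remainder = remainder.replace(arg.variable_translation, " ", 1)
    return " ".join(remainder.split()), spans


# Hypothetical worked example (static variables, hand-written translations).
left = [Argument("The chairman", "The man", "El hombre")]
right = [Argument("fast and confidential deals", "cars", "coches")]

skeleton = build_skeleton(left, "sells", right)
skeleton_translation = "El hombre vende coches"  # stand-in for MT engine output

pivot_translation, spans = track(skeleton_translation, left + right)
print(skeleton)
print(pivot_translation, spans)
```

Simple substring matching only works when the variable's translation appears unchanged in the skeleton translation. That condition is the retrievability side of the trade-off, and it is why static variables with known translations are the fallback.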

Since translating individual chunks out of context is likely to produce a deficient output or lead to boundary friction, we need to ensure that each chunk is translated in a simple context that mimics the original.
As in the case of the Substitution Variables, this context can be static (a previously established template, the translation of which is known in advance) or dynamic (a simpler version of the original context).
Our approach is based on the idea that by reducing the complexity of the original context, the analysis modules of the MT engines are more likely to produce a better translation of the input chunk C_i than if it were left intact in the original sentence, which contains more syntactic and semantic ambiguities.
In other words, we try to improve on the translation C_ij of chunk C_i by MT engine j through input simplification. (cf. section 3.5 for more details)
After obtaining the translations of all input chunks by all MT engines (C_i1 to C_iN), all that remains to be done is to select the best output translation C_i_best for each chunk C_i and derive the output by composing all C_i_best. This is possible since we have kept track of the position of each C_ij by the Substitution Variables.
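
The selection and composition step can be sketched as follows. This is a simplified stand-in, not the authors' procedure. It uses majority voting and breaks ties with a per-engine confidence score; the trigram language model score mentioned in the abstract is left out. The engine names, confidence values and Spanish candidates are invented for illustration, and the chunks are assumed to be already in target order.

```python
from collections import Counter


def select_best(candidates: dict[str, str], confidence: dict[str, float]) -> str:
    """Pick C_i_best for one chunk.

    candidates maps an engine name to that engine's translation of the chunk.
    The translation with the most votes wins; ties go to the candidate backed
    by the engine with the highest confidence score.
    """
    votes = Counter(candidates.values())

    def score(translation: str):
        backers = [e for e, t in candidates.items() if t == translation]
        return (votes[translation], max(confidence[e] for e in backers))

    return max(votes, key=score)


def compose(chunks: list[dict[str, str]], confidence: dict[str, float]) -> str:
    """Concatenate the best translation of every chunk (target order assumed)."""
    return " ".join(select_best(c, confidence) for c in chunks)


# Hypothetical engines and outputs.
confidence = {"engine_a": 0.5, "engine_b": 0.7, "engine_c": 0.6}
chunks = [
    {"engine_a": "El presidente", "engine_b": "El presidente", "engine_c": "El director"},
    {"engine_a": "vende", "engine_b": "vende", "engine_c": "vende"},
    {"engine_a": "tratos", "engine_b": "ofertas", "engine_c": "acuerdos"},
]

print(compose(chunks, confidence))
```

In the third chunk every candidate has a single vote, so the choice falls to the most trusted engine. In the system described above, a language model score would also contribute to the ranking.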