In 2004, fewer than 1% of the world's 6800 languages benefit from a high level of computerization, covering a broad range of services from text processing to machine translation. This thesis, which focuses on the other languages – the pi-languages – aims to propose solutions to remedy their digital underdevelopment. The first part, intended to show the complexity of the problem, presents the diversity of languages, the technologies used, and the approaches of the various actors: linguistic populations, software publishers, the United Nations, States… A technique for measuring the degree of computerization of a language – the sigma-index – is proposed, as well as several optimization methods. The second part deals with the computerization of the Laotian language and presents the concrete results obtained for this language by applying the methods described previously. These achievements helped improve the sigma-index of the Laotian language by approximately 4 points; the index is currently evaluated at 8.7/20. In the third part, we show that an approach by groups of languages can reduce computerization costs thanks to a modular architecture combining existing general software with specific complements. For the most language-related parts, complementary generic lingware tools give the populations the possibility to computerize their languages by themselves. We validated this method by applying it to the syllabic segmentation of Southeast Asian languages with unsegmented writing systems, such as Burmese, Khmer, Laotian and Siamese (Thai).


From an email Christian Boitet sent to

1) On the terms tau-, mu-, pi- languages and pairs of languages

The point is to CHARACTERIZE in an EXACT and NON-DEPRECATING way languages and pairs of languages for which there is a lack of computerized resources and tools used or directly usable in NLP applications concerning them.

By the way, I forgot to include “pi-pairs” in the previous e-mail, but they do exist.

A pi-pair of languages is a pair for which NLP-related data, resources and tools are lacking. [pi=poorly informatisées?].

Read more about that here:
Méthodes pour informatiser les langues et les groupes de langues «peu dotées» by Vincent Berment
Berment also uses the terms:
– tau-language (pair) = well (totally / très bien) equipped
– mu-language (pair) = medium (moyennement bien) equipped

Example: while French and Thai are reasonably “NLP-equipped” (tau-language and mu-language), the 2 pairs FT, TF are not.

Example: Spanish is a tau-language, Catalan and Galician are mu- or pi-languages, and the pairs SC and SG are tau-pairs because there are 2 quite good MT systems translating newspapers for these pairs (Comprendium, using the METAL shell, see Proc.

2) Other terms proposed and why they are not good terms for these concepts

The terms

* minority languages
* less-prevalent languages
* less(er) widely used languages
* less-dominant (non-dominant) languages
* traditionally oral/spoken/unwritten languages
* endangered languages
* indigenous languages
* neglected languages
* New Member State languages (used for the new languages of the European Union)

don’t really say anything about the degree of “equipment” as far as computer applications are concerned, and many of them are deprecating in some way.

(I agree 100% with Jeff Allen on that!)

The terms

* sparse-data languages
* low-density languages

also don’t fit:

– The idea that data is “sparse” means there ARE data, but in fragmentary and heterogeneous form.
But pi-languages often have NO usable data or resources, even for simple applications such as hyphenation: where are the “sparse data” for hyphenating Khmer?

– “Low-density” is even worse, as it can only mean that a language is spoken by a small fraction of the population where it is spoken.
But what can the reference be? A country? A region? Taken to the extreme, almost any language is of high density in the families where it is spoken.

About the 2 other terms proposed:

* commercially disadvantaged/inhibited/challenged languages
* low market-value languages

These terms also miss the point above. A language may suddenly acquire a high market value (see Chinese over the last 10-15 years), or lose it somewhat (e.g., Russian since 1991); this is independent of the resources and tools existing for it. The reason is that these are often developed NOT in order to build commercial products. Why were Eurodicautom, Euramis and EuroParl developed?

When NLP firms discover that Malay/Indonesian can be commercially interesting, they will find there are quite a lot of resources for them, including a modern unified terminology (istilah). But if the same happens for Tagalog (or maybe for Swahili), they will find next to nothing usable to quickly build applications for them.

Best regards,


In theory, if we had a very good LM trained on huge amounts of data, the kind of errors that a monolingual posteditor can correct (GALE), namely generation and fluency errors, should already be taken care of by that LM, right?

The problem is that even LMs trained on really large datasets face sparseness problems for high-order models. From a practical point of view, since the number of parameters of an n-gram model is O(|W|^n), finding the resources to compute and store all these parameters becomes a hopeless task for n > 5. And even if we did (read: Google did), in actual text the majority of n-grams that one sees are bigrams or at most trigrams, and it’s very rare to see high-order n-grams.
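
To make the O(|W|^n) argument concrete, here is a minimal Python sketch that compares the number of possible n-grams with the number actually seen in a text. The toy sentence and the function names (`possible_ngrams`, `observed_ngrams`) are made up for illustration and do not come from any particular LM toolkit.

```python
from collections import Counter


def possible_ngrams(vocab_size: int, n: int) -> int:
    """Upper bound on distinct n-grams over a vocabulary: |W|^n."""
    return vocab_size ** n


def observed_ngrams(tokens, n):
    """Count the n-grams that actually occur in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy corpus (an assumption for illustration; real LMs use billions of tokens).
tokens = "the cat sat on the mat and the cat ate the fish".split()
vocab = set(tokens)

if __name__ == "__main__":
    for n in range(1, 6):
        seen = observed_ngrams(tokens, n)
        repeated = sum(1 for count in seen.values() if count > 1)
        print(f"n={n}: possible={possible_ngrams(len(vocab), n)}, "
              f"distinct seen={len(seen)}, seen more than once={repeated}")
```

The number of n-grams a text can contain is bounded by its length, while the number of possible n-grams grows exponentially with n. With a hypothetical vocabulary of 100,000 words, `possible_ngrams(100_000, 5)` is already 10^25, far beyond what any corpus could cover.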

Therefore, current LMs, despite having smoothed and improved MT output, still generate disfluencies.

An aside: I love the terms lexical myopia and shortsightedness to describe low-order n-gram models (Beeferman et al. 1997).