Text Segmentation Using Exponential Models

August 28, 2006

Doug Beeferman, Adam Berger, John Lafferty (1997). Proceedings of the Second Conference on Empirical Methods in Natural Language Processing.

Abstract: This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the system consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data. We also propose a new probabilistically motivated error metric for use by the natural language processing and information retrieval communities, intended to supersede precision and recall for appraising segmentation algorithms. Qualitative assessment of our algorithm as well as evaluation using this new metric demonstrates the effectiveness of our approach in two very different domains, Wall Street Journal articles and the TDT Corpus, a collection of newswire articles and broadcast news transcripts.

My Notes: Partitioning is at the text document level, not at the sentence level, used to segment large collections of texts (IR).

Splitting long sentences into fluent and coherent shorter sentences is much harder to do automatically, since it would require some sort of language generation module, which could turn sentential fragments into sentences. Has anybody looked at this problem?
An aside: I love the term lexical miopia and shortsightedness to describe low n-gram models.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: