Tuesday, November 21, 2006

Simulating Multiple Translations and ASR Transcripts for Applications in Multilingual Spoken Document Classification

Title: Simulating Multiple Translations and ASR Transcripts for Applications in Multilingual Spoken Document Classification

Speaker: Wei-Hao Lin from the Informedia group


Abstract:
We propose a statistical model to simulate multiple documents and
their translations (e.g., Chinese documents and their English
translations), and apply the model to the task of classifying
multilingual documents. The model, based on a frequency matching
principle, predicts that previous approaches to building classifiers
from a common language (e.g., English) are not optimal for
multilingual collections with unbalanced numbers of documents, and that a
proposed multilingual representation can outperform the monolingual
bag-of-words representation. We also investigate the possibility of
combining multiple ASR transcripts and translations through
re-weighting. The validity of our model is strongly supported by
the close match between predictions of the simulation model and the
empirical results of classifying multilingual spoken documents from
broadcast news in three languages.