Monday, September 8, 2008

Bilingual-LSA based adaptation for statistical machine translation

Date: 9 Sept 2008
Speaker: Wilson Tam
TITLE: Bilingual-LSA based adaptation for statistical machine translation

ABSTRACT:
We propose a novel approach to cross-lingual language model (LM) and translation lexicon adaptation for statistical machine translation based on bilingual Latent Semantic Analysis (bLSA). bLSA enables latent topic distributions to be efficiently transferred across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bLSA framework, model adaptation can be performed by first inferring the topic posterior distribution of the source text and then applying the inferred distribution to an N-gram LM of the target language and to the translation lexicon via marginal adaptation. The background phrase table is then enhanced with additional phrase scores computed using the adapted translation lexicon.
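The marginal adaptation step can be sketched as a unigram rescaling: background probabilities are scaled toward the topic-inferred distribution and renormalized. A minimal sketch of the idea, where the vocabulary, distributions, and the `beta` strength parameter are all invented for illustration and not taken from the system described above:

```python
def marginal_adaptation(p_bg, p_lsa, beta=0.5):
    """Rescale a background unigram LM toward an inferred topic
    distribution: p_a(w) ~ p_bg(w) * (p_lsa(w)/p_bg(w))**beta,
    then renormalize.  `beta` controls the adaptation strength."""
    scaled = {w: p * (p_lsa.get(w, 1e-12) / p) ** beta
              for w, p in p_bg.items()}
    z = sum(scaled.values())
    return {w: s / z for w, s in scaled.items()}

# Toy vocabulary: the inferred topic mix favours "trade" over "sport".
p_bg  = {"trade": 0.2, "sport": 0.2, "the": 0.6}
p_lsa = {"trade": 0.5, "sport": 0.05, "the": 0.45}
p_a = marginal_adaptation(p_bg, p_lsa, beta=0.5)
assert abs(sum(p_a.values()) - 1.0) < 1e-9
assert p_a["trade"] > p_bg["trade"]   # boosted by the topic match
```

Words favoured by the inferred topic distribution gain probability mass, while the exponent keeps the adapted model close to the background LM.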

The proposed framework also features rapid bootstrapping of LSA models for new languages based on a source LSA model of another language. Our approach was evaluated on the Chinese-to-English MT06 test set. Improvement in BLEU was observed when the adapted LM and the adapted translation lexicon were applied individually. When they were applied simultaneously, the gains in BLEU were additive, yielding 28.91% BLEU, which is statistically significant at the 95% confidence level with respect to the unadapted baseline at 28.06% BLEU.

Tuesday, August 12, 2008

Learning from Human Interpreter Speech

Speaker: Matthias Paulik

Title: "Learning from Human Interpreter Speech"
Date: 12 August 2008

Abstract:
Can spoken language translation (SLT) profit from human interpreter speech? In this talk, we explore scenarios which involve live human interpretation, off-line transcription and off-line translation on a massive scale. We consider the deployment of machine translation (MT) and automatic speech recognition (ASR) for the off-line transcription and translation tasks; our systems are trained on 80+ hours of audio data and on parallel text corpora of ~40 million words. To improve performance, we use the available human interpreter speech as an auxiliary information source to bias ASR and MT language models. We evaluate this approach on European Parliament Plenary Session (EPPS) data in three languages (English, Spanish and German), and report preliminary improvements in translation and transcription performance.
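One common way to bias an LM toward an auxiliary source such as interpreter speech is linear interpolation with a small in-domain model. The sketch below illustrates that idea under this assumption; the talk's actual biasing method, vocabularies, and the `lam` weight here are not taken from the abstract:

```python
def interpolate_lm(p_bg, p_aux, lam=0.3):
    """Linear interpolation of a background unigram LM with an
    auxiliary LM estimated from interpreter speech:
    p(w) = lam * p_aux(w) + (1 - lam) * p_bg(w)."""
    vocab = set(p_bg) | set(p_aux)
    return {w: lam * p_aux.get(w, 0.0) + (1 - lam) * p_bg.get(w, 0.0)
            for w in vocab}

# Toy distributions: the auxiliary model favours in-domain words.
p_bg  = {"session": 0.1, "parliament": 0.1, "hello": 0.8}
p_aux = {"session": 0.4, "parliament": 0.5, "hello": 0.1}
p = interpolate_lm(p_bg, p_aux, lam=0.3)
assert abs(sum(p.values()) - 1.0) < 1e-9
assert p["parliament"] > p_bg["parliament"]
```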

Tuesday, July 15, 2008

Improving Lexical Coverage of Syntax-driven MT by Re-structuring Non-isomorphic Trees

Speaker: Vamshi Ambati
Date: 15 July 2008

Abstract:
Syntax-based approaches to statistical MT require syntax-aware methods for acquiring their underlying translation models from parallel data. This acquisition process can be driven by syntactic trees for either the source or target language, or by trees on both sides. Work to date has demonstrated that using trees for both sides suffers from severe coverage problems. Approaches that project from trees on one side, on the other hand, have higher levels of recall, but suffer from lower precision, due to the lack of syntactically-aware word alignments.

In this talk I first discuss the extraction process and the lexical coverage of the translation models learned in both of these scenarios. We will look specifically at how the non-isomorphic nature of the parse trees for the two languages affects recall and coverage. I will then discuss a novel technique for restructuring target parse trees that generates highly isomorphic target trees while preserving the syntactic boundaries of constituents that were aligned in the original parse trees. I will conclude by discussing an experimental evaluation with an English-French MT system.
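The restructuring idea can be illustrated with a toy routine that keeps only target-tree nodes whose spans were aligned to source constituents and splices the children of other nodes into their parent. This is a hedged sketch of span-preserving tree flattening under invented data structures, not the talk's published algorithm:

```python
def restructure(tree, aligned_spans, start=0, is_root=True):
    """Flatten a (label, children) parse tree: keep a node only if its
    leaf span (start, end) is in `aligned_spans`; otherwise splice its
    children upward.  Leaves are plain word strings."""
    label, children = tree
    out, pos = [], start
    for c in children:
        if isinstance(c, str):          # leaf word
            out.append(c)
            pos += 1
        else:
            sub, pos = restructure(c, aligned_spans, pos, is_root=False)
            out.extend(sub)
    if is_root or (start, pos) in aligned_spans:
        return [(label, out)], pos      # keep this constituent
    return out, pos                     # splice children into parent

# Only the NP span (0, 2) was aligned, so VP and V are flattened away.
tree = ("S", [("NP", ["the", "cat"]), ("VP", [("V", ["sat"])])])
new, n = restructure(tree, {(0, 2)})
assert new == [("S", [("NP", ["the", "cat"]), "sat"])]
```

Nodes whose spans crossed no source constituent disappear, so the surviving bracketing is, by construction, consistent with the alignment.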

Tuesday, June 10, 2008

Scalable Decoding for Syntax based SMT

Title: Scalable Decoding for Syntax based SMT

Speaker: Abhaya Agarwal
Joint work with Vamshi Ambati

Date: June 10, 2008
Time: 12:00 noon - 1:30 pm
Location: NSH 3305

Abstract:
Most large-scale SMT systems these days use a hybrid model, combining a handful of generative models in a discriminative log-linear framework whose weights are trained on a small amount of development data. Attempts to train fully discriminative MT models that use a large number of features face difficult scalability challenges, because such training requires repeated decoding of a large training set. In this talk, I will describe a decoder that aims to be efficient enough to allow such training, and some initial experiments on training with a large number of features.
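In the log-linear framework described above, each hypothesis is scored as a weighted sum of sub-model features and the decoder keeps the highest-scoring one. A minimal sketch of that scoring step, where the feature names, values, and weights are invented for illustration:

```python
def loglinear_score(features, weights):
    """Log-linear model score: sum_i lambda_i * h_i(e, f), where the
    h_i are sub-model scores (typically log-probabilities)."""
    return sum(weights.get(k, 0.0) * v for k, v in features.items())

# Hypothetical translation hypotheses with LM, translation-model,
# and length features; the weights would come from tuning on dev data.
weights = {"lm": 0.5, "tm": 0.3, "length": -0.1}
hyps = {
    "hyp_a": {"lm": -2.0, "tm": -1.0, "length": 5},
    "hyp_b": {"lm": -1.5, "tm": -2.5, "length": 4},
}
best = max(hyps, key=lambda h: loglinear_score(hyps[h], weights))
assert best == "hyp_a"
```

Fully discriminative training would repeatedly re-decode the training set to update thousands of such weights, which is exactly the scalability bottleneck the talk addresses.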

Tuesday, May 13, 2008

Statistical Transfer MT Systems for French and German

Speaker: Greg Hanneman

Date: Tuesday, May 13, at noon
Location: Wean Hall 4623

Title: Statistical Transfer MT Systems for French and German

Abstract: The AVENUE research group's statistical transfer system is a general framework for creating hybrid machine translation systems. It uses two main resources: a weighted synchronous context-free grammar, and a probabilistic bilingual lexicon of syntax-based word- and phrase-level translations. Over the last six months, we have developed new methods for extracting these resources automatically from parsed and word-aligned parallel corpora. In this talk, I will describe the resource-extraction process as it was applied to new French-English and German-English systems for the upcoming ACL workshop on statistical machine translation. Preliminary evaluation results, both automatic and human-assessed, will also be reviewed.
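A weighted synchronous CFG rule pairs a source and a target right-hand side whose nonterminals are linked by index. The sketch below shows one toy representation of such rules and of projecting a rule's target side; the rule format, weights, and examples are illustrative and not the AVENUE system's actual format:

```python
from collections import namedtuple

# lhs: left-hand-side nonterminal; src/tgt: co-indexed right-hand sides.
Rule = namedtuple("Rule", "lhs src tgt weight")

rules = [
    Rule("NP", ("DET:1", "N:2"), ("DET:1", "N:2"), 0.7),  # order kept
    Rule("NP", ("N:1", "ADJ:2"), ("ADJ:2", "N:1"), 0.6),  # reordering
]

def apply_rule(rule, fillers):
    """Project the target side of a rule, substituting translated
    fillers for the co-indexed nonterminals."""
    out = []
    for sym in rule.tgt:
        label, idx = sym.split(":")    # e.g. "ADJ:2" -> ("ADJ", "2")
        out.append(fillers[int(idx)])
    return out

# French "maison bleue" (N:1=house, ADJ:2=blue) -> English "blue house"
assert apply_rule(rules[1], {1: "house", 2: "blue"}) == ["blue", "house"]
```

In a real system the rule weights would be estimated from the parsed, word-aligned parallel corpus and combined with the lexicon's translation probabilities during decoding.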

Tuesday, April 22, 2008

Simulating Sentence Pairs Sampling Process via Source and Target Language Models

Speaker: Nguyen Bach

Abstract: In a traditional word alignment process, each sentence pair is assigned an equal occurrence count, which is normalized during training to produce the empirical probability. However, some sentence pairs may be more valuable, reliable and appropriate than others, and should therefore carry a higher weight in training. To address this problem, we explored methods of resampling sentence pairs. We investigated three sets of features: sentence pair confidence (sc), genre-dependent sentence pair confidence (gdsc) and sentence-dependent phrase alignment confidence (sdpc) scores. These features were calculated over the entire training corpus and can easily be integrated into a phrase-based machine translation system.
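The resampling idea can be sketched by drawing sentence pairs with probability proportional to their confidence scores rather than giving every pair the same occurrence count. A minimal sketch with invented pairs and scores (not the features or corpus from the talk):

```python
import random

def resample_pairs(pairs, scores, n, seed=0):
    """Draw a training corpus of size n by sampling sentence pairs
    with probability proportional to their confidence scores."""
    rng = random.Random(seed)
    return rng.choices(pairs, weights=scores, k=n)

pairs  = [("s1", "t1"), ("s2", "t2"), ("s3", "t3")]
scores = [0.9, 0.1, 0.5]   # e.g. sentence pair confidence (sc)
corpus = resample_pairs(pairs, scores, n=1000)
assert len(corpus) == 1000
# High-confidence pairs appear more often in the resampled corpus.
assert corpus.count(("s1", "t1")) > corpus.count(("s2", "t2"))
```

The resampled counts then replace the uniform occurrence numbers in word alignment training, so reliable pairs contribute more to the empirical probabilities.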