Speaker: Matthias Paulik
Title: "Learning from Human Interpreter Speech"
Date: 12 August 2008
Abstract:
Can spoken language translation (SLT) profit from human interpreter speech? In this talk, we explore scenarios which involve live human interpretation, off-line transcription and off-line translation on a massive scale. We consider the deployment of machine translation (MT) and automatic speech recognition (ASR) for the off-line transcription and translation tasks; our systems are trained on 80+ hours of audio data and on parallel text corpora of ~40 million words. To improve performance, we use the available human interpreter speech as an auxiliary information source to bias ASR and MT language models. We evaluate this approach on European Parliament Plenary Session (EPPS) data in three languages (English, Spanish and German), and report preliminary improvements in translation and transcription performance.
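The core idea of biasing a language model toward interpreter speech can be sketched as linear interpolation: a large background LM is mixed with a small in-domain LM estimated from the available interpreter transcript. The sketch below uses toy unigram models and an interpolation weight chosen arbitrarily; it is an illustration of the general technique, not the speaker's exact system.

```python
from collections import Counter

def mle_unigram(tokens):
    """Maximum-likelihood unigram probabilities from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(p_background, p_bias, lam=0.7):
    """Linear interpolation: P(w) = lam * P_bg(w) + (1 - lam) * P_bias(w)."""
    vocab = set(p_background) | set(p_bias)
    return {w: lam * p_background.get(w, 0.0) + (1 - lam) * p_bias.get(w, 0.0)
            for w in vocab}

# Toy data: large general corpus vs. small interpreter-speech transcript
background = "the session is open the debate is closed".split()
interpreter = "the budget debate is open".split()

p_bg = mle_unigram(background)
p_int = mle_unigram(interpreter)
p_biased = interpolate(p_bg, p_int, lam=0.7)
# Words prominent in the interpreter speech gain probability mass.
```

In a real system the same mixing would be applied to higher-order n-gram models, but the effect is the same: vocabulary used by the human interpreter becomes more likely under the biased model.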
Tuesday, July 15, 2008
Improving Lexical Coverage of Syntax-driven MT by Re-structuring Non-isomorphic Trees
Speaker: Vamshi Ambati
Date: 15 July 2008
Abstract:
Syntax-based approaches to statistical MT require syntax-aware methods for acquiring their underlying translation models from parallel data. This acquisition process can be driven by syntactic trees for either the source or target language, or by trees on both sides. Work to date has demonstrated that using trees for both sides suffers from severe coverage problems. Approaches that project from trees on one side, on the other hand, have higher levels of recall, but suffer from lower precision, due to the lack of syntactically-aware word alignments.
In this talk, I first discuss rule extraction and the lexical coverage of the translation models learned in both of these scenarios. We will look specifically at how the non-isomorphic nature of the parse trees for the two languages affects recall and coverage. I will then present a novel technique for restructuring target parse trees that generates highly isomorphic target trees while preserving the syntactic boundaries of constituents aligned in the original parse trees. I will conclude with an experimental evaluation of an English-French MT system.
Tuesday, June 10, 2008
Scalable Decoding for Syntax based SMT
Title: Scalable Decoding for Syntax based SMT
Speaker: Abhaya Agarwal
Joint work with Vamshi Ambati
Date: June 10, 2008
Time: 12:00pm - 1:30pm
Location: NSH 3305
Abstract:
Most large-scale SMT systems today use a hybrid model, combining a handful of generative models in a discriminative log-linear framework whose weights are trained on a small amount of development data. Attempts to train fully discriminative MT models with a large number of features face difficult scalability challenges, because training requires repeated decoding of a large training set. In this talk, I will describe a decoder that aims to be efficient enough to allow such training, along with some initial experiments on training with a large number of features.
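The log-linear framework described above scores each translation hypothesis as a weighted sum of feature values and picks the highest-scoring one. A minimal sketch, with invented feature names and toy weights standing in for tuned ones:

```python
def loglinear_score(features, weights):
    """Log-linear model score: sum_i w_i * h_i(e, f)."""
    return sum(weights[name] * value for name, value in features.items())

def best_hypothesis(hypotheses, weights):
    """Pick the hypothesis with the highest model score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))

# Illustrative feature weights (in practice tuned on development data)
weights = {"tm_logprob": 1.0, "lm_logprob": 0.6, "length_penalty": -0.2}

hypotheses = [
    {"text": "the house is small",
     "features": {"tm_logprob": -2.1, "lm_logprob": -3.0, "length_penalty": 4}},
    {"text": "small is the house",
     "features": {"tm_logprob": -2.0, "lm_logprob": -5.5, "length_penalty": 4}},
]
best = best_hypothesis(hypotheses, weights)
```

With only a handful of dense features this maximization is cheap; the scalability problem the talk addresses arises when the feature set grows large and the decoder must be run repeatedly over the whole training set.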
Tuesday, May 13, 2008
Statistical Transfer MT Systems for French and German
Speaker: Greg Hanneman
Date: Tuesday, May 13, at noon
Location: Wean Hall 4623
Title: Statistical Transfer MT Systems for French and German
Abstract: The AVENUE research group's statistical transfer system is a general framework for creating hybrid machine translation systems. It uses two main resources: a weighted synchronous context-free grammar, and a probabilistic bilingual lexicon of syntax-based word- and phrase-level translations. Over the last six months, we have developed new methods for extracting these resources automatically from parsed and word-aligned parallel corpora. In this talk, I will describe the resource-extraction process as it was applied to new French--English and German--English systems for the upcoming ACL workshop on statistical machine translation. Preliminary evaluation results --- both automatic and human-assessed --- will also be reviewed.
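As a rough illustration of the two resources, a weighted synchronous CFG rule pairs a source and a target right-hand side with co-indexed nonterminals, and the bilingual lexicon maps word pairs to probabilities. The data structures and the example rule below are invented for illustration, not taken from the AVENUE system itself:

```python
from dataclasses import dataclass

@dataclass
class SyncRule:
    """One weighted synchronous CFG rule: lhs -> <src_rhs, tgt_rhs>.
    Co-indexed nonterminals are written as e.g. 'NP#1'."""
    lhs: str
    src_rhs: tuple
    tgt_rhs: tuple
    weight: float  # e.g. a log-probability

# French adjectives typically follow the noun; English adjectives precede it.
rule = SyncRule(
    lhs="NP",
    src_rhs=("DET#1", "N#2", "ADJ#3"),   # French constituent order
    tgt_rhs=("DET#1", "ADJ#3", "N#2"),   # English constituent order
    weight=-0.22,
)

# A probabilistic bilingual lexicon entry: (src, tgt) -> P(tgt | src)
lexicon = {("maison", "house"): 0.82, ("maison", "home"): 0.11}
```

Extracting many such rules and lexicon entries automatically from parsed, word-aligned parallel text is the process the talk describes.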
Tuesday, April 22, 2008
Simulating Sentence Pairs Sampling Process via Source and Target Language Models
Speaker: Nguyen Bach
Abstract: In a traditional word alignment process, each sentence pair is assigned an equal occurrence count, which is normalized during training to produce the empirical probability. However, some sentence pairs may be more valuable, reliable, and appropriate than others, and should therefore receive higher weight during training. To address this, we explored methods for resampling sentence pairs. We investigated three sets of features: sentence pair confidence (sc), genre-dependent sentence pair confidence (gdsc), and sentence-dependent phrase alignment confidence (sdpc) scores. These features are computed over the entire training corpus and can easily be integrated into a phrase-based machine translation system.
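The resampling idea can be sketched as drawing sentence pairs with probability proportional to their confidence score, so that reliable pairs contribute more occurrence counts to alignment training. The pairs and confidence values below are toy placeholders:

```python
import random

def resample_pairs(pairs, confidences, k, seed=0):
    """Resample k sentence pairs with probability proportional to confidence,
    so reliable pairs contribute more counts to alignment training."""
    rng = random.Random(seed)
    return rng.choices(pairs, weights=confidences, k=k)

pairs = [("la maison", "the house"),
         ("chat noir", "black cat"),
         ("mot rare", "rare word")]
conf = [0.9, 0.8, 0.1]  # illustrative confidence scores

corpus = resample_pairs(pairs, conf, k=1000)
# High-confidence pairs now appear far more often in the resampled corpus.
```

The resampled corpus is then fed to the standard alignment training pipeline unchanged, which is what makes the approach easy to integrate.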
Wednesday, March 19, 2008
Communicating Unknown Words in Machine Translation
Speaker: Matthias Eck
Title: Communicating Unknown Words in Machine Translation
Abstract:
Unknown words are a major problem for every machine translation system. Regular evaluations and demos do not always make this apparent, but in real communication the lack of specialty vocabulary and named-entity translations can seriously impair the ability to communicate.
A new approach is presented that uses monolingual encyclopedias and dictionaries to "communicate" unknown words. Instead of the actual unknown word, its definition is extracted and translated, which leads to considerable improvements in translation quality.
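The substitution step can be sketched as follows: when a word is out of vocabulary, look up its definition in a monolingual dictionary and translate the definition instead. The toy word-by-word "translator" and the dictionary entry below are stand-ins for the real MT system and encyclopedia:

```python
def translate_with_definitions(tokens, vocab_translate, definitions):
    """Replace each out-of-vocabulary word with its monolingual definition
    before translation, so the MT system can convey its meaning."""
    out = []
    for tok in tokens:
        if tok in vocab_translate:
            out.append(vocab_translate[tok])
        elif tok in definitions:
            # Translate the definition word by word instead of the OOV term.
            out.extend(vocab_translate.get(w, w) for w in definitions[tok].split())
        else:
            out.append(tok)  # no definition available: pass through unchanged
    return " ".join(out)

# Toy English-to-German lexicon and a monolingual dictionary entry
vocab_translate = {"the": "der", "dog": "Hund", "a": "ein", "small": "kleiner"}
definitions = {"chihuahua": "a small dog"}

result = translate_with_definitions("the chihuahua".split(),
                                    vocab_translate, definitions)
# -> "der ein kleiner Hund"
```

The output is no longer a faithful translation of the surface form, but it communicates the meaning of a word the system could not otherwise translate at all.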
Tuesday, November 13, 2007
Trees that can help
Speaker: Alok Parlikar
Title: (S (NP (NP Trees) (SBAR (WHNP that) (S (VP can)))) (VP help))
Summary:
For the past two months, I have been working with Alon Lavie and Stephan Vogel on Chinese and English parse trees, to investigate answers to the following questions:
(a) Can constituency information and word-level alignments be used to align nodes in the trees of parallel sentences? How precisely matched in meaning are the yields of these aligned nodes?
(b) Can the parse trees and word-level alignments be used to learn reordering rules? If we use these rules to reorder source sentences, can we do any better at translation?
The current results show that:
(a) Node alignments from hand-aligned data are very precise. Using automatic word alignments to align nodes gives over 70% precision and over 40% recall.
(b) Using a 10-best reordering of words in the source sentences with a "dumb" reordering strategy has shown a 0.005 improvement in BLEU score.
I would like to talk about the approaches we have taken here and to discuss strategies for improving these results.
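One common way to align nodes from word alignments is the consistency criterion also used in phrase extraction: a source node and a target node align if every alignment link inside one node's span lands inside the other's, and no outside link intrudes. A minimal sketch of that check, with a toy alignment; this is the standard criterion, not necessarily the exact one used in this work:

```python
def nodes_align(alignment, src_span, tgt_span):
    """A source node aligns to a target node when every word alignment link
    stays inside both spans (the standard consistency criterion).
    Spans are inclusive (lo, hi) index pairs over the node's yield."""
    s_lo, s_hi = src_span
    t_lo, t_hi = tgt_span
    inside = [(s, t) for s, t in alignment if s_lo <= s <= s_hi]
    # Links from outside the source span must not land inside the target span.
    intruding = [(s, t) for s, t in alignment
                 if not (s_lo <= s <= s_hi) and t_lo <= t <= t_hi]
    return (bool(inside)
            and all(t_lo <= t <= t_hi for _, t in inside)
            and not intruding)

# Word alignment links as (source_index, target_index) pairs
links = [(0, 1), (1, 0), (2, 2)]
ok = nodes_align(links, (0, 1), (0, 1))   # links stay inside both spans
bad = nodes_align(links, (0, 1), (0, 2))  # target span also claimed by src word 2
```

Running this check over all node pairs of two parse trees yields candidate node alignments whose precision and recall can then be measured against hand alignments, as in the results above.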