Monday, December 8, 2008

Fast MT Pipeline: Introduction to tools you can use

Date: 09-Dec-2008

Qin and Alok will report on their work using parallel
processing to speed up parts of the MT pipeline, with an
emphasis on the tools they have developed for this kind
of work.

Title: Fast MT Pipeline: Introduction to tools you can use.

Abstract: In this talk, we would like to introduce you to some
recently developed tools available for you to use, in order to
speed up the MT pipeline. The tools of focus are:
(i) multi-threaded giza: faster word alignment.
(ii) chaksi: Phrase-Extraction on the M45 cluster.
(iii) trambo: Decoding/MERT on the M45 cluster.

Sunday, November 16, 2008


Date: 11 Nov 2008
Time: 12-1:30
Room: NSH 3305


Andreas Zollmann: Wider Pipelines: N-Best Alignments and Parses in MT Training

State-of-the-art statistical machine translation systems use hypotheses from several maximum a posteriori inference steps, including word alignments and parse trees, to identify translational structure and estimate the parameters of translation models. While this approach leads to a modular pipeline of independently developed components, errors made in these “single-best” hypotheses can propagate to downstream estimation steps that treat these inputs as clean, trustworthy training data. In this work we integrate N-best alignments and parses by using a probability distribution over these alternatives to generate posterior fractional counts for use in downstream estimation. Using these fractional counts in a DOP-inspired syntax-based translation system, we show significant improvements in translation quality over a single-best trained baseline.
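As a rough illustration of the posterior fractional counts described above, the Python sketch below normalizes the model scores of an N-best list into a distribution and credits each extracted event with the posterior mass of the hypotheses that contain it. The function names, the toy alignment links, and the scores are assumptions made for illustration; the abstract does not say how the distribution is actually computed.

```python
import math
from collections import defaultdict

def posterior_weights(log_scores):
    """Turn log-scores of N-best hypotheses into a normalized distribution."""
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def fractional_counts(nbest):
    """nbest: list of (log_score, events); events are hashable items
    (here, hypothetical alignment links) extracted from one hypothesis."""
    weights = posterior_weights([score for score, _ in nbest])
    counts = defaultdict(float)
    for weight, (_, events) in zip(weights, nbest):
        for event in set(events):
            counts[event] += weight  # each hypothesis contributes its posterior mass
    return dict(counts)

# Toy 3-best list of word alignments for one sentence pair (made-up scores).
nbest = [
    (math.log(0.6), [("maison", "house"), ("la", "the")]),
    (math.log(0.3), [("maison", "house"), ("la", "house")]),
    (math.log(0.1), [("maison", "the"), ("la", "the")]),
]
counts = fractional_counts(nbest)
```

A single-best pipeline would give the links of the first hypothesis a count of 1 and everything else 0; here a link shared by several hypotheses accumulates the weight of all of them.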

Silja Hildebrand: Combination of Machine Translation Systems via Hypothesis Selection from Combined N-Best Lists

Different approaches in machine translation achieve similar translation quality with a variety of translations in the output. Recently it has been shown that it is possible to leverage the individual strengths of various systems and improve the overall translation quality by combining translation outputs. In this paper we present a method of hypothesis selection which is relatively simple compared to system combination methods that construct a synthesis of the input hypotheses. Our method uses information from n-best lists from several MT systems and features on the sentence level which are independent of the MT systems involved to improve the translation quality.
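Hypothesis selection of this kind can be sketched in a few lines of Python: pool the n-best lists from all systems, score every candidate with sentence-level features that do not depend on which system produced it, and return the best one. The two features used here (bigram agreement with the pool and a length penalty) and their weights are hypothetical stand-ins, not the feature set from the talk.

```python
def bigrams(sentence):
    tokens = sentence.split()
    return set(zip(tokens, tokens[1:]))

def agreement(hyp, pool):
    """Mean fraction of hyp's bigrams that also occur in each pooled hypothesis."""
    own = bigrams(hyp)
    if not own:
        return 0.0
    return sum(len(own & bigrams(other)) / len(own) for other in pool) / len(pool)

def length_penalty(hyp, pool):
    """Negative distance between hyp's length and the mean pooled length."""
    mean_len = sum(len(h.split()) for h in pool) / len(pool)
    return -abs(len(hyp.split()) - mean_len)

def select_hypothesis(nbest_lists, features, weights):
    """Pick the pooled hypothesis with the highest weighted feature sum."""
    pool = [hyp for nbest in nbest_lists for hyp in nbest]
    def score(hyp):
        return sum(w * f(hyp, pool) for f, w in zip(features, weights))
    return max(dict.fromkeys(pool), key=score)  # de-duplicate, keep order

system_a = ["the house is small", "the house is little"]
system_b = ["the house is small", "house the small is"]
best = select_hypothesis([system_a, system_b],
                         [agreement, length_penalty], [1.0, 0.1])
```

Because no new string is constructed, the output is always one of the input hypotheses, which is what makes selection simpler than synthesis-based system combination.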

Monday, September 8, 2008

Bilingual-LSA based adaptation for statistical machine translation

Date: 9 Sept 2008
Speaker: Wilson Tam
TITLE: Bilingual-LSA based adaptation for statistical machine translation

We propose a novel approach to crosslingual language model (LM) and translation lexicon adaptation for statistical machine translation based on bilingual Latent Semantic Analysis (bLSA). bLSA enables latent topic distributions to be efficiently transferred across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bLSA framework, model adaptation can be performed by, first, inferring the topic posterior distribution of the source text and then applying the inferred distribution to an N-gram LM of the target language and translation lexicon via marginal adaptation. The background phrase table is then enhanced with the additional phrase scores computed using the adapted translation lexicon.

The proposed framework also features rapid bootstrapping of LSA models for new languages based on a source LSA model of another language. Our approach was evaluated on the Chinese-to-English MT06 test set. Improvement in BLEU was observed when the adapted LM and the adapted translation lexicon were applied individually. When both were applied simultaneously, the gains were additive, yielding 28.91% BLEU, a statistically significant improvement at the 95% confidence level over the unadapted baseline of 28.06% BLEU.
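The abstract does not spell out the marginal adaptation step. One common formulation, assumed here, rescales each background conditional probability by the ratio of the topic-adapted unigram probability to the background unigram probability, raised to a tuning exponent, and then renormalizes. The Python sketch below applies that formulation to a single history; the names `bg_cond`, `bg_unigram`, `topic_unigram`, and `beta` are illustrative, and the toy probabilities are made up.

```python
def marginal_adapt(bg_cond, bg_unigram, topic_unigram, beta=0.5):
    """Adapt P_bg(w | h) for one fixed history h.

    bg_cond:       background P(w | h) over the vocabulary
    bg_unigram:    background unigram P(w)
    topic_unigram: unigram P(w) implied by the inferred topic posterior
    beta:          tuning exponent (assumed; 0 leaves the model unchanged)
    """
    scaled = {
        w: p * (topic_unigram[w] / bg_unigram[w]) ** beta
        for w, p in bg_cond.items()
    }
    total = sum(scaled.values())  # renormalize so the result sums to 1
    return {w: v / total for w, v in scaled.items()}

bg_cond = {"bank": 0.5, "river": 0.5}
bg_unigram = {"bank": 0.5, "river": 0.5}
topic_unigram = {"bank": 0.8, "river": 0.2}  # topic favors "bank"
adapted = marginal_adapt(bg_cond, bg_unigram, topic_unigram, beta=1.0)
```

Words that the inferred topic distribution favors gain probability mass, while the N-gram context of the background model is kept.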

Tuesday, August 12, 2008

Learning from Human Interpreter Speech

Speaker: Matthias Paulik

Title: "Learning from Human Interpreter Speech"
Date: 12 August 2008

Can spoken language translation (SLT) profit from human interpreter speech? In this talk, we explore scenarios which involve live human interpretation, off-line transcription and off-line translation on a massive scale. We consider the deployment of machine translation (MT) and automatic speech recognition (ASR) for the off-line transcription and translation tasks; our systems are trained on 80+ hours of audio data and on parallel text corpora of ~40 million words. To improve performance, we use the available human interpreter speech as an auxiliary information source to bias ASR and MT language models. We evaluate this approach on European Parliament Plenary Session (EPPS) data in three languages (English, Spanish and German), and report preliminary improvements in translation and transcription performance.

Tuesday, July 15, 2008

Improving Lexical Coverage of Syntax-driven MT by Re-structuring Non-isomorphic Trees

Speaker: Vamshi Ambati
Date: 15 July 2008

Syntax-based approaches to statistical MT require syntax-aware methods for acquiring their underlying translation models from parallel data. This acquisition process can be driven by syntactic trees for either the source or target language, or by trees on both sides. Work to date has demonstrated that using trees for both sides suffers from severe coverage problems. Approaches that project from trees on one side, on the other hand, have higher levels of recall, but suffer from lower precision, due to the lack of syntactically-aware word alignments.

In this talk I will first discuss extraction and the lexical coverage of the translation models learned in both of these scenarios. I will look specifically at how the non-isomorphic nature of the parse trees for the two languages affects recall and coverage. I will then discuss a novel technique for restructuring target parse trees that generates highly isomorphic target trees that preserve the syntactic boundaries of constituents that were aligned in the original parse trees. I will conclude by discussing some experimental evaluation with an English-French MT system.

Tuesday, June 10, 2008

Scalable Decoding for Syntax based SMT

Title: Scalable Decoding for Syntax based SMT

Speaker: Abhaya Agarwal
Joint work with Vamshi Ambati

Date: June 10, 2008
Time: 12pm - 1:30pm
Location: NSH 3305

Most large-scale SMT systems these days use a hybrid model, combining a handful of generative models in a discriminative log-linear framework whose weights are trained on a small amount of development data. Attempts to train fully discriminative MT models that use a large number of features face difficult scalability challenges because such training requires repeated decoding of a large training set. In this talk, I will describe a decoder that aims to be efficient enough to allow such training, along with some initial experiments on training with a large number of features.
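For readers unfamiliar with the log-linear framework mentioned above, a minimal Python sketch follows: each hypothesis carries a vector of feature values (typically log-probabilities from the component models), and its score is the weighted sum of those values. The feature names, weights, and values are made up for illustration and do not come from the talk.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values; features without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def best_hypothesis(hypotheses, weights):
    """Return the hypothesis with the highest log-linear score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))

# Hypothetical weights, as they might come out of tuning on development data.
weights = {"tm": 1.0, "lm": 0.5, "word_penalty": -0.2}
hypotheses = [
    {"text": "the house is small",
     "features": {"tm": -2.0, "lm": -3.0, "word_penalty": 4}},
    {"text": "house small",
     "features": {"tm": -3.5, "lm": -6.0, "word_penalty": 2}},
]
best = best_hypothesis(hypotheses, weights)
```

With only a handful of weights, tuning needs few decoding passes over a small development set. With a large number of features, the weights have to be learned from the full training set, which is why decoder speed becomes the bottleneck.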

Tuesday, May 13, 2008

Statistical Transfer MT Systems for French and German

Speaker: Greg Hanneman

Date: Tuesday, May 13, at noon
Location: Wean Hall 4623

Title: Statistical Transfer MT Systems for French and German

Abstract: The AVENUE research group's statistical transfer system is a general framework for creating hybrid machine translation systems. It uses two main resources: a weighted synchronous context-free grammar, and a probabilistic bilingual lexicon of syntax-based word- and phrase-level translations. Over the last six months, we have developed new methods for extracting these resources automatically from parsed and word-aligned parallel corpora. In this talk, I will describe the resource-extraction process as it was applied to new French--English and German--English systems for the upcoming ACL workshop on statistical machine translation. Preliminary evaluation results --- both automatic and human-assessed --- will also be reviewed.

Tuesday, April 22, 2008

Simulating Sentence Pairs Sampling Process via Source and Target Language Models

Speaker: Nguyen Bach

Abstract: In a traditional word alignment process, every sentence pair is assigned the same occurrence count, which is normalized during training to produce the empirical probability. However, some sentence pairs may be more valuable, reliable and appropriate than others, and should therefore carry a higher weight in training. To address this, we explored methods of resampling sentence pairs. We investigated three sets of features: sentence pair confidence (/sc/), genre-dependent sentence pair confidence (/gdsc/) and sentence-dependent phrase alignment confidence (/sdpc/) scores. These features were calculated over an entire training corpus and could easily be integrated into the phrase-based machine translation system.
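The idea of replacing uniform occurrence counts with confidence-based weights can be sketched in Python as below. The confidence values are placeholders; how the /sc/, /gdsc/ and /sdpc/ scores are actually computed is not described in the abstract, so the sketch takes them as given.

```python
import random

def pair_weights(confidences):
    """Normalize per-sentence-pair confidences into empirical probabilities."""
    total = sum(confidences)
    return [c / total for c in confidences]

def resample(corpus, confidences, k, seed=0):
    """Draw k sentence pairs with replacement, proportional to confidence."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.choices(corpus, weights=confidences, k=k)

corpus = [
    ("la maison", "the house"),
    ("le chat", "the cat"),
    ("bruit", "xyzzy"),  # a noisy pair
]
confidences = [3.0, 1.0, 0.0]  # hypothetical scores; 0 removes the pair

weights = pair_weights(confidences)
sample = resample(corpus, confidences, k=8)
```

Setting every confidence to the same value recovers the traditional setup, where each sentence pair has probability 1/N.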