Tuesday, February 3, 2009

An Overview of Tree-to-String Translation Models: Yang Liu

Speaker: Yang Liu
Title: An Overview of Tree-to-String Translation Models

Abstract:

Recent research on statistical machine translation has led to the rapid development of syntax-based translation models, in which syntactic information can be exploited to direct translation. In this talk, I will give an overview of tree-to-string translation models, one of the state-of-the-art families of syntax-based models. In a tree-to-string model, the source side is a phrase-structure parse tree and the target side is a string. This talk covers the following topics: (1) the naive tree-to-string model, (2) the tree-sequence-based tree-to-string model, (3) the context-aware tree-to-string model, and (4) the forest-based tree-to-string model. Experimental results show that the forest-based tree-to-string model significantly outperforms the hierarchical phrase-based model.
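To make the core idea concrete, here is a minimal sketch of how a single derivation in a tree-to-string model rewrites a source parse tree into a target string. The toy grammar, lexicon, and Japanese-to-English example below are invented for illustration; real models learn thousands of such rules (with tree fragments deeper than one level) from aligned tree-string pairs.

```python
# Each rule maps a source node label to a target-side template; "x0",
# "x1", ... are variables standing for the translations of the children.
RULES = {
    "S":  ["x0", "x1"],   # S(NP, VP) -> NP' VP'   (monotone)
    "VP": ["x1", "x0"],   # VP(NP, V) -> V' NP'    (SOV -> SVO reordering)
}
LEXICON = {"kare": "he", "ongaku": "music", "suki": "likes"}

def translate(tree):
    """Recursively apply rules top-down; leaves are translated
    by the lexicon."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [LEXICON.get(children[0], children[0])]   # lexical rule
    out = []
    for var in RULES[label]:
        out.extend(translate(children[int(var[1:])]))    # substitute child
    return out

# A toy source tree: (label, children...) with a word at each leaf.
src = ("S", ("NP", "kare"), ("VP", ("NP", "ongaku"), ("V", "suki")))
```

Applying `translate(src)` reorders the Japanese-style VP(NP, V) into English V' NP' order, producing "he likes music" from "kare ongaku suki".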

Short Bio:

Yang Liu is an Assistant Researcher at the Institute of Computing Technology, Chinese Academy of Sciences. He graduated in Computer Science from Wuhan University in 2002 and received his PhD in Computer Science from the Institute of Computing Technology, Chinese Academy of Sciences. His major research interests include statistical machine translation and Chinese information processing, and his publications on discriminative word alignment and tree-to-string models have received wide attention. He has served as a PC member/reviewer for TALIP, ACL, EMNLP, AMTA, and SSST.


Tuesday, January 13, 2009

Parallel Treebanks in Machine Translation

Title: Parallel Treebanks in Machine Translation
Speaker: John Tinsley, Ph.D. student at the National Centre for Language Technology in DCU

Monday, December 8, 2008

Fast MT Pipeline: Introduction to tools you can use

Date: 09-Dec-2008

Qin and Alok will report on their work to speed up parts of
the MT pipeline using parallel processing, with an emphasis
on the tools they have developed for this purpose.


Title: Fast MT Pipeline: Introduction to tools you can use.

Abstract: In this talk, we would like to introduce you to some
recently developed tools available for you to use, in order to
speed up the MT pipeline. The tools of focus are:
(i) multi-threaded giza: faster word alignment.
(ii) chaksi: Phrase-Extraction on the M45 cluster.
(iii) trambo: Decoding/MERT on the M45 cluster.

Sunday, November 16, 2008

Presentations

Date: 11 Nov 2008
Time: 12-1:30
Room: NSH 3305

Presentations:

Andreas Zollmann: Wider Pipelines: N-Best Alignments and Parses in MT Training

State-of-the-art statistical machine translation systems use hypotheses from several maximum a posteriori inference steps, including word alignments and parse trees, to identify translational structure and estimate the parameters of translation models. While this approach leads to a modular pipeline of independently developed components, errors made in these “single-best” hypotheses can propagate to downstream estimation steps that treat these inputs as clean, trustworthy training data. In this work we integrate N-best alignments and parses by using a probability distribution over these alternatives to generate posterior fractional counts for use in downstream estimation. Using these fractional counts in a DOP-inspired syntax-based translation system, we show significant improvements in translation quality over a single-best trained baseline.
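The key step, turning N-best model scores into posterior fractional counts, can be sketched in a few lines. The alignment links and log-scores below are invented for illustration; the abstract's method applies the same idea to parses as well.

```python
import math
from collections import defaultdict

# Hypothetical N-best word alignments for one sentence pair: each entry
# is a set of (source, target) links plus a model log-score.
nbest = [
    ({(0, 0), (1, 1)}, -1.0),
    ({(0, 0), (1, 2)}, -2.0),
]

def fractional_counts(nbest):
    """Normalize the N-best log-scores into a posterior distribution
    and accumulate a fractional count for every alignment link, instead
    of counting only the links of the single-best alignment."""
    m = max(score for _, score in nbest)           # shift for numerical stability
    z = sum(math.exp(score - m) for _, score in nbest)
    counts = defaultdict(float)
    for links, score in nbest:
        posterior = math.exp(score - m) / z
        for link in links:
            counts[link] += posterior
    return counts

counts = fractional_counts(nbest)
```

The link (0, 0) appears in every hypothesis, so its fractional count is 1.0; the links that the hypotheses disagree on split the posterior mass between them, so downstream estimation sees the model's uncertainty rather than a hard single-best choice.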


Silja Hildebrand: Combination of Machine Translation Systems via Hypothesis Selection from Combined N-Best Lists

Different approaches to machine translation achieve similar translation quality while producing a variety of translations in the output. Recently it has been shown that it is possible to leverage the individual strengths of various systems and improve overall translation quality by combining their outputs. In this paper we present a method of hypothesis selection which is relatively simple compared to system combination methods that construct a synthesis of the input hypotheses. Our method uses information from the n-best lists of several MT systems, together with sentence-level features that are independent of the MT systems involved, to improve translation quality.
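One simple system-independent, sentence-level feature of the kind the abstract describes is n-gram agreement with the other pooled hypotheses. The sketch below selects a hypothesis using only that one feature; the hypotheses are invented, and a real system would combine several such features with tuned weights.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def select_hypothesis(pooled, n=2):
    """Pick the hypothesis whose n-grams agree most with the rest of
    the pooled n-best lists (a feature independent of which system
    produced each hypothesis)."""
    best, best_score = None, float("-inf")
    for hyp in pooled:
        own = Counter(ngrams(hyp.split(), n))
        score = 0.0
        for other in pooled:
            if other is hyp:
                continue
            ref = Counter(ngrams(other.split(), n))
            # clipped n-gram matches against each other hypothesis
            score += sum(min(c, ref[g]) for g, c in own.items())
        if score > best_score:
            best, best_score = hyp, score
    return best

# Toy pooled n-best lists from three hypothetical systems.
pooled = ["the cat sat on the mat",
          "a cat sat on the mat",
          "the cat is on a mat"]
```

Here the first hypothesis wins because it shares the most bigrams with the other two, illustrating how consensus across systems can substitute for system-internal scores.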

Monday, September 8, 2008

Bilingual-LSA based adaptation for statistical machine translation

Date: 9 Sept 2008
Speaker: Wilson Tam
TITLE: Bilingual-LSA based adaptation for statistical machine translation

ABSTRACT:
We propose a novel approach to cross-lingual language model (LM) and translation lexicon adaptation for statistical machine translation based on bilingual Latent Semantic Analysis (bLSA). bLSA enables latent topic distributions to be efficiently transferred across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bLSA framework, model adaptation is performed by first inferring the topic posterior distribution of the source text and then applying the inferred distribution to an N-gram LM of the target language and to the translation lexicon via marginal adaptation. The background phrase table is then enhanced with additional phrase scores computed using the adapted translation lexicon.

The proposed framework also features rapid bootstrapping of LSA models for new languages based on a source LSA model of another language. Our approach was evaluated on the Chinese-to-English MT06 test set. Improvements in BLEU were observed when the adapted LM and the adapted translation lexicon were applied individually. When both were applied simultaneously, the gains were additive, yielding 28.91% BLEU, which is statistically significant at the 95% confidence level with respect to the unadapted baseline at 28.06% BLEU.
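The marginal-adaptation step described above can be sketched at the unigram level: the topic posterior inferred from the source text gives an LSA word marginal, and the background model is rescaled toward it and renormalized. The distributions, topic posterior, and scaling exponent beta below are invented for illustration; the actual system applies this to a full N-gram LM and the translation lexicon.

```python
def adapt_unigram(p_bg, topic_word, topic_post, beta=0.5):
    """Compute the LSA marginal p_lsa(w) = sum_k P(k | source) * P(w | k),
    then rescale the background unigram by (p_lsa / p_bg) ** beta and
    renormalize (marginal adaptation)."""
    vocab = p_bg.keys()
    p_lsa = {w: sum(topic_post[k] * topic_word[k][w] for k in topic_post)
             for w in vocab}
    unnorm = {w: p_bg[w] * (p_lsa[w] / p_bg[w]) ** beta for w in vocab}
    z = sum(unnorm.values())
    return {w: v / z for w, v in unnorm.items()}

# Toy two-word vocabulary, two topics, and an inferred topic posterior
# that favors topic 0.
p_bg = {"bank": 0.5, "music": 0.5}
topic_word = {0: {"bank": 0.9, "music": 0.1},
              1: {"bank": 0.1, "music": 0.9}}
adapted = adapt_unigram(p_bg, topic_word, {0: 0.8, 1: 0.2})
```

Because the inferred posterior favors the topic in which "bank" is likely, the adapted model shifts probability mass toward "bank" while remaining a proper distribution.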

Tuesday, August 12, 2008

Learning from Human Interpreter Speech

Speaker: Matthias Paulik

Title: "Learning from Human Interpreter Speech"
Date: 12 August 2008

Abstract:
Can spoken language translation (SLT) profit from human interpreter speech? In this talk, we explore scenarios which involve live human interpretation, off-line transcription and off-line translation on a massive scale. We consider the deployment of machine translation (MT) and automatic speech recognition (ASR) for the off-line transcription and translation tasks; our systems are trained on 80+ hours of audio data and on parallel text corpora of ~40 million words. To improve performance, we use the available human interpreter speech as an auxiliary information source to bias ASR and MT language models. We evaluate this approach on European Parliament Plenary Session (EPPS) data in three languages (English, Spanish and German), and report preliminary improvements in translation and transcription performance.

Tuesday, July 15, 2008

Improving Lexical Coverage of Syntax-driven MT by Re-structuring Non-isomorphic Trees

Speaker: Vamshi Ambati
Date: 15 July 2008

Abstract:
Syntax-based approaches to statistical MT require syntax-aware methods for acquiring their underlying translation models from parallel data. This acquisition process can be driven by syntactic trees for either the source or target language, or by trees on both sides. Work to date has demonstrated that using trees for both sides suffers from severe coverage problems. Approaches that project from trees on one side, on the other hand, have higher levels of recall, but suffer from lower precision, due to the lack of syntactically-aware word alignments.

In this talk I will first discuss extraction and the lexical coverage of the translation models learned in both of these scenarios. We will look specifically at how the non-isomorphic nature of the parse trees for the two languages affects recall and coverage. I will then discuss a novel technique for restructuring target parse trees that generates highly isomorphic target trees which preserve the syntactic boundaries of constituents aligned in the original parse trees. I will conclude by discussing an experimental evaluation with an English-French MT system.
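The boundary condition the restructuring step must preserve can be stated as the usual alignment-consistency check: a target constituent remains usable for rule extraction only if its span maps to a contiguous source block with no alignment link leaking outside. The sketch below is a generic version of that check, with an invented toy alignment; it is not the restructuring algorithm itself.

```python
def is_consistent(span, alignment):
    """Return True if the target span [i, j) maps, via the word
    alignment (a set of (src, tgt) index pairs), to a contiguous source
    block none of whose words align outside the span -- the condition a
    constituent must satisfy to stay extractable after restructuring."""
    i, j = span
    src = {s for (s, t) in alignment if i <= t < j}
    if not src:
        return False          # an unaligned span yields no lexical rule
    lo, hi = min(src), max(src)
    return all(i <= t < j for (s, t) in alignment if lo <= s <= hi)

# Toy alignment: source word 1 aligns to target word 2 and vice versa,
# so target span [1, 3) is consistent but [0, 2) is not.
alignment = {(0, 0), (1, 2), (2, 1)}
```

Non-isomorphic tree pairs are precisely those where many constituents fail this check, which is why restructuring the target tree around the aligned boundaries recovers coverage.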