Wednesday, September 7, 2011

Training Machine Translation with a Second-Order Taylor Approximation of Weighted Translation Instances

Title: Training Machine Translation with a Second-Order Taylor Approximation of Weighted Translation Instances
Speaker: Aaron Phillips
When: Tuesday, September 13, 12:00 Noon to 1:00pm.
Where: GHC 6501

Abstract: The Cunei Machine Translation Platform is an open-source MT system designed to model instances of translation. One of the challenges to this approach is effective training. We describe two techniques that improve the training procedure and allow us to leverage the strengths of instance-based modeling. First, during training we approximate our model with a second-order Taylor series. Second, we discount models based on the magnitude of their approximation. By reducing error in training, our model now consistently outperforms the standard SMT model with gains ranging from 0.51 to 3.77 BLEU on German-English and Czech-English test sets.
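As a rough illustration of the first technique, a second-order Taylor series replaces a function near a point x0 with f(x0) + f'(x0)(x - x0) + 0.5 f''(x0)(x - x0)^2. The sketch below is generic and is not Cunei's training code: the softplus objective and the helper names (`taylor2`, `softplus`, `sigmoid`, `softplus_second_derivative`) are assumptions chosen only because their derivatives are easy to write down.

```python
import math


def taylor2(f, df, d2f, x0):
    """Build a second-order Taylor approximation of f around x0.

    f, df and d2f are the function and its first and second derivatives.
    """
    f0, g0, h0 = f(x0), df(x0), d2f(x0)

    def approx(x):
        dx = x - x0
        return f0 + g0 * dx + 0.5 * h0 * dx * dx

    return approx


# Hypothetical stand-in objective (not from the talk): softplus.
def softplus(x):
    return math.log1p(math.exp(x))


def sigmoid(x):
    # First derivative of softplus.
    return 1.0 / (1.0 + math.exp(-x))


def softplus_second_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)


approx = taylor2(softplus, sigmoid, softplus_second_derivative, 0.0)
# Absolute error of the approximation a small step away from x0.
error_near = abs(approx(0.1) - softplus(0.1))
# The error grows as we move further from the expansion point.
error_far = abs(approx(2.0) - softplus(2.0))
```

The gap between `error_near` and `error_far` shows why the size of the approximation error matters: the quadratic model is only trustworthy close to the point where it was built.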

Monday, May 16, 2011

Syntax-to-Morphology Mapping in Factored Phrase-Based SMT (English and Turkish)

Title: Syntax-to-Morphology Mapping in Factored Phrase-Based
Statistical Machine Translation between English and Turkish

Speaker: Reyyan Yeniterzi

When: Tuesday, May 17 at 12:15pm
Where: GHC 6501

Abstract:

Motivated by the observation that many local and some nonlocal
syntactic structures in English essentially map to morphologically
complex words in Turkish, an approach called syntax-to-morphology
mapping was recently introduced (Yeniterzi and Oflazer, 2010). This
approach maps syntactic structures in English directly to complex
words in Turkish. It recognizes certain local and nonlocal syntactic
structures on the English side, packages those structures, and
attaches them to heads to obtain parallel morphological structures.

With the help of this method, one can identify and reorganize phrases
on the English side, to align English syntax to Turkish morphology.
Furthermore, with this method, continuous and discontinuous variants of
certain (syntactic) source phrases can be conflated during the SMT
phrase extraction process. Since most function words encoding syntax
are now abstracted into complex tags, the length of the English
sentences can be dramatically reduced.

The initial experiments were performed on an English-to-Turkish SMT
system. In this project, we built upon this initial system by applying
lexical reordering and data augmentation. We also applied
syntax-to-morphology mapping to a Turkish-to-English SMT system for
the first time.

This is joint work with Kemal Oflazer from Qatar CMU. It was presented
at the Machine Translation and Morphologically-rich Languages Research
Workshop in Haifa, Israel, in January 2011.

Tuesday, May 3, 2011

Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability

Title: Better Hypothesis Testing for Statistical Machine Translation:
Controlling for Optimizer Instability
Speaker: Jonathan Clark
When: Tuesday, 4/19 at Noon

Abstract:
In statistical machine translation, a researcher seeks to determine
whether some innovation (e.g., a new feature, model, or inference
algorithm) improves translation quality in comparison to a baseline
system. To answer this question, he runs an experiment to evaluate the
behavior of the two systems on held-out data. In this paper, we
consider how to make such experiments more statistically reliable. We
provide a systematic analysis of the effects of optimizer instability
(an extraneous variable that is seldom controlled for) on experimental
outcomes, and make recommendations for reporting results more
accurately.

This is joint work with Chris Dyer, Alon Lavie, and Noah Smith. It was
recently accepted for publication as an ACL short paper.
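One simple way to keep optimizer instability visible when reporting results is to run the optimizer several times per system and report the mean and standard deviation of the metric, rather than a single score. The sketch below assumes that setup. The function name `summarize_runs` and all BLEU values are made up for illustration and are not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over independent optimizer runs."""
    if len(scores) < 2:
        raise ValueError("need at least two optimizer runs to estimate variance")
    return mean(scores), stdev(scores)


# Hypothetical BLEU scores from three independent optimizer runs per system.
baseline_bleu = [25.1, 25.6, 24.9]
experimental_bleu = [25.8, 25.4, 26.1]

base_mean, base_sd = summarize_runs(baseline_bleu)
exp_mean, exp_sd = summarize_runs(experimental_bleu)

print(f"baseline:     {base_mean:.2f} +/- {base_sd:.2f} BLEU")
print(f"experimental: {exp_mean:.2f} +/- {exp_sd:.2f} BLEU")
```

If the difference between the two means is small relative to the spread across runs, a single-run comparison could easily have pointed the other way.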

Wednesday, March 2, 2011

Qin Gao: Expanding parallel corpora for machine translation

Speaker: Qin Gao
When: at noon, March 8, 2011
Where: GHC 4405

We present an approach to expanding parallel corpora for machine translation. By applying semantic role labeling (SRL) to one side of the language pair, we extract SRL substitution rules from an existing parallel corpus. The rules are then used to generate new sentence pairs. An SVM classifier is built to filter the generated sentence pairs. The filtered corpus is used for training phrase-based translation models, which can be used directly in translation tasks or combined with baseline models. Experimental results on Chinese-English machine translation tasks show an average improvement of 0.45 BLEU and 1.22 TER points across 5 different NIST test sets.

Thursday, January 13, 2011

Machine Translation and Computer-Assisted Translation

Title: Prospects for Integrating Machine Translation and Computer-Assisted Translation in the Translation Industry

Speaker: Gregory M. Shreve from the Department of Modern and Classical Language Studies at Kent State University and colleagues
Location: GHC 6115
Time: 12:30 pm, 14 Jan 2011


The speaker's CV can be found at http://www.kent.edu/mcls/faculty/mcls_shreve.cfm.

Monday, December 13, 2010

Efficient Language Model Inference - Kenneth Heafield

Title: Efficient Language Model Inference
Who? Kenneth Heafield
When? Tuesday, December 21 @ Noon
Where? GHC 4405


In GHC 4405 at noon on Tuesday Dec 21, I will give a speaking
requirement talk on Efficient Language Model Inference. As this is also
an MT Lunch, there will be free lunch.

If you're using SRILM, come to my talk to reduce your memory consumption
by 86% while reducing CPU time by 16%. Users of IRSTLM should come for
the same reason; the code uses 42% less memory and 19% less CPU.

Language models are an important feature in speech, translation,
generation, IR, and other technologies. More training data and less
pruning generally lead to higher quality, but RAM is a limiting factor.
Further, systems consult language models so frequently that lookups
dominate CPU time.

This talk presents language modeling code with several optimizations to
improve time and space performance. Storing backoff information in
feature state reduces redundant lookups. Constructing known
distributions and biasing binary search speeds search and reduces page
faults. Memory mapping reduces load time. Bit level packing increases
locality. Stronger filtering removes n-grams that cannot be assembled
during decoding due to phrase and sentence constraints. The code is
currently integrated into Moses and being integrated into cdec and
Joshua. I will cover how my code works and how to use it in other
decoders.
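One way to bias binary search when the distribution of keys is known is interpolation search: instead of probing the midpoint, it guesses where the target should sit from the key values at both ends of the range. The sketch below assumes sorted integer keys that are roughly uniformly distributed (for example, hashed n-grams). It illustrates the general idea and is not the code presented in the talk; `interpolation_search` is a hypothetical name.

```python
def interpolation_search(keys, target):
    """Return the index of target in the sorted integer list keys, or -1."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            # All remaining keys are equal, so the loop condition
            # guarantees that they equal the target.
            return lo
        # Bias the probe toward where the target should fall
        # if keys were spread evenly between keys[lo] and keys[hi].
        pivot = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pivot] == target:
            return pivot
        if keys[pivot] < target:
            lo = pivot + 1
        else:
            hi = pivot - 1
    return -1


# Hypothetical evenly spaced keys standing in for hashed n-grams.
keys = [3, 10, 17, 24, 31, 38]
found = interpolation_search(keys, 24)
missing = interpolation_search(keys, 25)
```

On evenly spread keys the first probe often lands on or next to the target, so fewer distinct memory locations are touched than with midpoint probing.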

Wednesday, October 6, 2010

Choosing the Right Evaluation for Machine Translation

Time: Noon on Tuesday, October 12
Place: GHC 6501 (usual location)

Title: Choosing the Right Evaluation for Machine Translation: an Examination of Annotator and Automatic Metric Performance on Human Judgment Tasks
Authors: Michael Denkowski and Alon Lavie

Abstract:
This work examines the motivation, design, and practical results of several types of human evaluation tasks for machine translation. In addition to considering annotator performance and task informativeness over multiple evaluations, we explore the practicality of tuning automatic evaluation metrics to each judgment type in a comprehensive experiment using the METEOR metric. We present results showing clear advantages of tuning to certain types of judgments and discuss causes of inconsistency when tuning to various judgment data, as well as sources of difficulty in the human evaluation tasks themselves.

This work will be presented at AMTA 2010.