Monday, December 13, 2010

Efficient Language Model Inference - Kenneth Heafield

Title: Efficient Language Model Inference
Who? Kenneth Heafield
When? Tuesday, December 21 @ Noon
Where? GHC 4405

In GHC 4405 at noon on Tuesday, Dec 21, I will give a speaking
requirement talk on Efficient Language Model Inference. As this is also
an MT Lunch, there will be free lunch.

If you're using SRILM, come to my talk to reduce your memory consumption
by 86% while reducing CPU time by 16%. Users of IRSTLM should come for
the same reason; the code uses 42% less memory and 19% less CPU.

Language models are an important feature in speech, translation,
generation, IR, and other technologies. More training data and less
pruning generally lead to higher quality, but RAM is a limiting factor.
Further, systems consult language models so frequently that lookups
dominate CPU time.

This talk presents language modeling code with several optimizations to
improve time and space performance. Storing backoff information in
feature state reduces redundant lookups. Constructing known
distributions and biasing binary search speeds search and reduces page
faults. Memory mapping reduces load time. Bit level packing increases
locality. Stronger filtering removes n-grams that cannot be assembled
during decoding due to phrase and sentence constraints. The code is
currently integrated into Moses and being integrated into cdec and
Joshua. I will cover how my code works and how to use it in other systems.
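One of the optimizations above, biasing binary search, can be illustrated with a small sketch (my own simplification in Python, not the talk's actual code): instead of always probing the midpoint, the pivot is biased toward where the target key is expected to fall in the sorted array, which tends to cut the number of probes, and with them page faults, on large memory-mapped tables.

```python
def biased_search(keys, target):
    """Search a sorted list of integer keys, biasing the pivot toward
    where the target is expected to lie (interpolation-style) rather
    than always splitting the window in the middle as plain binary
    search does. Returns the index of target, or -1 if absent."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi:
        if keys[hi] == keys[lo]:
            mid = lo  # degenerate window; avoid division by zero
        else:
            # Bias the pivot by the target's relative position in the
            # current key range, then clamp it back into the window.
            frac = (target - keys[lo]) / (keys[hi] - keys[lo])
            mid = lo + int(frac * (hi - lo))
            mid = max(lo, min(mid, hi))
        if keys[mid] == target:
            return mid
        elif keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

keys = [2, 7, 19, 23, 41, 58, 77]
print(biased_search(keys, 41))  # prints 4
```

On uniformly distributed hashed n-gram keys the biased pivot usually lands very close to the target, so far fewer array cells are touched than with midpoint splitting.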

Wednesday, October 6, 2010

Choosing the Right Evaluation for Machine Translation

Time: Noon on Tuesday, October 12
Place: GHC 6501 (usual location)

Title: Choosing the Right Evaluation for Machine Translation: an Examination of Annotator and Automatic Metric Performance on Human Judgment Tasks
Authors: Michael Denkowski and Alon Lavie

This work examines the motivation, design, and practical results of several types of human evaluation tasks for machine translation. In addition to considering annotator performance and task informativeness over multiple evaluations, we explore the practicality of tuning automatic evaluation metrics to each judgment type in a comprehensive experiment using the METEOR metric. We present results showing clear advantages of tuning to certain types of judgments and discuss causes of inconsistency when tuning to various judgment data, as well as sources of difficulty in the human evaluation tasks themselves.

This work will be presented at AMTA 2010.

Monday, September 13, 2010

Models for Synchronous Grammar Induction for Statistical Machine Translation

Title: Models for Synchronous Grammar Induction for Statistical Machine Translation
Presenters: Chris Dyer, LTI & Desai Chen, CSD undergraduate
Tuesday, September 14, Noon to 1:30pm in GHC 6501.

Abstract: The last decade of research in Statistical Machine Translation (SMT) has seen rapid progress. The most successful methods have been based on synchronous context free grammars (SCFGs), which encode translational equivalences and license reordering between tokens in the source and target languages. Yet, while closely related language pairs can now be translated with a high degree of precision, the results for distant pairs are far from acceptable. In theory, however, the "right" SCFG is capable of handling most, if not all, structurally divergent language pairs. This talk will report on the results of the 2010 Language Engineering Workshop held at Johns Hopkins University, whose goal was to focus on the crucial practical aspects of acquiring such SCFGs from bilingual, but otherwise unannotated, text. We started with existing algorithms for inducing unlabeled SCFGs (e.g. the popular Hiero model) and then used unsupervised learning methods to refine the syntactic constituents used in the translation rules of the grammar.
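As a toy illustration of what a synchronous rule encodes (hypothetical code, not the workshop's systems), here is a Hiero-style rule whose shared nonterminal indices license reordering between the source and target sides:

```python
def apply_rule(src_rhs, tgt_rhs, fillers):
    """Substitute translated subphrases for the indexed nonterminals
    on each side of a synchronous rule. `fillers` maps a nonterminal
    index to a (source_phrase, target_phrase) pair of token lists."""
    def fill(rhs, side):
        out = []
        for sym in rhs:
            if isinstance(sym, int):           # nonterminal slot
                out.extend(fillers[sym][side])
            else:                              # terminal token
                out.append(sym)
        return out
    return fill(src_rhs, 0), fill(tgt_rhs, 1)

# A rule capturing French-English noun-adjective reordering:
#   X -> < la X1 X2 , the X2 X1 >
src = ["la", 1, 2]           # French side
tgt = ["the", 2, 1]          # English side: indices 1 and 2 swapped
fillers = {1: (["maison"], ["house"]), 2: (["bleue"], ["blue"])}
src_out, tgt_out = apply_rule(src, tgt, fillers)
# src_out == ["la", "maison", "bleue"], tgt_out == ["the", "blue", "house"]
```

The same mechanism, applied recursively, is what lets an SCFG translate and reorder at every level of the derivation.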

Monday, June 28, 2010

Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish

Tuesday, June 29 at Noon, in GHC 6501

Title: Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical
Machine Translation from English to Turkish

Authors: Reyyan Yeniterzi and Kemal Oflazer


We present a novel scheme to apply factored phrase-based SMT to a
language pair with very disparate morphological structures. Our
approach relies on syntactic analysis on the source side (English) and
then encodes a wide variety of local and non-local syntactic
structures as complex structural tags which appear as additional
factors in the training data. On the target side (Turkish), we only
perform morphological analysis and disambiguation but treat the
complete complex morphological tag as a factor, instead of separating
morphemes. We incrementally explore capturing various syntactic
substructures as complex tags on the English side, and evaluate how
our translations improve in BLEU scores. Our maximal set of source and
target side transformations, coupled with some additional techniques,
provide a 39% relative improvement, from a baseline of 17.08 to 23.78
BLEU, all averaged over 10 training and test sets. Now that the
syntactic analysis on the English side is available, we also
experiment with longer-distance constituent reordering to bring the
English constituent order closer to that of Turkish, but find that these
transformations do not provide any additional consistent tangible
gains when averaged over the 10 sets.
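For the curious, the reported relative improvement is just the gain over the baseline expressed as a fraction of the baseline:

```python
baseline_bleu, improved_bleu = 17.08, 23.78
relative_gain = (improved_bleu - baseline_bleu) / baseline_bleu * 100
print(f"{relative_gain:.1f}% relative improvement")  # about 39.2%
```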

Wednesday, May 12, 2010

Chunk-Based EBMT

When: noon on May 18
Where: GHC 6501

Speaker: Jaedy Kim

Topic: Chunk-Based EBMT

Abstract: Corpus driven machine translation approaches such as Phrase-Based Statistical Machine Translation and Example-Based Machine Translation have been successful by using word alignment to find translation fragments for matched source parts in a bilingual training corpus.
However, they still cannot properly handle the systematic insertion or deletion of words between two distant languages.
In this work, we used syntactic chunks as translation units to alleviate this problem, improve alignments and show improvement in BLEU for Korean to English and Chinese to English translation tasks.

Monday, April 12, 2010

Generalized templates for EBMT

Speaker: Rashmi Gangadharaiah
Location: GHC 6501

Topic: Generalized templates for EBMT

Example-Based Machine Translation (EBMT), like other corpus-based methods, requires substantial parallel training data. One way to reduce data requirements and improve translation quality is to generalize parts of the parallel corpus into translation templates. This automated generalization process requires clustering. In most clustering approaches the optimal number of clusters (N) is found empirically on a development set, which often takes several days. We introduce a spectral clustering framework that automatically estimates the optimal N and removes unstable oscillating points. The new framework produces significant improvements in low-resource EBMT settings for English-to-French (~1.4 BLEU points), English-to-Chinese (~1 BLEU point), and English-to-Haitian (~2 BLEU points). The translation quality with templates created using the automatically estimated N was almost the same as with the empirically found best N. By discarding “incoherent” points, a further boost in translation scores is observed, even above the empirically found N.
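One standard way to estimate the number of clusters automatically in spectral clustering is the eigengap heuristic. The sketch below is an illustration of that general idea, an assumption on my part rather than the paper's exact method: it picks N at the largest gap between consecutive eigenvalues of the normalized graph Laplacian.

```python
import numpy as np

def estimate_num_clusters(similarity, max_k=10):
    """Eigengap heuristic: build the normalized graph Laplacian from a
    symmetric similarity matrix and choose k where the gap between
    consecutive (ascending) eigenvalues is largest. Well-separated
    clusters show up as near-zero eigenvalues followed by a jump."""
    degrees = similarity.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(degrees))
    laplacian = np.eye(len(similarity)) - d_inv_sqrt @ similarity @ d_inv_sqrt
    eigvals = np.sort(np.linalg.eigvalsh(laplacian))[:max_k]
    gaps = np.diff(eigvals)
    return int(np.argmax(gaps)) + 1

# Two perfectly separated groups of three points each:
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
print(estimate_num_clusters(A))  # prints 2
```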

Monday, March 15, 2010

Two talks

(1) Greg Hanneman:
Title: The Stat-XFER Group Submission for WMT '10

Each year, the Workshop on Statistical Machine Translation collects state-of-the-art MT results for a variety of European language pairs via a shared translation task. In this talk, I will describe the CMU Stat-XFER MT group's submission to this year's WMT French--English track, our third submission to the WMT series, using the Joshua decoder. A large focus will be on new modeling decisions and system-building techniques that have changed from earlier submissions based on new research carried out in our group. I will also present some open questions facing builders of large-scale hierarchical MT systems in general.

(2) Vamshi Ambati:
Title: Making sense of Crowd data for Machine Translation

Quality of crowd data is a common concern in crowd-sourcing approaches to data collection. When working with crowd data, the objectives are twofold: maximizing the quality of data from non-experts, and minimizing the cost of annotation by pruning noisy annotators.
I will discuss our recent experiments in Machine Translation on selecting high-quality crowd translations by explicitly modeling annotator reliability based on agreement with other submissions. I will also present some preliminary results on cost minimization and discuss their adaptability and feasibility for machine translation.
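As a toy version of agreement-based reliability modeling (my own simplification, not the actual experimental setup), one can score each annotator by how often their submission matches the majority of the other annotators' submissions for the same item:

```python
from collections import Counter

def annotator_reliability(submissions):
    """Score each annotator by how often their answer agrees with the
    majority answer of the *other* annotators for the same item.
    `submissions` maps annotator -> list of answers, one per item."""
    annotators = list(submissions)
    n_items = len(next(iter(submissions.values())))
    scores = {}
    for a in annotators:
        agree = 0
        for i in range(n_items):
            others = [submissions[b][i] for b in annotators if b != a]
            majority, _ = Counter(others).most_common(1)[0]
            agree += submissions[a][i] == majority
        scores[a] = agree / n_items
    return scores

scores = annotator_reliability({
    "a1": ["x", "y", "z", "w"],
    "a2": ["x", "y", "z", "w"],
    "a3": ["x", "y", "z", "w"],
    "a4": ["x", "q", "z", "r"],
})
# a4 disagrees with the unanimous majority on two of four items,
# so a4 scores 0.5 while the others score 1.0.
```

Low-scoring annotators are then candidates for pruning, which targets the cost-minimization objective above.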

Thursday, February 18, 2010

Nonparametric Word Segmentation for Machine Translation

Speaker: Thuylinh Nguyen
Title: Nonparametric Word Segmentation for Machine Translation
Thursday 18 Feb 2010. 12-1:30pm in GHC 4405.

In this talk we present an unsupervised word segmentation model for
machine translation. The model utilizes existing nonparametric monolingual
segmentations. The monolingual segmentation model and the bilingual word
alignment model are coupled so that source text segmentation optimizes
the one-to-one mapping with the target text. Often, there are words in
the source language that do not appear in the target language and vice
versa. Our model therefore also models source-language word deletion and
insertion. The experiments show improvements on Arabic-English and
Chinese-English translation tasks.

Wednesday, January 13, 2010

LoonyBin: Making Empirical MT Reproducible, Efficient, and Less Annoying

Speaker: Jonathan Clark
When: Tuesday, January 19 at Noon
Where: GHC 6501
What: Free Knowledge and Free Food
Title: LoonyBin: Making Empirical MT Reproducible, Efficient, and
Less Annoying

Abstract: Construction of machine translation systems has evolved into
a multi-stage workflow involving many complicated dependencies. Many
decoder distributions have addressed this by including monolithic
training scripts, for Moses and for SAMT. However, such scripts can be
tricky to modify for novel
experiments and typically have limited support for the variety of job
schedulers found on academic and commercial computer clusters. Further
complicating these systems are hyperparameters, which often cannot be
directly optimized by conventional methods, requiring users to
determine which combination of values is best via trial and error. The
recently-released LoonyBin open-source workflow management tool
addresses these issues by providing: 1) a visual interface for the
user to create and modify workflows; 2) a well-defined logging
mechanism; 3) a script generator that compiles visual workflows into
shell scripts; and 4) the concept of Hyperworkflows, which intuitively
and succinctly encode small experimental variations within a larger
workflow. We also describe the Machine Translation Toolpack for
LoonyBin, which exposes state-of-the-art machine translation tools as
drag-and-drop components within LoonyBin.
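The Hyperworkflow idea can be sketched in a few lines (a hypothetical illustration, not LoonyBin's actual format): steps that carry several alternative settings expand into one concrete workflow per combination, so a small experimental variation is written once inside the larger workflow.

```python
from itertools import product

def expand_hyperworkflow(steps):
    """Expand a hyperworkflow -- a list of (step_name, alternatives)
    pairs -- into every concrete workflow, one dict per combination
    of alternatives across all steps."""
    names = [name for name, _ in steps]
    for combo in product(*(alts for _, alts in steps)):
        yield dict(zip(names, combo))

hyper = [
    ("align", ["giza++"]),        # fixed step, a single alternative
    ("lm_order", [3, 5]),         # experimental variation
    ("tuner", ["mert", "mira"]),  # experimental variation
]
workflows = list(expand_hyperworkflow(hyper))
print(len(workflows))  # prints 4: two LM orders x two tuners
```

A workflow manager can then schedule the shared upstream steps once and branch only where the variations diverge.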