(1) Greg Hanneman:
Title: The Stat-XFER Group Submission for WMT '10
Abstract:
Each year, the Workshop on Statistical Machine Translation collects state-of-the-art MT results for a variety of European language pairs via a shared translation task. In this talk, I will describe the CMU Stat-XFER MT group's submission to this year's WMT French--English track, our third submission to the WMT series, using the Joshua decoder. A large focus will be on new modeling decisions or system-building techniques that have changed from earlier submissions based on new research carried out in our group. I will also present some open questions facing builders of large-scale hierarchical MT systems in general.
(2) Vamshi Ambati:
Title: Making sense of Crowd data for Machine Translation
Abstract:
Quality of crowd data is a common concern in crowd-sourcing approaches to data collection. When working with crowd data, the objectives are twofold: maximizing the quality of data from non-experts, and minimizing the cost of annotation by pruning noisy annotators.
I will discuss our recent experiments in Machine Translation on selecting high-quality crowd translations by explicitly modeling annotator reliability based on agreement with other submissions. I will also present some preliminary results in cost minimization and report on how well these techniques adapt to machine translation.
Monday, March 15, 2010
Thursday, February 18, 2010
Nonparametric Word Segmentation for Machine Translation
Speaker: Thuylinh Nguyen
Title: Nonparametric Word Segmentation for Machine Translation
Thursday 18 Feb 2010. 12-1:30pm in GHC 4405.
In this talk we present an unsupervised word segmentation model for machine
translation. The model utilizes existing nonparametric monolingual
segmentations. The monolingual segmentation model and the bilingual word
alignment model are coupled so that source text segmentation optimizes
the one-to-one mapping with the target text. Often, there are words in
the source language that do not appear in the target language and vice
versa. Our model therefore models source language word deletion and word
insertion. The experiments show improvements on Arabic-English and
Chinese-English translation tasks.
Wednesday, January 13, 2010
LoonyBin: Making Empirical MT Reproducible, Efficient, and Less Annoying
Speaker: Jonathan Clark
When: Tuesday, January 19 at Noon
Where: GHC 6501
What: Free Knowledge and Free Food
Title: LoonyBin: Making Empirical MT Reproducible, Efficient, and
Less Annoying
Abstract: Construction of machine translation systems has evolved into
a multi-stage workflow involving many complicated dependencies. Many
decoder distributions have addressed this by including monolithic
training scripts – train-factored-model.pl for Moses and mr_runmer.pl
for SAMT. However, such scripts can be tricky to modify for novel
experiments and typically have limited support for the variety of job
schedulers found on academic and commercial computer clusters. Further
complicating these systems are hyperparameters, which often cannot be
directly optimized by conventional methods, requiring users to
determine which combination of values is best via trial and error. The
recently-released LoonyBin open-source workflow management tool
addresses these issues by providing: 1) a visual interface for the
user to create and modify workflows; 2) a well-defined logging
mechanism; 3) a script generator that compiles visual workflows into
shell scripts; and 4) the concept of Hyperworkflows, which intuitively
and succinctly encodes small experimental variations within a larger
workflow. We also describe the Machine Translation Toolpack for
LoonyBin, which exposes state-of-the-art machine translation tools as
drag-and-drop components within LoonyBin.
Wednesday, December 9, 2009
MEMT and METEOR
Kenneth Heafield and Michael Denkowski: Features for System Combination
(This is work done as an MT lab project.)
Michael will give an update on his recent work on the METEOR MT evaluation metric.
10 Dec 2009, Thursday, 12:00-1:30, in GHC 6501
Monday, November 9, 2009
Lori's talk
Speaker: Lori Levin
Where: GHC 6501
When: Nov 09, 2009 - Tuesday - Noon
Title: A Pendulum Swung Too Far
Abstract: This paper by Ken Church deals with the never-ending battle between Empiricism and Rationalism,
especially its incarnation in NLP. Lori will summarize and present the arguments formulated in the
paper. She will then continue with her own views on why linguistics
needs to be brought back into NLP, and MT in particular.
Monday, August 10, 2009
Two talks
Talk 1:
Nguyen Bach: Source-side Dependency Tree Reordering Models with Subtree Movements and Constraints
Abstract: We propose a novel source-side dependency tree reordering model for statistical machine translation, in which subtree movements and constraints are represented as reordering events associated with the widely used lexicalized reordering models. This model allows us not only to efficiently capture the statistical distribution of the subtree-to-subtree transitions in training data, but also to use it directly at decoding time to guide the search process. Using subtree movements and constraints as features in a log-linear model, we are able to help the reordering models make better selections. It also allows the subtle importance of monolingual syntactic movements to be learned alongside other reordering features. We show improvements in translation quality in English-Spanish and English-Iraqi translation tasks.
This is joint work with Qin Gao and Stephan Vogel.
Talk 2:
Francisco (Paco) Guzman: Reassessment of the Role of Phrase Extraction in SMT
Abstract: In this paper we study in detail the relation between word alignment and phrase extraction. First, we analyze different word alignments according to several characteristics and compare them to hand-aligned data. Then, we analyze the phrase pairs generated by these alignments. We observe that the number of unaligned words has a large impact on the characteristics of the phrase table. A manual evaluation of phrase pair quality shows that an increase in the number of unaligned words results in lower quality. Finally, we present translation results from using the number of unaligned words as features, from which we obtain up to 2 BLEU points of improvement.
This is joint work with Qin Gao and Stephan Vogel.
Monday, June 15, 2009
Making Disfluent Output Slightly Less So: MT System Combination Search Spaces and Optimization
Speaker: Kenneth Heafield
Title: Making Disfluent Output Slightly Less So:
MT System Combination Search Spaces and Optimization
Abstract: System combination merges several machine translation outputs
into a single improved sentence. This talk starts by summarizing the
approach, including a search space derived from the alignments, and
hypothesis scoring. The current search space focuses on picking words
in a roughly word-synchronous way. Another search space under development
builds a directed graph in which aligned words correspond to a vertex and
each bigram corresponds to a directed edge. Search is conducted much like
a left-to-right MT decoder. Speed optimizations, which allow decoding at
5.5 sentences per second, apply to other MT systems in the areas of
duplicate handling, language model state, and multithreading. This speed
allows me to find hyperparameters by searching hundreds of parameter
combinations, each with a full round of tuning. In preparation for
last Friday's NIST submission, system combination improved 2.4 BLEU
points over the best component system for Urdu to English translation.