Tuesday, June 29 at Noon, in GHC 6501
Title: Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical
Machine Translation from English to Turkish
Authors: Reyyan Yeniterzi and Kemal Oflazer
Abstract:
We present a novel scheme to apply factored phrase-based SMT to a
language pair with very disparate morphological structures. Our
approach relies on syntactic analysis on the source side (English) and
then encodes a wide variety of local and non-local syntactic
structures as complex structural tags which appear as additional
factors in the training data. On the target side (Turkish), we only
perform morphological analysis and disambiguation but treat the
complete complex morphological tag as a factor, instead of separating
morphemes. We incrementally explore capturing various syntactic
substructures as complex tags on the English side, and evaluate how
our translations improve in BLEU scores. Our maximal set of source and
target side transformations, coupled with some additional techniques,
provides a 39% relative improvement, from a baseline of 17.08 to 23.78
BLEU, all averaged over 10 training and test sets. Now that the
syntactic analysis on the English side is available, we also
experiment with more long distance constituent reordering to bring the
English constituent order close to Turkish, but find that these
transformations do not provide any additional consistent tangible
gains when averaged over the 10 sets.
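As a quick check of the arithmetic in the abstract, the relative improvement can be recomputed from the two reported BLEU scores. This is only a sketch; the function name is illustrative and does not come from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline`, as a percentage of `baseline`."""
    return (improved - baseline) / baseline * 100.0


# BLEU scores reported in the abstract (averaged over 10 training and test sets).
gain = relative_improvement(17.08, 23.78)
print(f"{gain:.1f}% relative improvement")  # prints "39.2% relative improvement"
```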
Monday, June 28, 2010
Wednesday, May 12, 2010
Chunk-Based EBMT
When: noon on May 18
Where: GHC 6501
Speaker: Jaedy Kim
Topic: Chunk-Based EBMT
Abstract: Corpus-driven machine translation approaches such as Phrase-Based Statistical Machine Translation and Example-Based Machine Translation have been successful by using word alignment to find translation fragments for matched source parts in a bilingual training corpus.
However, they still cannot properly handle the systematic insertion or deletion of words when translating between two distant languages.
In this work, we use syntactic chunks as translation units to alleviate this problem and improve alignments, and we show improvements in BLEU for Korean-to-English and Chinese-to-English translation tasks.
Monday, April 12, 2010
Generalized templates for EBMT
Speaker: Rashmi Gangadharaiah
Location: GHC 6501
Topic: Generalized templates for EBMT
Abstract:
-----------
Example-Based Machine Translation (EBMT), like other corpus-based methods, requires substantial parallel training data. One way to reduce data requirements and improve translation quality is to generalize parts of the parallel corpus into translation templates. This automated generalization process requires clustering. In most clustering approaches the optimal number of clusters (N) is found empirically on a development set, which often takes several days. We introduce a spectral clustering framework that automatically estimates the optimal N and removes unstable oscillating points. The new framework produces significant improvements in low-resource EBMT settings for English-to-French (~1.4 BLEU points), English-to-Chinese (~1 BLEU point), and English-to-Haitian (~2 BLEU points). Translation quality with templates created using the automatically found N was almost the same as with the empirically found best N. By discarding “incoherent” points, a further boost in translation scores is observed, even above that obtained with the empirically found N.
Monday, March 15, 2010
Two talks
(1) Greg Hanneman:
Title: The Stat-XFER Group Submission for WMT '10
Abstract:
Each year, the Workshop on Statistical Machine Translation collects state-of-the-art MT results for a variety of European language pairs via a shared translation task. In this talk, I will describe the CMU Stat-XFER MT group's submission to this year's WMT French--English track, our third submission to the WMT series, using the Joshua decoder. A large focus will be on new modeling decisions or system-building techniques that have changed from earlier submissions based on new research carried out in our group. I will also present some open questions facing builders of large-scale hierarchical MT systems in general.
(2) Vamshi Ambati:
Title: Making sense of Crowd data for Machine Translation
Abstract:
Quality of crowd data is a common concern in crowd-sourcing approaches to data collection. When working with crowd data, the objectives are twofold: maximizing the quality of data from non-experts, and minimizing the cost of annotation by pruning noisy annotators.
I will discuss our recent experiments in Machine Translation on selecting high-quality crowd translations by explicitly modeling annotator reliability based on agreement with other submissions. I will also present some preliminary results in cost minimization and report on their adaptation to, and feasibility for, machine translation.
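To make the idea of agreement-based reliability concrete, here is a minimal sketch. It is not the speaker's model: the token-overlap similarity, the function names, and the toy data are all assumptions chosen for illustration. Each annotator is scored by how much their translations agree with other annotators' translations of the same sentences, and the translation from the highest-scoring annotator is kept for each sentence.

```python
from collections import defaultdict


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two translations (an assumed stand-in similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def annotator_reliability(submissions):
    """submissions: {sentence_id: {annotator: translation}}.

    Reliability is the mean similarity between an annotator's translations
    and the other annotators' translations of the same sentences.
    """
    scores = defaultdict(list)
    for per_annotator in submissions.values():
        for ann, text in per_annotator.items():
            for other, other_text in per_annotator.items():
                if other != ann:
                    scores[ann].append(jaccard(text, other_text))
    return {ann: sum(v) / len(v) for ann, v in scores.items() if v}


def select_translations(submissions):
    """For each sentence, keep the translation from the most reliable annotator."""
    rel = annotator_reliability(submissions)
    return {
        sid: per_annotator[max(per_annotator, key=lambda a: rel.get(a, 0.0))]
        for sid, per_annotator in submissions.items()
    }


# Toy data, invented for illustration only.
toy = {
    "s1": {"A": "the cat sat on the mat", "B": "the cat sat on a mat", "C": "dog runs fast"},
    "s2": {"A": "i like green tea", "B": "i like tea", "C": "blue sky"},
}
reliability = annotator_reliability(toy)
chosen = select_translations(toy)
```

In this toy setup annotator C never agrees with anyone and receives a reliability of zero, so a cost-minimizing scheme could stop requesting translations from C. A real system would likely use a stronger similarity measure than token overlap.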
Thursday, February 18, 2010
Nonparametric Word Segmentation for Machine Translation
Speaker: Thuylinh Nguyen
Title: Nonparametric Word Segmentation for Machine Translation
Thursday 18 Feb 2010. 12-1:30pm in GHC 4405.
In this talk we present an unsupervised word segmentation for machine
translation. The model utilizes existing nonparametric monolingual
segmentations. The monolingual segmentation model and the bilingual word
alignment model are coupled so that source text segmentation optimizes
the one-to-one mapping with the target text. Often, there are words in
the source language that do not appear in the target language and vice
versa. Our model therefore models source language word deletion and word
insertion. The experiments show improvements on Arabic-English and
Chinese-English translation tasks.
Wednesday, January 13, 2010
LoonyBin: Making Empirical MT Reproducible, Efficient, and Less Annoying
Speaker: Jonathan Clark
When: Tuesday, January 19 at Noon
Where: GHC 6501
What: Free Knowledge and Free Food
Title: LoonyBin: Making Empirical MT Reproducible, Efficient, and
Less Annoying
Abstract: Construction of machine translation systems has evolved into
a multi-stage workflow involving many complicated dependencies. Many
decoder distributions have addressed this by including monolithic
training scripts – train-factored-model.pl for Moses and mr_runmer.pl
for SAMT. However, such scripts can be tricky to modify for novel
experiments and typically have limited support for the variety of job
schedulers found on academic and commercial computer clusters. Further
complicating these systems are hyperparameters, which often cannot be
directly optimized by conventional methods, requiring users to
determine which combination of values is best via trial and error. The
recently-released LoonyBin open-source workflow management tool
addresses these issues by providing: 1) a visual interface for the
user to create and modify workflows; 2) a well-defined logging
mechanism; 3) a script generator that compiles visual workflows into
shell scripts, and 4) the concept of Hyperworkflows, which intuitively
and succinctly encodes small experimental variations within a larger
workflow. We also describe the Machine Translation Toolpack for
LoonyBin, which exposes state-of-the-art machine translation tools as
drag-and-drop components within LoonyBin.
Wednesday, December 9, 2009
MEMT and METEOR
Kenneth Heafield and Michael Denkowski: Features for System Combination
(This is work done as an MT lab project.)
Michael will give an update on his recent work on the METEOR MT evaluation metric.
10 Dec 2009, Thursday, 12:00-1:30, in GHC 6501