Speaker: Matthias Paulik
Title: "Learning from Human Interpreter Speech"
Date: 12 August 2008
Abstract:
Can spoken language translation (SLT) profit from human interpreter speech? In this talk, we explore scenarios which involve live human interpretation, off-line transcription and off-line translation on a massive scale. We consider the deployment of machine translation (MT) and automatic speech recognition (ASR) for the off-line transcription and translation tasks; our systems are trained on 80+ hours of audio data and on parallel text corpora of ~40 million words. To improve performance, we use the available human interpreter speech as an auxiliary information source to bias ASR and MT language models. We evaluate this approach on European Parliament Plenary Session (EPPS) data in three languages (English, Spanish and German), and report preliminary improvements in translation and transcription performance.
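The core idea of biasing a language model toward interpreter speech can be sketched as linear interpolation: a large background LM is mixed with a small in-domain LM estimated from the available interpreter transcript. The sketch below uses toy unigram models and an interpolation weight chosen arbitrarily; it is an illustration of the general technique, not the speaker's exact system.

```python
from collections import Counter

def mle_unigram(tokens):
    """Maximum-likelihood unigram probabilities from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(p_background, p_bias, lam=0.7):
    """Linear interpolation: P(w) = lam * P_bg(w) + (1 - lam) * P_bias(w)."""
    vocab = set(p_background) | set(p_bias)
    return {w: lam * p_background.get(w, 0.0) + (1 - lam) * p_bias.get(w, 0.0)
            for w in vocab}

# Toy data: large general corpus vs. small interpreter-speech transcript
background = "the session is open the debate is closed".split()
interpreter = "the budget debate is open".split()

p_bg = mle_unigram(background)
p_int = mle_unigram(interpreter)
p_biased = interpolate(p_bg, p_int, lam=0.7)
# Words prominent in the interpreter speech gain probability mass.
```

In a real system the same mixing would be applied to higher-order n-gram models, but the effect is the same: vocabulary used by the human interpreter becomes more likely under the biased model.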
Tuesday, July 15, 2008
Improving Lexical Coverage of Syntax-driven MT by Re-structuring Non-isomorphic Trees
Speaker: Vamshi Ambati
Date: 15 July 2008
Abstract:
Syntax-based approaches to statistical MT require syntax-aware methods for acquiring their underlying translation models from parallel data. This acquisition process can be driven by syntactic trees for either the source or target language, or by trees on both sides. Work to date has demonstrated that using trees for both sides suffers from severe coverage problems. Approaches that project from trees on one side, on the other hand, have higher levels of recall, but suffer from lower precision, due to the lack of syntactically-aware word alignments.
In this talk, I first discuss rule extraction and the lexical coverage of the translation models learned in both of these scenarios. We will look specifically at how the non-isomorphic nature of the parse trees for the two languages affects recall and coverage. I will then present a novel technique for restructuring target parse trees that generates highly isomorphic target trees while preserving the syntactic boundaries of constituents aligned in the original parse trees. I will conclude with an experimental evaluation of an English-French MT system.
Tuesday, June 10, 2008
Scalable Decoding for Syntax based SMT
Title: Scalable Decoding for Syntax based SMT
Speaker: Abhaya Agarwal
Joint work with Vamshi Ambati
Date: June 10, 2008
Time: 12:00pm - 1:30pm
Location: NSH 3305
Abstract:
Most large-scale SMT systems today use a hybrid model, combining a handful of generative models in a discriminative log-linear framework whose weights are trained on a small amount of development data. Attempts to train fully discriminative MT models with a large number of features face difficult scalability challenges, because training requires repeated decoding of a large training set. In this talk, I will describe a decoder that aims to be efficient enough to allow such training, along with some initial experiments on training with a large number of features.
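The log-linear framework described above scores each translation hypothesis as a weighted sum of feature values and picks the highest-scoring one. A minimal sketch, with invented feature names and toy weights standing in for tuned ones:

```python
def loglinear_score(features, weights):
    """Log-linear model score: sum_i w_i * h_i(e, f)."""
    return sum(weights[name] * value for name, value in features.items())

def best_hypothesis(hypotheses, weights):
    """Pick the hypothesis with the highest model score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))

# Illustrative feature weights (in practice tuned on development data)
weights = {"tm_logprob": 1.0, "lm_logprob": 0.6, "length_penalty": -0.2}

hypotheses = [
    {"text": "the house is small",
     "features": {"tm_logprob": -2.1, "lm_logprob": -3.0, "length_penalty": 4}},
    {"text": "small is the house",
     "features": {"tm_logprob": -2.0, "lm_logprob": -5.5, "length_penalty": 4}},
]
best = best_hypothesis(hypotheses, weights)
```

With only a handful of dense features this maximization is cheap; the scalability problem the talk addresses arises when the feature set grows large and the decoder must be run repeatedly over the whole training set.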
Tuesday, May 13, 2008
Statistical Transfer MT Systems for French and German
Speaker: Greg Hanneman
Date: Tuesday, May 13, at noon
Location: Wean Hall 4623
Title: Statistical Transfer MT Systems for French and German
Abstract: The AVENUE research group's statistical transfer system is a general framework for creating hybrid machine translation systems. It uses two main resources: a weighted synchronous context-free grammar, and a probabilistic bilingual lexicon of syntax-based word- and phrase-level translations. Over the last six months, we have developed new methods for extracting these resources automatically from parsed and word-aligned parallel corpora. In this talk, I will describe the resource-extraction process as it was applied to new French--English and German--English systems for the upcoming ACL workshop on statistical machine translation. Preliminary evaluation results --- both automatic and human-assessed --- will also be reviewed.
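As a rough illustration of the two resources, a weighted synchronous CFG rule pairs a source and a target right-hand side with co-indexed nonterminals, and the bilingual lexicon maps word pairs to probabilities. The data structures and the example rule below are invented for illustration, not taken from the AVENUE system itself:

```python
from dataclasses import dataclass

@dataclass
class SyncRule:
    """One weighted synchronous CFG rule: lhs -> <src_rhs, tgt_rhs>.
    Co-indexed nonterminals are written as e.g. 'NP#1'."""
    lhs: str
    src_rhs: tuple
    tgt_rhs: tuple
    weight: float  # e.g. a log-probability

# French adjectives typically follow the noun; English adjectives precede it.
rule = SyncRule(
    lhs="NP",
    src_rhs=("DET#1", "N#2", "ADJ#3"),   # French constituent order
    tgt_rhs=("DET#1", "ADJ#3", "N#2"),   # English constituent order
    weight=-0.22,
)

# A probabilistic bilingual lexicon entry: (src, tgt) -> P(tgt | src)
lexicon = {("maison", "house"): 0.82, ("maison", "home"): 0.11}
```

Extracting many such rules and lexicon entries automatically from parsed, word-aligned parallel text is the process the talk describes.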
Tuesday, April 22, 2008
Simulating Sentence Pairs Sampling Process via Source and Target Language Models
Speaker: Nguyen Bach
Abstract: In a traditional word alignment process, each sentence pair is assigned an equal occurrence count, which is normalized during training to produce the empirical probability. However, some sentence pairs may be more valuable, reliable, and appropriate than others, and should therefore receive higher weight during training. To address this, we explored methods for resampling sentence pairs. We investigated three sets of features: sentence pair confidence (sc), genre-dependent sentence pair confidence (gdsc), and sentence-dependent phrase alignment confidence (sdpc) scores. These features are computed over the entire training corpus and can easily be integrated into a phrase-based machine translation system.
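The resampling idea can be sketched as drawing sentence pairs with probability proportional to their confidence score, so that reliable pairs contribute more occurrence counts to alignment training. The pairs and confidence values below are toy placeholders:

```python
import random

def resample_pairs(pairs, confidences, k, seed=0):
    """Resample k sentence pairs with probability proportional to confidence,
    so reliable pairs contribute more counts to alignment training."""
    rng = random.Random(seed)
    return rng.choices(pairs, weights=confidences, k=k)

pairs = [("la maison", "the house"),
         ("chat noir", "black cat"),
         ("mot rare", "rare word")]
conf = [0.9, 0.8, 0.1]  # illustrative confidence scores

corpus = resample_pairs(pairs, conf, k=1000)
# High-confidence pairs now appear far more often in the resampled corpus.
```

The resampled corpus is then fed to the standard alignment training pipeline unchanged, which is what makes the approach easy to integrate.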
Wednesday, March 19, 2008
Communicating Unknown Words in Machine Translation
Speaker: Matthias Eck
Title: Communicating Unknown Words in Machine Translation
Abstract:
Unknown words are a major problem for every machine translation system. Regular evaluations and demos do not always make this apparent, but in real communication the lack of specialty vocabulary and named-entity translations can seriously impair the ability to communicate.
A new approach is presented that uses monolingual encyclopedias and dictionaries to "communicate" unknown words. Instead of the actual unknown word, its definition is extracted and translated, which leads to considerable improvements in translation quality.
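The substitution step can be sketched as follows: when a word is out of vocabulary, look up its definition in a monolingual dictionary and translate the definition instead. The toy word-by-word "translator" and the dictionary entry below are stand-ins for the real MT system and encyclopedia:

```python
def translate_with_definitions(tokens, vocab_translate, definitions):
    """Replace each out-of-vocabulary word with its monolingual definition
    before translation, so the MT system can convey its meaning."""
    out = []
    for tok in tokens:
        if tok in vocab_translate:
            out.append(vocab_translate[tok])
        elif tok in definitions:
            # Translate the definition word by word instead of the OOV term.
            out.extend(vocab_translate.get(w, w) for w in definitions[tok].split())
        else:
            out.append(tok)  # no definition available: pass through unchanged
    return " ".join(out)

# Toy English-to-German lexicon and a monolingual dictionary entry
vocab_translate = {"the": "der", "dog": "Hund", "a": "ein", "small": "kleiner"}
definitions = {"chihuahua": "a small dog"}

result = translate_with_definitions("the chihuahua".split(),
                                    vocab_translate, definitions)
# -> "der ein kleiner Hund"
```

The output is no longer a faithful translation of the surface form, but it communicates the meaning of a word the system could not otherwise translate at all.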
Tuesday, November 13, 2007
Trees that can help
Speaker: Alok Parlikar
Title: (S (NP (NP Trees) (SBAR (WHNP that) (S (VP can)))) (VP help))
Summary:
For the past two months, I have been working with Alon Lavie and Stephan Vogel on Chinese and English parse trees, to investigate answers to the following questions:
(a) Can constituency information and word-level alignments be used to align nodes in the trees of parallel sentences? How precisely matched in meaning are the yields of these aligned nodes?
(b) Can the parse trees and word-level alignments be used to learn reordering rules? If we use these rules to reorder source sentences, can we do any better at translation?
The current results show that:
(a) Node alignments from hand-aligned data are very precise. Using automatic word alignments to align nodes gives over 70% precision and over 40% recall.
(b) Using a 10-best reordering of words in the source sentences with a "dumb" reordering strategy has shown a 0.005 improvement in BLEU score.
I would like to talk about the approaches we have taken here and to discuss strategies for improving these results.
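One common way to align nodes from word alignments is the consistency criterion also used in phrase extraction: a source node and a target node align if every alignment link inside one node's span lands inside the other's, and no outside link intrudes. A minimal sketch of that check, with a toy alignment; this is the standard criterion, not necessarily the exact one used in this work:

```python
def nodes_align(alignment, src_span, tgt_span):
    """A source node aligns to a target node when every word alignment link
    stays inside both spans (the standard consistency criterion).
    Spans are inclusive (lo, hi) index pairs over the node's yield."""
    s_lo, s_hi = src_span
    t_lo, t_hi = tgt_span
    inside = [(s, t) for s, t in alignment if s_lo <= s <= s_hi]
    # Links from outside the source span must not land inside the target span.
    intruding = [(s, t) for s, t in alignment
                 if not (s_lo <= s <= s_hi) and t_lo <= t <= t_hi]
    return (bool(inside)
            and all(t_lo <= t <= t_hi for _, t in inside)
            and not intruding)

# Word alignment links as (source_index, target_index) pairs
links = [(0, 1), (1, 0), (2, 2)]
ok = nodes_align(links, (0, 1), (0, 1))   # links stay inside both spans
bad = nodes_align(links, (0, 1), (0, 2))  # target span also claimed by src word 2
```

Running this check over all node pairs of two parse trees yields candidate node alignments whose precision and recall can then be measured against hand alignments, as in the results above.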