Monday, June 28, 2010

Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish

Tuesday, June 29 at Noon, in GHC 6501

Title: Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical
Machine Translation from English to Turkish

Authors: Reyyan Yeniterzi and Kemal Oflazer


Abstract:

We present a novel scheme to apply factored phrase-based SMT to a
language pair with very disparate morphological structures. Our
approach relies on syntactic analysis on the source side (English) and
then encodes a wide variety of local and non-local syntactic
structures as complex structural tags which appear as additional
factors in the training data. On the target side (Turkish), we only
perform morphological analysis and disambiguation but treat the
complete complex morphological tag as a factor, instead of separating
morphemes. We incrementally explore capturing various syntactic
substructures as complex tags on the English side, and evaluate how
our translations improve in BLEU scores. Our maximal set of source and
target side transformations, coupled with some additional techniques,
provides a 39% relative improvement, from a baseline of 17.08 to 23.78
BLEU, all averaged over 10 training and test sets. Now that the
syntactic analysis on the English side is available, we also
experiment with more long-distance constituent reordering to bring the
English constituent order closer to that of Turkish, but find that these
transformations do not provide any additional consistent tangible
gains when averaged over the 10 sets.
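
As a quick check on the figures quoted in the abstract, the relative
improvement can be recomputed from the two BLEU scores. This is only an
illustrative sketch: the function name relative_improvement is
hypothetical and not taken from the paper, and the only inputs are the
baseline (17.08) and best (23.78) scores reported above.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline`, as a percentage."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# BLEU scores as reported in the abstract (averaged over 10 sets).
baseline_bleu = 17.08
best_bleu = 23.78

gain = relative_improvement(baseline_bleu, best_bleu)
print(f"absolute gain: {best_bleu - baseline_bleu:.2f} BLEU")
print(f"relative gain: {gain:.1f}%")
```

The absolute gain is 6.70 BLEU points, which is roughly 39% of the
baseline score, in line with the figure stated in the abstract.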
