Tuesday, April 22, 2008

Simulating Sentence Pairs Sampling Process via Source and Target Language Models

Speaker: Nguyen Bach

Abstract: In a traditional word alignment process, every sentence pair is assigned the same occurrence count, which is normalized during training to produce the empirical probability. However, some sentence pairs may be more valuable, reliable, and appropriate than others, and these should therefore carry a higher weight in training. To address this problem, we explored methods of resampling sentence pairs. We investigated three sets of features: sentence pair confidence (/sc/), genre-dependent sentence pair confidence (/gdsc/), and sentence-dependent phrase alignment confidence (/sdpc/) scores. These features were calculated over the entire training corpus and could easily be integrated into a phrase-based machine translation system.
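
As a rough illustration of the weighting idea in the abstract, the sketch below contrasts equal occurrence counts with per-pair weights, each normalized into a probability distribution. It is a minimal sketch and not the speaker's method: the function name `empirical_probabilities` and the confidence values are hypothetical, chosen only to show how a higher-scoring sentence pair ends up with more probability mass.

```python
def empirical_probabilities(counts):
    """Normalize non-negative occurrence counts (or weights) so they sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


# Traditional setting: every sentence pair gets the same occurrence count.
uniform = empirical_probabilities([1, 1, 1, 1])

# Weighted setting: hypothetical per-pair confidence scores (made-up values,
# standing in for whatever confidence features are actually computed).
confidence_scores = [0.9, 0.3, 0.6, 0.2]
weighted = empirical_probabilities(confidence_scores)

print(uniform)
print(weighted)
```

With equal counts every pair receives the same probability, while in the weighted case the pair with the highest score receives the largest share.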