A Broad Coverage of Corpus for Understanding Translation Divergences
Simran Kaur Jolly1, Rashmi Agrawal2
1Simran Kaur Jolly, Research Scholar, Manav Rachna International Institute, Faridabad, India.
2Rashmi Agrawal, Research Scholar, Manav Rachna International Institute, Faridabad, India.
Manuscript received on 20 June 2019 | Revised Manuscript received on 27 June 2019 | Manuscript Published on 22 June 2019 | PP: 613-618 | Volume-8 Issue-8S2 June 2019 | Retrieval Number: H11030688S219/19©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open-access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: The objective of natural language understanding is to exploit rich resources such as text corpora for the semantic categorization of texts. Corpus-based statistical approaches are widely used in natural language understanding for language modeling and translation modeling. In this paper we apply sentence preprocessing with factored translation models to the Europarl dataset, and the results show that preprocessing reduces the number of out-of-vocabulary words. The paper also defines a methodology for preprocessing a parallel dataset from Europarl using a factored model, which can subsequently be used in machine translation.
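The effect described in the abstract can be illustrated with a minimal sketch (not the paper's actual Europarl pipeline): counting how many tokens of a sentence fall outside a training vocabulary before and after a simple normalization step. The vocabulary, tokenizer, and preprocessing rules below are toy assumptions for illustration only.

```python
# Illustrative sketch: measuring the out-of-vocabulary (OOV) count of a
# sentence before and after a simple preprocessing step.
import string

def tokenize(sentence):
    """Whitespace tokenizer; real pipelines use trained tokenizers."""
    return sentence.split()

def preprocess(sentence):
    """Toy normalization: lowercase and strip punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table)

def oov_count(sentence, vocab):
    """Number of tokens not found in the training vocabulary."""
    return sum(1 for tok in tokenize(sentence) if tok not in vocab)

# Hypothetical training vocabulary built from a preprocessed corpus.
vocab = {"the", "parliament", "adopted", "resolution"}

raw = "The Parliament adopted the resolution."
print(oov_count(raw, vocab))              # → 3 (casing and punctuation mismatch)
print(oov_count(preprocess(raw), vocab))  # → 0 after normalization
```

Even this crude normalization removes all OOV tokens in the toy example; the paper's factored approach extends the idea with linguistic factors such as POS tags.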
Keywords: Bag of Words (BOW), Corpus, Out-of-Vocabulary (OOV) Words, Part-of-Speech (POS) Tagger, Segmentation.
Scope of the Article: Natural Language Processing and Machine Translation