A Comparative Analysis of Hindi Multi Word Expressions using Relevance Measure-RMMWE
Rakhi Joon1, Archana Singhal2
1Rakhi Joon, Department of Computer Science, University of Delhi, Delhi, India.
2Archana Singhal, Department of Computer Science, IP College for Women, University of Delhi, Delhi, India.
Manuscript received on 02 June 2019 | Revised Manuscript received on 10 June 2019 | Manuscript published on 30 June 2019 | PP: 3436-3445 | Volume-8 Issue-8, June 2019 | Retrieval Number: H6869068819/19©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Text processing is a complex and tedious task because of the various types of ambiguity present in text. Researchers have suggested several methods for text processing, chiefly the generic procedures of word extraction and phrase extraction. Word extraction deals with extracting meaningful words from the text, while phrase extraction is the process of extracting relevant phrases. Multiword Expression (MWE) extraction is a phrase extraction procedure in which key phrases serve as an alias for multiword lexemes. An MWE consists of two or more words that together convey a meaning different from the meanings of the individual words. The measures used for the extraction and analysis of MWEs fall mainly into baseline and statistical measures. The baseline measures considered are Precision, Recall and F-Measure, while the statistical measures considered are Pointwise Mutual Information (PMI), Dice Coefficient (DC), and Modified Dice Coefficient (MDC). In the proposed work, one additional measure in the statistical category, the Relevance Measure (RM), is proposed alongside the existing ones. The Relevance Measure is evaluated based on the frequency of occurrence of MWEs in Hindi text. The dataset used for the experiments is a Hindi dataset taken from the famous Hindi novel ‘Godaan’. An algorithm has also been designed for evaluating the Relevance Measure. These measures are evaluated for 2-gram MWEs and n-gram MWEs. The values calculated for each measure for the different categories of Hindi MWEs are shown in tabular form, and the results are analyzed and discussed with the help of histograms of RM and the other measures. RM has not been considered statistically until now, which makes it difficult to determine which type of Hindi MWE is more relevant. To address this issue, RM for Hindi MWEs is explored in the proposed work, and its inclusion is justified by comparing the results with the other existing measures.
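For readers unfamiliar with the existing measures named in the abstract, the following minimal Python sketch shows how PMI, the Dice Coefficient, and Precision/Recall/F-Measure are commonly computed for 2-gram candidates. It uses the standard textbook definitions, which may differ in detail from the formulations used in the paper. The function names `bigram_scores` and `precision_recall_f1` are hypothetical, and the sketch does not implement MDC or the proposed Relevance Measure, since the abstract does not give their formulas.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): (pmi, dice)} for every adjacent token pair.

    Assumption: textbook definitions, not necessarily the paper's.
      PMI(w1, w2)  = log2( P(w1, w2) / (P(w1) * P(w2)) )
      Dice(w1, w2) = 2 * f(w1, w2) / (f(w1) + f(w2))
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)  # avoid division by zero on tiny input

    scores = {}
    for (w1, w2), f12 in bigram_counts.items():
        f1, f2 = unigram_counts[w1], unigram_counts[w2]
        p12 = f12 / n_bigrams          # relative frequency of the pair
        p1 = f1 / n_tokens             # relative frequency of each word
        p2 = f2 / n_tokens
        pmi = math.log2(p12 / (p1 * p2))
        dice = 2 * f12 / (f1 + f2)
        scores[(w1, w2)] = (pmi, dice)
    return scores


def precision_recall_f1(extracted, gold):
    """Baseline measures for a set of extracted MWEs against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy token sequence with placeholder tokens (not data from the paper).
    toy_tokens = ["a", "b", "a", "b", "c"]
    for pair, (pmi, dice) in bigram_scores(toy_tokens).items():
        print(pair, round(pmi, 3), round(dice, 3))
```

In practice the token list would come from a tokenized Hindi corpus, and the candidate pairs would usually be filtered by a minimum frequency before ranking, because PMI tends to overrate rare pairs.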
Keywords: Relevance Measure, Multiword Expressions, Hindi, Keyphrase, NLP, Statistical Measures.
Scope of the Article: Analysis of Algorithms and Computational Complexity.