Correlation and Probability Based Similarity Measure for Detecting Outliers in Categorical Data
Roy Thomas¹, J. E. Judith²

¹ Roy Thomas*, Noorul Islam Centre for Higher Education, Kumaracoil, India.
² J. E. Judith, Noorul Islam Centre for Higher Education, Kumaracoil, India.
Manuscript received on December 13, 2019. | Revised Manuscript received on December 22, 2019. | Manuscript published on January 10, 2020. | PP: 2577-2582 | Volume-9 Issue-3, January 2020. | Retrieval Number: C9053019320/2020©BEIESP | DOI: 10.35940/ijitee.C9053.019320
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Determining the similarity or distance between data objects is an important part of many research fields, such as statistics, data mining, and machine learning. Many measures are available in the literature for defining the distance between two numerical data objects. Defining such a metric for two categorical data objects is difficult because categorical values have no inherent order, and only a few distance measures are available in the literature for finding similarities among categorical data objects. This paper presents a comparative evaluation of various similarity measures for categorical data and introduces a novel similarity measure for categorical data based on occurrence frequency and correlation. We evaluated the performance of these similarity measures on the outlier detection task in data mining using real-world data sets. Experimental results show that the proposed similarity measure outperforms the existing similarity measures in detecting outliers in categorical data sets.
Keywords: Categorical, Correlation, Outlier, Similarity
Scope of the Article: Data Analytics