An Efficient Multi-Phase Blocking Strategy for Entity Resolution in Big Data
Randa Mohamed Abd El-Ghafar1, Ali Hamed El-Bastawissy2, Eman S. Nasr3, Mervat H. Gheith4
1Randa M. Abd El-ghafar, Department of Computer Science, Faculty of Graduate Studies for Statistical Research, Cairo University, Cairo, Egypt.
2Ali H. El-Bastawissy, Faculty of Computer Science, Modern Sciences and Arts University, Cairo, Egypt.
3Eman S. Nasr, Independent Researcher, Cairo, Egypt.
4Mervat H. Gheith, Department of Computer Science, Faculty of Graduate Studies for Statistical Research, Cairo University, Cairo, Egypt.
Manuscript received on June 20, 2020. | Revised Manuscript received on June 28, 2020. | Manuscript published on July 10, 2020. | PP: 254-263 | Volume-9 Issue-9, July 2020 | Retrieval Number: 100.1/ijitee.I7070079920 | DOI: 10.35940/ijitee.I7070.079920
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Entity Resolution (ER) is the process of identifying records that refer to the same real-world entity. It plays a key role in many applications, such as data warehousing, data integration, and business intelligence. Comparing every record with all other records is infeasible, especially for big datasets. To overcome this problem, blocking techniques have been introduced. In this paper, we propose a novel Efficient Multi-Phase Blocking Strategy (EMPBS) for resolving duplicates in big data. To the best of our knowledge, some state-of-the-art blocking techniques (e.g., Q-grams) may produce overlapping blocks, which cause redundant comparisons and hence increase the time complexity. Our proposed blocking strategy produces disjoint blocks and has lower time complexity than Q-grams and standard blocking techniques. In addition, EMPBS is general and imposes no restrictions on the type of blocking keys. EMPBS consists of three phases. The first phase generates three efficient single blocking keys. The second phase takes the output of the first phase as input and constructs compound keys, each formed by concatenating two single blocking keys. The three compound blocking keys output by this phase are the input to the last phase, which generates the Efficient Multi-Phase Blocking Key (EMPBK). EMPBK is constructed using the union of two compound blocking keys. The implementation of EMPBS shows promising results in terms of Reduction Ratio (RR): it achieves a higher RR than a single blocking key while maintaining nearly the same precision and recall. EMPBS reduced the average number of comparisons by about 84% relative to a single blocking key. To evaluate EMPBS, we developed a Duplicate Generation tool (Dup Gen) that accepts a clean semi-structured file as input and generates labeled duplicate records according to certain criteria.
Keywords: Entity resolution, Record linkage, Big Data, Blocking Techniques.
Scope of the Article: Big Data Analytics
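The three phases outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy records, the field names (`first`, `last`, `zip`), the choice of which two compound keys are combined, and the reading of "union" as the union of the candidate pairs produced by two compound keys are all assumptions made for illustration. The `reduction_ratio` helper uses the common definition of RR as one minus the fraction of all possible record pairs that are actually compared.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; field names are illustrative, not from the paper.
RECORDS = [
    {"id": 1, "first": "john", "last": "smith", "zip": "11511"},
    {"id": 2, "first": "john", "last": "smith", "zip": "11511"},
    {"id": 3, "first": "jon", "last": "smith", "zip": "11511"},
    {"id": 4, "first": "mary", "last": "jones", "zip": "22622"},
    {"id": 5, "first": "mary", "last": "jones", "zip": "22633"},
    {"id": 6, "first": "ali", "last": "hassan", "zip": "11511"},
]

# Phase 1 (assumed): three single blocking keys, each computed from one record.
SINGLE_KEYS = {
    "first": lambda r: r["first"],
    "last": lambda r: r["last"],
    "zip": lambda r: r["zip"],
}


def compound_key(name_a, name_b):
    """Phase 2: concatenate two single blocking keys into one compound key."""
    key_a, key_b = SINGLE_KEYS[name_a], SINGLE_KEYS[name_b]
    return lambda r: key_a(r) + "|" + key_b(r)


def candidate_pairs(records, key_fn):
    """Group records by key value and return all within-block id pairs."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key_fn(record)].append(record["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def reduction_ratio(num_candidates, num_records):
    """RR = 1 - (compared pairs) / (all possible pairs)."""
    total_pairs = num_records * (num_records - 1) // 2
    return 1 - num_candidates / total_pairs


# Phase 2 output: three compound keys built from the three single keys.
COMPOUND_KEYS = {
    "first|last": compound_key("first", "last"),
    "last|zip": compound_key("last", "zip"),
    "first|zip": compound_key("first", "zip"),
}

# Phase 3 (assumed reading of "union"): a pair is a candidate if either of
# two chosen compound keys places both records in the same block.
empbk_pairs = (
    candidate_pairs(RECORDS, COMPOUND_KEYS["first|last"])
    | candidate_pairs(RECORDS, COMPOUND_KEYS["last|zip"])
)

single_pairs = candidate_pairs(RECORDS, SINGLE_KEYS["zip"])

print("single key (zip):", len(single_pairs),
      "pairs, RR =", round(reduction_ratio(len(single_pairs), len(RECORDS)), 3))
print("EMPBK sketch:", len(empbk_pairs),
      "pairs, RR =", round(reduction_ratio(len(empbk_pairs), len(RECORDS)), 3))
```

On this toy data the combined key compares fewer pairs than blocking on `zip` alone, which is the effect the abstract describes. The numbers here come from the made-up records and say nothing about the 84% figure reported for the real experiments.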