Feature Based Identification of Web Page Noise through K-Means Clustering
S. S. Bhamare¹, B. V. Pawar²
¹S. S. Bhamare, School of Computer Sciences, Kavayitri Bahinabai Chaudhari North Maharashtra University, Jalgaon, India.
²B. V. Pawar, School of Computer Sciences, Kavayitri Bahinabai Chaudhari North Maharashtra University, Jalgaon, India.
Manuscript received on December 13, 2019. | Revised Manuscript received on December 22, 2019. | Manuscript published on January 10, 2020. | PP: 1966-1970 | Volume-9 Issue-3, January 2020. | Retrieval Number: C9023019320/2020©BEIESP | DOI: 10.35940/ijitee.C9023.019320
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Web pages contain pieces of information of unequal importance. Items such as navigation bars, copyright notices, links, and advertisements are considered noise, that is, insignificant items of a web page for web mining. Only the informative content of a web page is useful for effective web mining, and the presence of noise can degrade the results of such tasks. A web page has several features, including the location of information, the area it occupies, and its content. Content in different portions of a web page carries different significance weights according to these features. The position and importance of content therefore play a vital role in identifying noise in web pages for removal. This paper proposes a web page feature based method for identifying noise in web pages. K-means clustering is applied to group web page content into two clusters, main content and noise, based on these features. To evaluate the clustering method, accuracy, precision, recall, and F-measure are calculated.
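As a rough illustration of the approach summarized in the abstract, the following minimal Python sketch clusters page blocks into two groups with K-means and then computes accuracy, precision, recall, and F-measure. It is not the authors' implementation. The three features (`area_fraction`, `link_density`, `distance_from_center`), the sample values, the labels, the deterministic seeding, and the rule that the cluster with the higher mean link density is treated as noise are all assumptions made for this example.

```python
from math import dist

# Hypothetical block features: (area_fraction, link_density, distance_from_center).
# The values and labels are invented for illustration, not taken from the paper.
blocks = [
    (0.60, 0.05, 0.10),  # main content
    (0.05, 0.90, 0.90),  # noise (e.g. navigation bar)
    (0.55, 0.10, 0.15),  # main content
    (0.08, 0.80, 0.85),  # noise (e.g. advertisement)
    (0.50, 0.08, 0.20),  # main content
    (0.04, 0.95, 0.80),  # noise (e.g. copyright notice)
]
true_labels = [0, 1, 0, 1, 0, 1]  # 1 = noise, 0 = main content


def kmeans(points, k=2, max_iters=100):
    """Plain Lloyd's K-means. Seeding with the first k points is an
    assumption made for reproducibility; random seeding is more usual."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return labels, centroids


def evaluate(true, pred, positive=1):
    """Accuracy, precision, recall, and F-measure for the positive (noise) class."""
    tp = sum(t == positive and p == positive for t, p in zip(true, pred))
    fp = sum(t != positive and p == positive for t, p in zip(true, pred))
    fn = sum(t == positive and p != positive for t, p in zip(true, pred))
    tn = sum(t != positive and p != positive for t, p in zip(true, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(true),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


cluster_ids, centroids = kmeans(blocks, k=2)
# Cluster ids are arbitrary, so map them to classes with a heuristic (assumed here):
# the cluster whose centroid has the higher link density is called noise.
noise_cluster = max(range(2), key=lambda c: centroids[c][1])
predicted = [1 if cid == noise_cluster else 0 for cid in cluster_ids]
metrics = evaluate(true_labels, predicted)
print(predicted, metrics)
```

Because K-means is unsupervised, the cluster indices carry no meaning by themselves; some rule like the link-density heuristic above, or a comparison against labelled pages, is needed before the evaluation measures can be computed.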
Keywords: Noise, Feature Extraction, Clustering, HTML Tag, Tag Weight, Web Pages.
Scope of the Article: Clustering