Email Header Feature Extraction using Adaptive and Collaborative approach for Email Classification
Amandeep Singh Rajput1, J S Sohal2, Vijay Athavale3
1Amandeep Singh Rajput, Department of Computer Science, GGI Khanna, Affiliated to IGK Punjab Technical University, Kapurthala, India.
2J S Sohal, Ludhiana College of Engineering & Technology, Ludhiana, Punjab.
3Vijay Athavale, Academy of Business & Engineering Sciences Engineering College, Ghaziabad, India.
Manuscript received on 04 May 2019 | Revised Manuscript received on 09 May 2019 | Manuscript Published on 13 May 2019 | PP: 158-164 | Volume-8 Issue-7S May 2019 | Retrieval Number: G10320587S19/19©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open-access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: An email header is the footprint of an email and can be used to classify the email as HAM or SPAM. In this research, emails are classified on the basis of header features, which keeps the privacy of the sender's content intact [1]. Header features are email header fields such as sender, to, cc, bcc, and subject. This research tries to improve classification accuracy by extracting a larger number of header features. The email subject is examined in greater depth for objectionable keywords, which are used for rule matching and rule generation. In our study, we implement an adaptive and collaborative approach that uses machine learning and cluster computing for fast classification of emails as SPAM or HAM. The adaptive approach generates new rules for classification, and the cluster approach uses parallel computing power to increase computing speed. New rules are generated only if the features extracted from the email header do not match the existing rules. Spam Assassin [2][3] is the main dataset used for testing. The collaborative approach creates a parallel environment in which multiple anti-spam methods and divided test corpora are used as input. The false positive and false negative percentages are recorded, and the accuracy is calculated. The Weka data mining software is used to apply the anti-spam methods.
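As an illustration of the header-only pipeline described in the abstract, the following minimal Python sketch extracts a few header features, matches the subject against a keyword rule, and computes false positive, false negative, and accuracy percentages. It is not the authors' implementation: the keyword list, the function names, the single-rule classifier, and the choice to express percentages over all test messages are assumptions made for illustration only.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical keyword list; the actual rules are not given in the abstract.
OBJECTIONABLE_KEYWORDS = {"free", "winner", "lottery"}


def extract_header_features(raw_email: str) -> dict:
    """Read only the header fields; the message body is never inspected."""
    msg = message_from_string(raw_email)
    subject = msg.get("Subject", "") or ""
    # Lowercase the subject words and strip common trailing punctuation.
    words = {w.strip(".,!?:;").lower() for w in subject.split()}

    def count(field: str) -> int:
        # Number of addresses listed in a recipient field (0 if absent).
        return len(getaddresses(msg.get_all(field, [])))

    return {
        "sender": msg.get("From", ""),
        "n_to": count("To"),
        "n_cc": count("Cc"),
        "n_bcc": count("Bcc"),
        "subject_keywords": sorted(words & OBJECTIONABLE_KEYWORDS),
    }


def classify(features: dict) -> str:
    """Assumed single rule: any objectionable subject keyword means SPAM."""
    return "SPAM" if features["subject_keywords"] else "HAM"


def evaluate(predicted: list, actual: list) -> dict:
    """Percentages over all test messages (an assumed definition)."""
    pairs = list(zip(predicted, actual))
    fp = sum(p == "SPAM" and a == "HAM" for p, a in pairs)  # ham flagged as spam
    fn = sum(p == "HAM" and a == "SPAM" for p, a in pairs)  # spam that slipped through
    total = len(pairs)
    return {
        "fp_pct": 100 * fp / total,
        "fn_pct": 100 * fn / total,
        "accuracy_pct": 100 * (total - fp - fn) / total,
    }
```

In an adaptive setting, a message whose features match no existing rule would be the trigger for generating a new rule; that step is left out of this sketch because the abstract does not describe how rules are generated.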
Keywords: Classification, Features, Ham, Spam, Machine Learning, Corpora, Parallel Environment, Cross Validation, Weka.
Scope of the Article: Classification