Inverted Indexing for Information Retrieval from Motifs and Domains of Proteins
Kumud Pant1, Bhasker Pant2, Devvret Verma3, Promila Sharma4, Vikas Tripathi5
1Kumud Pant*, Department of Biotechnology, Graphic Era Deemed to be University, Dehradun, India.
2Bhasker Pant, Department of Computer Science & Engineering, Graphic Era Deemed to be University, Dehradun, India.
3Devvret Verma, Department of Biotechnology, Graphic Era Deemed to be University, Dehradun, India.
4Promila Sharma, Department of Biotechnology, Graphic Era Deemed to be University, Dehradun, India.
5Vikas Tripathi, Department of Computer Science & Engineering, Graphic Era Deemed to be University, Dehradun, India.
Manuscript received on December 16, 2019. | Revised Manuscript received on December 22, 2019. | Manuscript published on January 10, 2020. | PP: 63-68 | Volume-9 Issue-3, January 2020. | Retrieval Number: C8044019320/2020©BEIESP | DOI: 10.35940/ijitee.C8044.019320
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Recent advances in technology are generating huge amounts of data, and data accumulation is outpacing the extraction of information from it. Hybrid approaches that combine different algorithms to extract the required information from this stockpile of data are the need of the hour. One such algorithm is the vector space model for inverted indexing, which has traditionally been used for search engine indexing. In bioinformatics it has also been used for the assembly of DNA fragments generated by sequencing, but it has not been applied to retrieving protein sequences relevant to a query based on the presence or absence of motifs and domains. In this paper the concept of inverted indexing is applied to the small motif/domain data of proteins contained in the Motivated Proteins database at http://motif.gla.ac.uk/motif/index.html. The index was built using 17 small hydrogen-bonded motifs present in a dataset of 430 proteins, which was divided into 19 classes. Seven classes, for example cyanovirin, antibiotic and concanavalin, had very few instances (1 or 2) and were omitted from further study. The remaining 12 classes, each with more than 10 proteins, were used to test the information retrieval (IR) strategy. The document vectors of all proteins belonging to a class were averaged, giving 12 averaged-vector queries for testing. The similarity coefficient (SC) was then computed between each query and every protein in the dataset. This approach successfully classified the queries as belonging to the classes from which they were derived. To further validate the document vector as a novel attribute for classification, the document vectors of the entire dataset were clustered into ten (10) clusters, and testing was then performed using the SC between each query and these clusters. The allocation of clusters to the 12 query sequences followed the same pattern as the relevant-document search with the inverted indexing approach, but clustering allocated the queries to only four (4) clusters. The largest share of query proteins (7 of 12, or 58%) was assigned to cluster 5.
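The retrieval strategy summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy proteins, class labels and motif counts are hypothetical, the tf-idf weighting and the use of cosine similarity as the similarity coefficient (SC) are assumptions (the abstract does not state which weighting or SC formula was used), and all function names are illustrative.

```python
import math
from collections import defaultdict

# Hypothetical toy data: protein id -> (class label, {motif name: count}).
# Names and counts are illustrative, not taken from Motivated Proteins.
PROTEINS = {
    "p1": ("classA", {"beta_turn": 4, "asx_turn": 1}),
    "p2": ("classA", {"beta_turn": 3, "st_turn": 1}),
    "p3": ("classB", {"nest": 5, "schellman_loop": 2}),
    "p4": ("classB", {"nest": 4, "beta_turn": 1}),
}

def build_inverted_index(proteins):
    """Map each motif to its postings: {protein id: motif count}."""
    index = defaultdict(dict)
    for pid, (_, motifs) in proteins.items():
        for motif, count in motifs.items():
            index[motif][pid] = count
    return dict(index)

def idf_weights(index, n_docs):
    """Inverse document frequency per motif (assumed weighting scheme)."""
    return {m: math.log10(n_docs / len(post)) for m, post in index.items()}

def document_vectors(index, idf):
    """Sparse tf-idf vector for every protein that appears in the index."""
    vectors = defaultdict(dict)
    for motif, postings in index.items():
        for pid, count in postings.items():
            vectors[pid][motif] = count * idf[motif]
    return dict(vectors)

def class_query(vectors, proteins, label):
    """Average the document vectors of all proteins in one class."""
    members = [pid for pid, (cls, _) in proteins.items() if cls == label]
    query = defaultdict(float)
    for pid in members:
        for motif, weight in vectors.get(pid, {}).items():
            query[motif] += weight / len(members)
    return dict(query)

def norm(vec):
    return math.sqrt(sum(w * w for w in vec.values()))

def rank(query, index, vectors):
    """Score only proteins found via the postings of the query's motifs.

    SC is taken to be cosine similarity here (an assumption).
    """
    dots = defaultdict(float)
    for motif, q_weight in query.items():
        for pid in index.get(motif, {}):
            dots[pid] += q_weight * vectors[pid][motif]
    q_norm = norm(query)
    scores = {pid: dot / (q_norm * norm(vectors[pid]))
              for pid, dot in dots.items() if dot > 0}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    index = build_inverted_index(PROTEINS)
    vectors = document_vectors(index, idf_weights(index, len(PROTEINS)))
    for label in ("classA", "classB"):
        top_pid, top_sc = rank(class_query(vectors, PROTEINS, label), index, vectors)[0]
        print(label, "-> best match", top_pid, "of", PROTEINS[top_pid][0], round(top_sc, 3))
```

Because scoring walks only the postings of motifs that occur in the query, proteins sharing no motif with the query are never touched, which is the main practical benefit of an inverted index over comparing the query against every document vector.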
Keywords: Information Retrieval, Motif Domain, Clustering, Inverted Indexing
Computing Classification System: I.4
Scope of the Article: Classification