Topic Modeling Based Extractive Text Summarization
Kalliath Abdul Rasheed Issam1, Shivam Patel2, Subalalitha C. N.3
1Kalliath Abdul Rasheed Issam*, Department of Computer Science and Engineering, SRM Institute of Science and Technology, Kattankulathur, Chennai, India.
2Shivam Patel, Department of Computer Science and Engineering, SRM Institute of Science and Technology, Kattankulathur, Chennai, India.
3Subalalitha C. N., Department of Computer Science and Engineering, SRM Institute of Science and Technology, Kattankulathur, Chennai, India.
Manuscript received on March 15, 2020. | Revised Manuscript received on March 26, 2020. | Manuscript published on April 10, 2020. | PP: 1710-1719 | Volume-9 Issue-6, April 2020. | Retrieval Number: F4611049620/2020©BEIESP | DOI: 10.35940/ijitee.F4611.049620
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Text summarization identifies the important information in a text document and generates a shorter version of the source that retains only its relevant and salient content. In this paper, we propose a novel method that summarizes a text document by clustering its contents according to latent topics produced with topic modeling techniques and then generating an extractive summary for each identified text cluster. These extractive sub-summaries are then combined to form the summary of the source document. We use the less commonly used and more challenging Wiki How dataset in our approach to text summarization. This dataset differs from the news datasets typically used for text summarization. The well-known news datasets present their most important information in the first few lines of their source texts, which makes them less challenging to summarize than the Wiki How dataset. In contrast, the documents in the Wiki How dataset are written using a generalized approach and have lower abstractedness and a higher compression ratio, thus posing a greater challenge for summary generation. Many current state-of-the-art text summarization techniques tend to eliminate important information present in source documents in favor of brevity, whereas our proposed technique aims to capture all the varied information present in a source document. Although the dataset proved challenging, extensive tests within our experimental setup show that our model produces encouraging ROUGE results and summaries when compared with other published extractive and abstractive text summarization models.
Keywords: Extractive Text Summarization, Latent Dirichlet Allocation, Topic Clustering, Topic Modeling, Wiki How Dataset
Scope of the Article: Clustering
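The pipeline outlined in the abstract (cluster the content by latent topic, extract a sub-summary per cluster, combine the sub-summaries) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: each sentence is treated as one LDA document, LDA is fitted with a small collapsed Gibbs sampler written against the Python standard library, and sentences inside a cluster are ranked by average in-cluster word frequency, a stand-in scorer that the abstract does not specify. The names `tokenize`, `lda_gibbs`, and `summarize`, the stopword list, and all hyperparameters are hypothetical.

```python
import random
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (assumption, not from the paper).
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on", "for",
             "is", "are", "it", "with", "then", "until", "into", "your", "you"}


def tokenize(sentence):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic


def summarize(text, n_topics=2, per_topic=1):
    """Cluster sentences by dominant topic, extract per cluster, combine."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    docs = [tokenize(s) for s in sentences]
    doc_topic = lda_gibbs(docs, n_topics)
    # Assign each non-empty sentence to the topic with the most tokens.
    clusters = defaultdict(list)
    for idx, counts in enumerate(doc_topic):
        if docs[idx]:
            clusters[counts.index(max(counts))].append(idx)
    chosen = []
    for members in clusters.values():
        # Stand-in extractive scorer: mean in-cluster word frequency.
        freq = Counter(w for i in members for w in docs[i])
        ranked = sorted(
            members,
            key=lambda i: sum(freq[w] for w in docs[i]) / len(docs[i]),
            reverse=True,
        )
        chosen.extend(ranked[:per_topic])
    # Combine the sub-summaries in original sentence order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

Calling `summarize(text, n_topics=2, per_topic=1)` on a short how-to passage returns at most one sentence per topic cluster, joined in source order. Because every topic cluster contributes to the output, the sketch reflects the stated aim of covering the varied information in a document rather than only its opening lines.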