A new Connected Component Analysis based System for Text Segmentation in Degraded Historical Document Images
V. Sathya Narayanan1, N. Kasthuri2, T. Dharani3, D. Deepa4
1Mr. V. Sathya Narayanan*, Assistant Professor, Electronics and Communication Engineering Department, Kongu Engineering College, Perundurai, Erode, India.
2Dr. N. Kasthuri, Professor, Electronics and Communication Engineering Department, Kongu Engineering College, Perundurai, Erode, India.
3T. Dharani, Student, Electronics and Communication Engineering Department, Kongu Engineering College, Perundurai, Erode, India.
4D. Deepa, Student, Electronics and Communication Engineering Department, Kongu Engineering College, Perundurai, Erode, India.
Manuscript received on March 15, 2020. | Revised Manuscript received on March 25, 2020. | Manuscript published on April 10, 2020. | PP: 69-75 | Volume-9 Issue-6, April 2020. | Retrieval Number: F3503049620/2020©BEIESP | DOI: 10.35940/ijitee.F3503.049620
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Historical documents contain valuable heritage information. They are preserved in manuscript preservation centers and archaeological departments, but most of them are degraded, which makes their contents hard to read and understand. Text segmentation and feature extraction are therefore needed to convert these manuscripts into a machine-editable format. In this work, we present an effective way to segment historical document images into characters, a task made challenging by complex image backgrounds. The method combines horizontal histograms, vertical histograms, and connected component analysis. The input image is first converted to a grayscale image, the grayscale image is then converted to a binary image using Otsu's method, and all objects containing fewer than a desired number of pixels are removed. Line segmentation and word segmentation are implemented using the horizontal and vertical histogram methods, respectively. The connected components are then labeled, and the properties of the image regions are measured. Connected component analysis is used to segment the characters, and the individual characters are extracted. Simulation results show that the proposed segmentation method achieves an average accuracy of 93.37% on the HDLAC 2011 dataset. Moreover, this method is more efficient and more suitable for real-time tasks.
Keywords: Otsu Method, Horizontal Histogram Method, Vertical Histogram Method, Connected Component Analysis, Bounding Box Segmentation, HDLAC Dataset.
Scope of the Article: Predictive Analysis.
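
The pipeline summarized in the abstract (Otsu binarization, small-object removal, horizontal and vertical histograms, connected component analysis) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`otsu_threshold`, `runs`, `components`, `segment`) and the parameters `min_pixels` and `word_gap` are hypothetical, and the sketch assumes dark text on a light background, 8-connectivity, and a grayscale image stored as a nested list of integers in 0..255.

```python
from collections import deque

def otsu_threshold(gray):
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var, w0, sum0 = 0, -1.0, 0, 0
    for t in range(256):
        w0 += hist[t]                 # pixels with value <= t
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0               # pixels with value > t
        if w1 == 0:
            break
        var = w0 * w1 * (sum0 / w0 - (sum_all - sum0) / w1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def runs(profile, min_gap=1):
    """Return (start, end) spans of non-zero bins; gaps shorter than min_gap are merged."""
    spans = []
    for i, v in enumerate(profile):
        if not v:
            continue
        if spans and i - spans[-1][1] < min_gap:
            spans[-1][1] = i + 1      # extend the current span across a small gap
        else:
            spans.append([i, i + 1])
    return [tuple(s) for s in spans]

def components(binary):
    """Return the pixel lists of all 8-connected foreground regions."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if not binary[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:              # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny in (y - 1, y, y + 1):
                    for nx in (x - 1, x, x + 1):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            regions.append(pixels)
    return regions

def bbox(pixels):
    """Bounding box (top, left, bottom, right) with exclusive bottom/right."""
    ys, xs = zip(*pixels)
    return (min(ys), min(xs), max(ys) + 1, max(xs) + 1)

def segment(gray, min_pixels=2, word_gap=3):
    """Return character boxes grouped as lines -> words -> characters."""
    thresh = otsu_threshold(gray)
    binary = [[1 if v <= thresh else 0 for v in row] for row in gray]
    for region in components(binary):          # remove small objects (noise)
        if len(region) < min_pixels:
            for y, x in region:
                binary[y][x] = 0
    lines = []
    for top, bottom in runs([sum(row) for row in binary]):   # horizontal histogram
        line = binary[top:bottom]
        col_profile = [sum(col) for col in zip(*line)]       # vertical histogram
        words = []
        for left, right in runs(col_profile, min_gap=word_gap):
            word = [row[left:right] for row in line]
            boxes = sorted((bbox(p) for p in components(word)), key=lambda b: b[1])
            words.append([(top + t0, left + l0, top + b0, left + r0)
                          for t0, l0, b0, r0 in boxes])
        lines.append(words)
    return lines
```

Calling `segment(gray)` returns a nested list in which each line holds its words and each word holds the bounding boxes of its characters in left-to-right order. In this sketch, `word_gap` is the number of empty columns that separates two words, and `min_pixels` is the smallest region kept as text. Both would need tuning for real scans, where an array library would also be far faster than nested lists.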