Extract hindi Text from a PDF file

Question

I am working on a task to extract some information (in HINDI) from a pdf file and convert it into a data frame.

I have tried many things and followed many articles, and answers on stack overflow as well. I tried different libraries like easy OCR, paddle OCR, and others but was unable to get the correct output.

Here is the link for the document. Link

Things I have tried:

It seems that I am unable to get the exact contours to create the bounding box. Below you can see the image of the output I am getting.

All I need is to convert this information to a data frame where the columns would be:- नाम: पति का नाम / पिता का नाम: मकान संख्याः an so on.

Below is the code I am using to get data:-

import cv2
import pytesseract
import numpy as np
from pytesseract import Output
image = cv2.imread('pages_new/page3.jpg')
img = image.copy()
mask = np.zeros(image.shape, dtype=np.uint8)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
# Filter for ROI using contour area and aspect ratio
countour = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
countour = countour[0] if len(countour) == 2 else countour[1]
for c in countour:
    area = cv2.contourArea(c)
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.05 * peri, True)
    x,y,w,h = cv2.boundingRect(approx)
    aspect_ratio = w / float(h)
    if area > 10000 and aspect_ratio > .5:
        mask[y:y+h, x:x+w] = image[y:y+h, x:x+w]
h, w, c = img.shape
boxes = pytesseract.image_to_boxes(img) 
for b in boxes.splitlines():
    b = b.split(' ')
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)
d = pytesseract.image_to_data(img, output_type=Output.DICT)
n_boxes = len(d['text'])
for i in range(n_boxes):
    if int(d['conf'][i]) > 60:
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        img2 = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
# cv2.imshow('img', img)
# cv2.imshow('img2', img2)
# Perfrom OCR with Pytesseract
data = pytesseract.image_to_string(mask, lang='Devanagari', config='--psm 6')
print(data)
# cv2.imshow('thresh', thresh)
# cv2.imshow('mask', mask)

Also, Could anyone please confirm if the information on the page changes, Do we have to write a different code for all the documents or we can get a generic script for all the docs?

This question is too broad and confusing. Do you want to "extract some information" or "get the exact contours" or "convert this information to a data frame" or do you want to know if "can get a generic script for all the docs"? Please identify the problem that you have, before asking a question. — Thomas Weller, Apr 14 '23 at 08:10
My main motive here is to convert the information into a data frame, but as you can see every page is having different information I wanted to know if we can create a generic script for all pages or if a single code like the above can do the work. I want the output into a dataframe as I shown in my last question. — Lalit Joshi, Apr 14 '23 at 08:13
Okay, and have you solved this yourself as in https://stackoverflow.com/q/76012665/480982 ? If so, you should write an answer to your own question. — Thomas Weller, Apr 14 '23 at 08:15

Vikram Baliga · Answer 1 · 2023-04-17T19:20:52.640

I'm also working on the same concept, extracting voters' info from the voters list downloaded from the CEO website. I'm doing for Karnataka, in Kanada. I'm using freely available tools online to convert the PDF to image. Then using Matlab to split the image of one full page into each voter's box.

I did so because none of the available converters perform well with full page with so many bounding boxes. Also, the most important data to extract, is the voter's number in the list. This is not clear enough at least in our PDFs to be extracted as English digits.

So, I calculate the number based on the files row and column. For the voters in separate files, I manually enter the data.

So, the concept is to:

Manually convert one pdf file into multiple images manually using any freely available website.
Convert the images black and white so that you can split into boxes very easily without any computer vision tools. (Based on lines).
Save each of the extracted box as image file for further reference or send it to OCR library.
Epic number, name, house number and father, guardian name can be extracted easily. But the main number required (in my case, not sure about yours), the number in the list, needs to be calculated and manually verified as it cannot be extracted by any of the available libraries.

Here is the library, Tesseract that I'm using to convert Kanada text in the image. This is opensource library and can extract Hindi too.

So instead of re-inventing the wheel, you can do some manual processing and then use this tool which will give better text extraction which will save a lot of time.

I will soon upload my code files into a pubic github repo and share it here. Using the files, you can convert pages to images of each user. This you can use as per Ur requirements. The code is in Matlab.

https://github.com/tesseract-ocr/tesseract

[PS: This may, for Karnataka elections, I'll be publishing my android/iOS apps in the stores. The apps will have this voters list in English. Easily searchable using names, house numbers or the voters' EPIC numbers. For last Karnataka assembly elections, I had published only android app in google play store. It got removed due to privacy policy issue. Need to clearly mention that the info is available in freely public domain.]

Thanks for the answer Vikram, do share the code once you are done. I have also found a solution for this, check if this helps. — Lalit Joshi, Apr 18 '23 at 18:18

score 0 · Accepted Answer · answered Apr 18 '23 at 18:17

I am able to get the data using the below code:-

import cv2
import numpy as np
import pdf2image
import pytesseract

# Extract page 3 from PDF in proper quality
page_3 = np.array(pdf2image.convert_from_path('ROLL_Download.aspx.pdf',
                                              first_page=3, last_page=3,
                                              dpi=300, grayscale=True)[0])

# Inverse binarize for contour finding
thr = cv2.threshold(page_3, 128, 255, cv2.THRESH_BINARY_INV)[1]

cnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
cnts_tables = [cnt for cnt in cnts if cv2.contourArea(cnt) > 10000]
no_tables = cv2.drawContours(thr.copy(), cnts_tables, -1, 0, cv2.FILLED)
data = []
rects = sorted([cv2.boundingRect(cnt) for cnt in cnts_tables], key=lambda r: (r[1], r[0]))
for i_r, (x, y, w, h) in enumerate(rects, start=1):
    
    cnts = cv2.findContours(page_3[y+1:y+h-1, x+1:x+w-1], cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]    
    
    inner_rects = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda r: (r[1], r[0]))
    
    print('\nExtract texts inside table {}\n'.format(i_r))
    for (xx, yy, ww, hh) in inner_rects:
        # Set current coordinates w.r.t. full image
        xx += x
        yy += y

        # Get current cell
        cell = page_3[yy+2:yy+hh-2, xx+2:xx+ww-2]
            
        # Floodfill rectangles around numbers
        ys, xs = np.min(np.argwhere(cell == 0), axis=0)
        print("The value for xs:{} and ys:{}".format(xs, ys))
        temp = cv2.floodFill(cell.copy(), None, (xs, ys), 255)[1]
        mask = cv2.morphologyEx(thr[yy+2:yy+hh-2, xx+2:xx+ww-2].copy(), cv2.MORPH_DILATE, np.full((ww, hh),255))

cnts = cv2.findContours(mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
        cnts = cnts[0] if len(cnts) == 2 else cnts[1]
        boxes = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda b: b[0])

        # Extract texts from each part of the current cell
        for x_b, y_b, w_b, h_b in boxes:
#             print("The value for i_b is:",i_b)

            text = pytesseract.image_to_string(
                temp[y_b:y_b+h_b, x_b:x_b+w_b],
                config='--psm 6',
                lang='Devanagari')
            #text = text.replace('\f', '')
            print('x: {}, y: {}, text:\n{}'.format(xx, yy, text))
            data.append(text)

Below is the output:-

Extract texts inside table 1
    
    The value for xs:13 and ys:0
    x: 103, y: 209, text:
    1 WEZ1761006
    नाम : भीमसेन
    पिता का नाम : बच्चू सिंह
    मकान संख्या: देव नगर Photo is
    आयु : 33 लिंग : पुरुष Available
    
    The value for xs:13 and ys:0
    x: 857, y: 209, text:
    2 WEZ1391713
    नाम : पूजा कुमारी
    पिता का नाम : विपिन सोनी
    मकान संख्याः वार्ड नं1 Photo is
    आयु : 23 लिग : स्त्री Available
    
    The value for xs:13 and ys:0
    x: 1610, y: 209, text:
    3 WEZ1781897
    नाम : सोनू
    पति का नाम : राजू
    मकान संख्याः वार्ड नं2 Photo is
    आयु : 3 लिग : स्त्री Available
    
    
    Extract texts inside table 2
    
    The value for xs:13 and ys:41
    x: 103, y: 507, text:
    #174 WEZ1735174
    नाम : रागिणी कुमारी कामत
    पिता का नाम : संतोष कामत
    मकान संख्याः 31 Photo is
    आयु : 19 लिग : स्त्री Available
    
    The value for xs:13 and ys:41
    x: 857, y: 507, text:
    5 WEZ1766005
    नाम : पर्तीक सिंग चिडे
    माता का नाम : कुलविंदर कौर
    मकान संख्याः देव नगर ,वार्ड नं. 2 Photo is
    आयु : 20 लिग : पुरुष Available
    
    The value for xs:13 and ys:41
    x: 1610, y: 507, text:
    [|] WEZ1755230
    नाम : रीता देवी
    पति का नाम : प्रेम यादव
    मकान संख्या: हाऊस नं. 05 Photo is
    आयु : ॐ लिग : स्त्री Available
    
    
    Extract texts inside table 3
    
    The value for xs:13 and ys:0
    x: 103, y: 807, text:
    7 WEZ1758721
    नाम : विश्व जीत वर्मा
    पिता का नाम : राम चद्र
    मकान संख्या: हाऊस नं. 10, वार्ड नं. 2 Photo is
    आयु : 25 लिंग : पुरुष Available
    
    The value for xs:13 and ys:0
    x: 857, y: 807, text:
    । | WEZ1758739
    नाम : हिम्मत वर्मा
    पिता का नाम : राम चद्र
    मकान संख्या: हाऊस नं. 10, वार्ड नं. 2 Photo is
    आयु : 23 लिंग : पुरुष Available
    
    The value for xs:13 and ys:0
    x: 1610, y: 807, text:
    [१ WEZ1427087
    नाम : सोनू यादव
    पिता का नाम : ददन यादव
    मकान संख्या: हाऊस नं. 228 Photo is
    आयु : 23 लिंग : पुरुष Available

Extract hindi Text from a PDF file

2 Answers2

Linked