Extracting text written in Hindi from a PDF in Python

Question

I want to extract text typed in Hindi from a PDF document. I've attached an image of the sample page I am dealing with.

I've tried using pdfminer to get the text from it, but the text comes out garbled (maybe because of the Hindi fonts).

Now I am thinking of splitting the page into three parts, then splitting each part into two (separating the English and Hindi text), and running OCR on each half to get the text. The only issue is that I don't know the font used for the Hindi text, so I might get garbled text again.
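The splitting idea above could be sketched roughly as follows. This is only an illustration under assumptions, not a tested solution for this PDF: it assumes the page is laid out as three equal vertical columns, each with a left and a right half, and that the third-party `pytesseract` package plus Tesseract's Hindi data (`hin`) are installed. The names `column_halves` and `ocr_page` are made up for this example. OCR works on the rendered glyph shapes, so it generally needs a language model for the script rather than the font name.

```python
def column_halves(width, height, columns=3):
    """Return crop boxes (left, upper, right, lower) for the left and
    right half of each equal-width column, ordered left to right."""
    boxes = []
    col_w = width / columns
    for i in range(columns):
        left = i * col_w
        mid = left + col_w / 2
        right = left + col_w
        boxes.append((round(left), 0, round(mid), height))
        boxes.append((round(mid), 0, round(right), height))
    return boxes


def ocr_page(image, lang="hin+eng"):
    """Run Tesseract on every half-column of a PIL image of one page.

    Assumption: `image` is a PIL.Image of the rendered page and the
    `hin` and `eng` Tesseract language data are available.
    """
    import pytesseract  # third-party; imported lazily so the helper above works without it

    texts = []
    for box in column_halves(*image.size):
        texts.append(pytesseract.image_to_string(image.crop(box), lang=lang))
    return texts
```

To get page images in the first place, something like `pdf2image.convert_from_path(path, dpi=300)` could be used (another third-party package, which needs Poppler installed). The equal-width split is a guess and would need adjusting to the real page layout.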

My questions are: Is there a better way to deal with Hindi fonts? How can I find the font name?

[Original Pdf](https://drive.google.com/file/d/0B6HtqTuwelWJeEtHb1F6Ty1YMlk/view?usp=sharing) — Gaurav Shukla, Mar 10 '16 at 15:08
@Gaurav: Did you get a chance to find a solution to the above question? — Niks Jain, Nov 22 '17 at 11:54
Font names can be extracted with `pdfminer` on Python 2 or `pdfminer.six` on Python 3, as shown here: [extracting font name](https://stackoverflow.com/questions/34606382/pdfminer-extract-text-with-its-font-information) — aspiring1, Sep 04 '19 at 05:05

score 1 · Answer 1 · answered Mar 10 '16 at 16:42

I tried the following on your PDF and it appears to extract a lot of the text. The layout may not be ideal, but I am not able to judge that.

# Python 3 version; requires the pdfminer.six package.
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Return the text of every page of the PDF at `path` as one string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    # TextConverter writes the text of each processed page into retstr.
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    with open(path, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        caching = True
        pagenos = set()  # an empty set selects every page

        for page in PDFPage.get_pages(fp, pagenos, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)

        text = retstr.getvalue()

    device.close()
    retstr.close()
    return text


print(convert_pdf_to_txt("Electoral roll - Faizabad.pdf"))

The output is UTF-8 text, so you must make sure your console is capable of displaying it.

For example:

भभग ससखखभककल मतदभतभ 11 1.रजजरभ आसशशकपपथममक ववददपलद रजजरप - सपमपनद779 420 359 0 779ननरभरचक नभमभरलल 2014 0S24उततर पददशवरधभन सभभ कदत कक ससखखभ ,नभम र आरकण सससनत:ललक सभभ कदत कक ससखखभ ,नभम र आरकण सससनत: 1 . पकनरलकण कभ वरररणपकनरलकण कभ ररर : 2014अहतभर कक नतथस: 01.01.2014पकनरलकण कभ सररप: ससककपत पकनरलकणपकभशन कक नतथस: 01.10.2013पकनरमकदण कक नतथस : 15.03.2014

To determine the list of fonts that it is using, you can simply load the PDF into a PDF reader such as Adobe Reader or Foxit Reader and select Properties from the File menu. From here you should be able to select Fonts. When I tried this with Foxit Reader it displayed the following fonts:

Mangal-Bold
Arial
Mangal
Arial Bold
Times-New-Roman-Bold
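The font names can probably also be read from Python instead of a PDF reader. Below is a minimal sketch, assuming the third-party `pdfminer.six` package on Python 3. The helper names `strip_subset_tag`, `iter_chars` and `fonts_in_pdf` are made up for this example. It walks the layout tree and counts the `fontname` attribute of every `LTChar`; embedded subset fonts usually carry a six-letter tag such as `ABCDEF+Mangal`, which the sketch strips.

```python
import re
from collections import Counter

# Subset fonts are named like "ABCDEF+Mangal": six capitals, then "+".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(fontname):
    """Drop a leading subset tag, if any, from a PDF font name."""
    return _SUBSET_TAG.sub("", fontname)


def iter_chars(layout_obj):
    """Yield every LTChar found below a pdfminer layout object."""
    from pdfminer.layout import LTChar, LTContainer  # lazy third-party import

    if isinstance(layout_obj, LTChar):
        yield layout_obj
    elif isinstance(layout_obj, LTContainer):
        for child in layout_obj:
            yield from iter_chars(child)


def fonts_in_pdf(path):
    """Count characters per font name across all pages of the PDF."""
    from pdfminer.high_level import extract_pages  # lazy third-party import

    counts = Counter()
    for page_layout in extract_pages(path):
        for ch in iter_chars(page_layout):
            counts[strip_subset_tag(ch.fontname)] += 1
    return counts
```

Calling `fonts_in_pdf("Electoral roll - Faizabad.pdf").most_common()` should then give the font names with the number of characters set in each. I have not verified that it reports exactly the same list as Foxit Reader for this file.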

I also tried this, but the extracted text is not correct, not even 60-70% of it. — Gaurav Shukla, Mar 10 '16 at 16:45
I'm facing the same issue. Is there any other relevant solution for this? — Niks Jain, Nov 22 '17 at 11:55
