Parsing cp1251 pdf to text in python

Question

Is there any way to extract text from a PDF file containing Russian text (cp1251)?

I am using the pdfminer package to parse PDF files. I tried passing the encoding as an argument to the pdfminer.converter.TextConverter class, but it didn't help.

It's not clear what you want to do once you have the text. Do you want to parse it with Python? – Richard Aug 26 '15 at 14:26
I want to extract all the text (that can be extracted) from the PDF and then analyse it using the nltk package. – Anatoliy Makhort Aug 26 '15 at 15:02

score -1 · Accepted Answer · edited May 23 '17 at 11:43

If you want to parse the text further once you have extracted it from the PDF file, you will need Python anyway. So first extract the text without converting it, and save it to a txt file.

You can use pdf2txt for this (on Ubuntu: http://manpages.ubuntu.com/manpages/precise/man1/pdf2txt.1.html)

Then open the file with Python and convert the text from cp1251 to UTF-8; the accepted answer here shows how:

How to convert a string from CP-1251 to UTF-8?

Then parse...
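
The conversion step described above might look roughly like this. It is only a sketch using the standard library; the function name and file paths are hypothetical, and it assumes the extracted text file really is encoded as cp1251.

```python
from pathlib import Path


def convert_cp1251_to_utf8(src_path, dst_path):
    """Re-encode a cp1251 text file as UTF-8 and return the decoded text."""
    raw = Path(src_path).read_bytes()        # bytes as written by the extractor
    text = raw.decode("cp1251")              # assumption: the file is cp1251
    Path(dst_path).write_bytes(text.encode("utf-8"))
    return text
```

The returned string is ordinary Unicode text, so it can be passed on to further parsing, for example `convert_cp1251_to_utf8("out.txt", "out_utf8.txt")`.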

answered Aug 26 '15 at 14:35

Richard


Thanks for your answer, but is there any way to extract the text without using external executables like pdf2txt, using only a Python package? – Anatoliy Makhort Aug 26 '15 at 15:05
I tried using pdftotext from the command line, but it doesn't work properly for a PDF with Russian text (it extracts ONLY English words and special symbols, both ASCII). – Anatoliy Makhort Aug 26 '15 at 15:32
Is there a way you can attach the PDF file so we can play with it? Or a link to it... – Richard Aug 26 '15 at 15:37
I ran pdf2txt on a Russian PDF and got a bunch of errors. I will check whether it is pdfminer that fails or pdf2txt, which might not support Unicode... If that is the case, I will try to fix it. – Richard Aug 26 '15 at 16:25
I found that pdf2txt.exe (a GUI program for Windows, Ver1.3, homepage: http://www.pdf2txt.com/), launched from Python using the subprocess.call function, converted the test Russian PDF correctly (and also a test English PDF). – Anatoliy Makhort Aug 26 '15 at 16:34
This may also help with PDFMiner: http://stackoverflow.com/questions/6870214/python-special-characters-giving-me-problems-from-pdfminer – Richard Aug 26 '15 at 17:14
pdf2txt -c 'cp1251' -o out.txt cl-develop-scalable-bluemix-app-pdf.pdf with this file: https://www.ibm.com/developerworks/ru/library/cl-develop-scalable-bluemix-app/cl-develop-scalable-bluemix-app-pdf.pdf seems to work; you then have the text in cp1251 in the output file... Here is the source of pdfminer's pdf2txt: https://github.com/euske/pdfminer/blob/master/tools/pdf2txt.py – Richard Aug 26 '15 at 17:41
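
The pdf2txt invocation from the last comment could be driven from Python along these lines. This is a minimal sketch: the helper names and the `executable` default are assumptions (the script may be installed as `pdf2txt` or `pdf2txt.py` depending on the system), and only the `-c` and `-o` options quoted in that comment are used.

```python
import subprocess


def build_pdf2txt_command(pdf_path, out_path, codec="cp1251", executable="pdf2txt"):
    """Build the argument list for the pdf2txt call quoted in the comment."""
    return [executable, "-c", codec, "-o", out_path, pdf_path]


def extract_text(pdf_path, out_path, codec="cp1251", executable="pdf2txt"):
    """Run pdf2txt, then read the output file back using the same codec."""
    command = build_pdf2txt_command(pdf_path, out_path, codec, executable)
    # check=True raises CalledProcessError if pdf2txt exits with a non-zero status
    subprocess.run(command, check=True)
    with open(out_path, encoding=codec) as handle:
        return handle.read()
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.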
