How to extract Cyrillic text from a PDF?

Question

I get an error when I try to read Cyrillic data from a PDF:

import codecs
pdfFileObj = codecs.open('1.pdf', 'rb','utf-8')

The error is

'utf8' codec can't decode byte 0x9c in position 1: invalid start byte

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

A PDF is not a text file

A PDF is not Unicode text. It is a binary format full of streams holding text, images and so on, so it cannot be opened with a text codec such as UTF-8.
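The error from the question can be reproduced without a real PDF. The sketch below is only an illustration: `looks_like_pdf` and the sample bytes are made up for this example and are not part of any library. It opens the file in binary mode, checks for the `%PDF-` header, and shows that the same bytes fail to decode as UTF-8.

```python
import os
import tempfile

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if the file starts with the PDF header bytes."""
    # Binary mode: the bytes are read as-is, no text codec is involved.
    with open(path, "rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC


# Stand-in for a real PDF: the header followed by arbitrary binary data.
sample = b"%PDF-1.4\n\x9c\xff\x00binary stream"
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

try:
    is_pdf = looks_like_pdf(sample_path)
finally:
    os.remove(sample_path)

# The same bytes are not valid UTF-8, which is what the question ran into.
try:
    sample.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(is_pdf, decodes_as_utf8)
```

Reading the raw bytes always works; turning them into text is the job of a PDF library, not of a codec.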

Take a look at PyPDF2. To get the text from the first page:

from PyPDF2 import PdfFileReader  # PyPDF2 < 3.0 API

pdf = PdfFileReader(open('/tmp/russian.pdf', 'rb'))
text = pdf.getPage(0).extractText()

If the Cyrillic comes out as garbled Latin-1 characters, you might also need to re-decode the result as windows-1251:

text.encode('latin').decode('windows-1251')
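To see what that one-liner does, here is a small self-contained sketch. It assumes the garbling is exactly "windows-1251 bytes misread as Latin-1"; `fix_cp1251_mojibake` is a hypothetical helper name, not something PyPDF2 provides.

```python
def fix_cp1251_mojibake(text):
    """Re-decode text whose windows-1251 bytes were read as Latin-1.

    Hypothetical helper: returns the input unchanged when it cannot be
    round-tripped, e.g. when it already contains real Cyrillic characters.
    """
    try:
        return text.encode("latin-1").decode("windows-1251")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "Привет"
# Simulate the damage: windows-1251 bytes misread as Latin-1.
garbled = original.encode("windows-1251").decode("latin-1")
repaired = fix_cp1251_mojibake(garbled)
# Text that is already proper Cyrillic is left alone.
already_ok = fix_cp1251_mojibake(original)

print(garbled, repaired, already_ok)
```

Only apply this when the output is known to be garbled: legitimate accented Latin text would be silently turned into Cyrillic letters by the same round trip.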

edited Jun 20 '20 at 09:12

Community

answered Oct 05 '17 at 13:40

pacholik

Thanks, I tried that, but it still doesn't output Cyrillic. – Leskhan Karatayev Oct 06 '17 at 15:20
@ЛесханКаратаев You'd have to show me that pdf. And please speak English. – pacholik Oct 06 '17 at 20:33
Can you give me your e-mail? – Leskhan Karatayev Oct 07 '17 at 00:54
@ЛесханКаратаев No… – pacholik Oct 07 '17 at 11:59
Guys, any updates on this? I'm not sure it's able to "read" Cyrillic at all. `len(pdf.getPage(0).extractText())` — 0 symbols. – Kirby Sep 05 '21 at 17:05

score 0 · Answer 2 · answered Dec 28 '21 at 00:09

Here is a solution using pdfminer.six, which supports Cyrillic characters:

from pdfminer import high_level

with open('file.pdf', 'rb') as f:
    text = high_level.extract_text(f)
    print(text)
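Whichever extractor is used, one quick way to tell real Cyrillic output from an empty or garbled result is to look for characters in the Unicode Cyrillic block (U+0400 to U+04FF). A minimal sketch: `has_cyrillic` is a name made up here, and the strings stand in for extractor output.

```python
def has_cyrillic(text):
    """Return True if text has at least one character in U+0400..U+04FF."""
    return any("\u0400" <= ch <= "\u04ff" for ch in text)


# Stand-ins for what an extractor might return.
samples = {
    "real Cyrillic": "Привет",
    "Latin-1 mojibake": "Ïðèâåò",
    "empty result": "",
}
results = {label: has_cyrillic(value) for label, value in samples.items()}

for label, found in results.items():
    print(label, found)
```

An empty result usually points at the PDF or the extractor rather than at the encoding, so this check helps decide which of the two problems to chase.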

answered Dec 28 '21 at 00:09

Rugnar