I get an error when I try to read Cyrillic data from a PDF file:
import codecs
pdfFileObj = codecs.open('1.pdf', 'rb','utf-8')
The error is
'utf8' codec can't decode byte 0x9c in position 1: invalid start byte
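A PDF is not a Unicode text file. It is a binary container full of streams (text, images and so on), so opening it with a UTF-8 decoder fails on the first byte that is not valid UTF-8.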
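To illustrate the point, here is a minimal sketch that reads a file as raw bytes instead of decoding it. The helper names `looks_like_pdf` and `can_decode_as_utf8` and the sample bytes are made up for this example; they are not part of any library.

```python
import os
import tempfile

PDF_MAGIC = b"%PDF-"  # every PDF file starts with these bytes


def looks_like_pdf(path):
    """Return True if the file starts with the PDF header (read in binary mode)."""
    with open(path, "rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC


def can_decode_as_utf8(data):
    """Return True if the bytes are valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample: a PDF header followed by a few arbitrary binary bytes.
sample = b"%PDF-1.4\nx\x9c\xe2\xe3\xcf\n"

with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(sample)

print(looks_like_pdf(tmp.name))    # the header check works on raw bytes
print(can_decode_as_utf8(sample))  # the binary part is not valid UTF-8
os.remove(tmp.name)
```

The file should always be opened with plain `open(path, 'rb')` and handed to a PDF library, which takes care of decoding the text inside the streams.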
Take a look at PyPDF2. To get the text from the first page, do:
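from PyPDF2 import PdfFileReader
# Note: PyPDF2 3.x renamed these to PdfReader, .pages[0] and .extract_text()
pdf = PdfFileReader(open('/tmp/russian.pdf', 'rb'))
text = pdf.getPage(0).extractText()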
If the extracted text comes out as garbled Latin characters, you might also need to reinterpret it as windows-1251:
text = text.encode('latin-1').decode('windows-1251')
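The re-encoding trick above can be wrapped in a small helper that leaves the text alone when the conversion does not apply. This is only a sketch: the function name `fix_cp1251_mojibake` is made up here, and it assumes the garbling really is windows-1251 bytes that were read as Latin-1.

```python
def fix_cp1251_mojibake(text):
    """Reinterpret Latin-1-looking text as windows-1251.

    Returns the input unchanged if it cannot be round-tripped, for example
    when it already contains real Cyrillic characters.
    """
    try:
        return text.encode('latin-1').decode('windows-1251')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Simulate the problem: Cyrillic bytes in windows-1251 misread as Latin-1.
garbled = 'Привет'.encode('windows-1251').decode('latin-1')
print(fix_cp1251_mojibake(garbled))   # the Cyrillic text is restored
print(fix_cp1251_mojibake('Привет'))  # already correct text passes through
```

Falling back to the original string means the helper can be applied to every page without first checking whether that page was affected.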
Here is a solution using pdfminer.six, which supports Cyrillic characters:
from pdfminer import high_level
with open('file.pdf', 'rb') as f:
    text = high_level.extract_text(f)
print(text)