1

I get an error when I try to read Cyrillic text from a PDF file:

import codecs
pdfFileObj = codecs.open('1.pdf', 'rb','utf-8')

The error is

'utf8' codec can't decode byte 0x9c in position 1: invalid start byte
pacholik

2 Answers

1

PDF is not a text file

A PDF is not Unicode text. It is full of binary streams that hold text, images, and so on, so decoding its raw bytes as UTF-8 fails.
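To illustrate why the raw bytes cannot be decoded, here is a minimal sketch using only the standard library. It assumes the PDF stores its page content in a Flate (zlib) compressed stream, which is common but not confirmed for the file in the question; `compressed` and `try_decode_utf8` are hypothetical names used only for this illustration.

```python
import zlib

# Assumption: a stand-in for a Flate-compressed PDF content stream.
# The Cyrillic text is valid UTF-8 before compression, but not after.
compressed = zlib.compress("Привет".encode("utf-8"))


def try_decode_utf8(data):
    """Return the decoded text, or the UnicodeDecodeError if decoding fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc


result = try_decode_utf8(compressed)
print(result)
```

With a typical zlib build the compressed data starts with the bytes 0x78 0x9c, so the failure is the same kind of "invalid start byte" error as in the question.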

Use a PDF library

Take a look at PyPDF2. To get the text from the first page:

from PyPDF2 import PdfFileReader  # PyPDF2 < 3.0; newer releases use PdfReader

pdf = PdfFileReader(open('/tmp/russian.pdf', 'rb'))
text = pdf.getPage(0).extractText()

You might also need to re-decode the result as windows-1251:

text = text.encode('latin').decode('windows-1251')
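The re-decoding step can be checked without a PDF. The sketch below assumes the extracted text came out garbled because windows-1251 bytes were interpreted as Latin-1; whether that applies to a given file depends on its fonts and encodings. `fix_cp1251_mojibake` is a hypothetical helper name.

```python
original = "Привет, мир"

# Simulate the mis-decoding: windows-1251 bytes read as if they were Latin-1.
garbled = original.encode("windows-1251").decode("latin-1")


def fix_cp1251_mojibake(text):
    """Turn the characters back into bytes, then decode them as windows-1251."""
    return text.encode("latin-1").decode("windows-1251")


repaired = fix_cp1251_mojibake(garbled)
print(repr(garbled))
print(repaired)
```

If the extracted text contains characters outside Latin-1, the `encode` call raises `UnicodeEncodeError`, which is a hint that the text was not garbled in this particular way.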
pacholik
0

Here is a solution using pdfminer.six, which supports Cyrillic characters:

from pdfminer import high_level

with open('file.pdf', 'rb') as f:
    text = high_level.extract_text(f)
    print(text)

See also https://stackoverflow.com/a/70501572/3367753

Rugnar