Extracting text from a PDF file using Python 2.7 on Windows 7

Question

I have been using PyPDF2 with Python 2.7 to extract the text from this PDF file (generated with pdfTeX-1.40.0). It works fine, but now I have to extract text from the same PDF generated with LibreOffice 4.3, and the result is this (excerpt):

˜ ! ˜"!#$  %
˘ˇˆ˙˝
ˇ
˝%&˘
%'%
˛˚˛˜ !
"#$#"%$&
'##()˛˚˛
˛˚˛˜  !"#$#"%$%
*+!

This is my code:

    import PyPDF2

    pdfFileObj = open(filePath, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageText = ""
    for pageID in range(0, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageID)
        pageText = pageText + "\n" + str(pageObj.extractText().encode('utf-8'))
    extInfo = ""
    for line in pageText:
        extInfo = extInfo + line
    pdfFileObj.close()

    if string2search.replace(' ','') in extInfo:
        stringPresent = True
    else:
        stringPresent = False

Is there a simple working solution for a Windows machine? I found this topic about the problem, but it has no solution. I have also tried PDFMiner as suggested in this topic, but I get this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
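
For context: in Python 2 this error generally appears when a unicode object is converted to a byte string implicitly (for example via str(), or when writing to a file or a redirected console), because the implicit conversion uses the ASCII codec. The following is only a minimal sketch of that behaviour, not the PDFMiner code path itself; the sample string and the helper name to_utf8 are assumptions made for illustration.

```python
# Stand-in for text returned by a PDF extractor (assumption): contains u'\xe9'.
text = u"caf\xe9"

def to_utf8(value):
    """Encode unicode text to UTF-8 bytes explicitly (hypothetical helper)."""
    return value.encode("utf-8")

# Encoding with the ASCII codec is what the implicit conversion amounts to,
# and it fails for any character outside the 0-127 range.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly with UTF-8 succeeds for the same text.
encoded = to_utf8(text)
```

Encoding explicitly at the point where the text leaves the program (file write, print, subprocess pipe) is usually enough to avoid the implicit ASCII step.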

score 0 · Answer 1 · edited Nov 19 '19 at 14:37

I believe your problem is the encoding used when the file is opened for reading:

    pdfFileObj = open(filePath, 'rb', encoding="utf-8")
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageText = ""
    for pageID in range(0, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageID)
        pageText = pageText + "\n" + str(pageObj.extractText().encode('utf-8'))
    for line in pageText:
        extInfo = extInfo + line
    pdfFileObj.close()

    if string2search.replace(' ','') in extInfo:
        stringPresent = True
    else:
        stringPresent = False

edited Nov 19 '19 at 14:37 by zamir
answered Jan 05 '18 at 14:00 by Noohone

I have tried it, but I get this error: "TypeError: 'encoding' is an invalid keyword argument for this function" – Budlog Jan 05 '18 at 14:22
Try "r" instead of "rb". – Noohone Jan 05 '18 at 15:12
Error: "PyPDF2.utils.PdfReadError: EOF marker not found" – Budlog Jan 06 '18 at 10:31

score 0 · Accepted Answer · answered Jan 08 '18 at 10:00

I have finally found a solution for this.

1.- Download the Xpdf tools for Windows.

2.- Copy pdftotext.exe from xpdf-tools-win-4.00\bin32 to C:\Windows\System32 and also to C:\Windows\SysWOW64.

3.- Use this code:

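    import subprocess

    extInfo = ""  # stays empty if pdftotext fails
    try:
        # '-' as the output file makes pdftotext write the text to stdout;
        # passing a list avoids shell quoting problems with spaces in filePath
        extInfo = subprocess.check_output(['pdftotext.exe', filePath, '-'], stderr=subprocess.STDOUT).strip()
    except Exception as e:
        print(e)

    if string2search in extInfo:
        stringPresent = True
    else:
        stringPresent = False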
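
The pdftotext call can also be wrapped in small helpers so that the output encoding is explicit and the search ignores whitespace differences. This is only a sketch under some assumptions: the function names (build_pdftotext_cmd, extract_text, contains_text) are hypothetical, pdftotext.exe is assumed to be on the PATH as in step 2, and the -enc UTF-8 option is used to request UTF-8 output from pdftotext.

```python
import subprocess

def build_pdftotext_cmd(pdf_path, exe="pdftotext.exe", encoding="UTF-8"):
    """Return the argument list for pdftotext; '-' sends the text to stdout."""
    return [exe, "-enc", encoding, pdf_path, "-"]

def extract_text(pdf_path, exe="pdftotext.exe"):
    """Run pdftotext and return the decoded text, or u'' if the call fails."""
    try:
        raw = subprocess.check_output(build_pdftotext_cmd(pdf_path, exe))
    except (OSError, subprocess.CalledProcessError) as exc:
        # OSError: executable not found; CalledProcessError: non-zero exit code
        print("pdftotext failed: %s" % exc)
        return u""
    # Decode the UTF-8 bytes requested via -enc; undecodable bytes are replaced
    return raw.decode("utf-8", "replace")

def contains_text(haystack, needle):
    """Substring check that ignores all whitespace on both sides."""
    def squash(s):
        return u"".join(s.split())
    return squash(needle) in squash(haystack)
```

Usage would then look like stringPresent = contains_text(extract_text(filePath), string2search). Removing whitespace from both the extracted text and the search string keeps the comparison symmetric, since line breaks in the extracted text rarely match the original layout.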
