I have been using PyPDF2 with Python 2.7 to extract the text from this PDF file (generated with pdfTeX-1.40.0), and it works fine. Now I have to extract text from the same PDF generated with LibreOffice 4.3, and the result looks like this (excerpt):
˜ ! ˜"!#$ %
˘ˇˆ˙˝
ˇ
˝%&˘
%'%
˛˚˛˜ !
"#$#"%$&
'##()˛˚˛
˛˚˛˜ !"#$#"%$%
*+!
This is my code:
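import PyPDF2

# filePath and string2search are defined elsewhere
pdfFileObj = open(filePath, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageText = ""
for pageID in range(0, pdfReader.numPages):
    pageObj = pdfReader.getPage(pageID)
    # extractText() returns unicode; encode it to a UTF-8 byte string
    pageText = pageText + "\n" + str(pageObj.extractText().encode('utf-8'))
extInfo = ""
# iterating over a string yields single characters, so this copies pageText
for line in pageText:
    extInfo = extInfo + line
pdfFileObj.close()
# search for the target string with its spaces removed
if string2search.replace(' ', '') in extInfo:
    stringPresent = True
else:
    stringPresent = False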
Is there a simple working solution for a Windows machine? I found this topic about the problem, but it has no solution. I have also tried PDFMiner as described in this topic, but I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
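As far as I can tell, this error can be reproduced without any PDF library. Below is a minimal sketch, assuming the extracted text gets implicitly converted to ASCII somewhere (for example when it is printed to the console or written to a file without an explicit encoding). The helper name `to_utf8_bytes` is hypothetical and is not part of PDFMiner or PyPDF2:

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduces the UnicodeEncodeError with a plain string,
# then shows that an explicit UTF-8 encode does not raise it.


def to_utf8_bytes(text):
    """Hypothetical helper: encode text explicitly as UTF-8 bytes."""
    return text.encode('utf-8')


# A short string that starts with u'\xe9', the character from the error.
sample = u'\xe9t\xe9'

try:
    # This is what an implicit ASCII conversion effectively attempts.
    sample.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly as UTF-8 works for any Unicode character.
encoded = to_utf8_bytes(sample)
```

If that assumption holds, encoding the text explicitly before printing or writing it should avoid the error, although it would not fix the garbled characters in the PyPDF2 output above.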