Got excellent help from BrTH with Python PDF reading HERE

The problem now is that I'm dealing with foreign (Brazilian Portuguese) texts. I got a fatal "UnicodeDecodeError" and resolved it by adding this to BrTH's code:

 import codecs

 #next lines are inside getPDFContent function
 content += pdf.getPage(i).extractText() + "\n"
 content = content.decode("utf-8")

The problem now is that "print getPDFContent(f)" returns only two blank lines for every PDF I try it on. Obviously the keyword search only returns "False". Could you please help me again?

  • Do you know that the document is actually in UTF-8? IIRC, Acrobat defaults to "WinAnsi" on Windows and MacRoman on Mac if all of the characters fit, and otherwise to a special "Identity-H" where it stores the glyphs in an arbitrary order and uses indices into that glyph table as code points. – abarnert Aug 03 '14 at 12:57
  • So I try using .decode("WinAnsi")? Where can I get the PDF encoding? – Sarchophagi Aug 03 '14 at 13:00
  • I don't know if PDFs embed their encoding, or, if so, how to extract it; sorry. Anyway, 'WinAnsi' isn't a real charset, but I think it usually means cp1252. But the bigger problem may be that whatever `extractText()` is doing gets confused by the encoding before it even gets to your code, if it's a heuristic function. (I don't know much about PyPDF.) – abarnert Aug 03 '14 at 14:32
  • Also, are you sure the documents really are text, and not just vector or bitmap images of text? If you open the file in Acrobat, can you select and copy text? – abarnert Aug 03 '14 at 14:33
  • Hold on, reading [the docs](http://pybrary.net/pyPdf/pythondoc-pyPdf.pdf.html), it seems like PyPDF tries to decode the data and return it to you as `unicode` (I'm assuming Python 2.x here; correct me if I'm wrong). If so, you shouldn't be calling `decode` on it. Can you `print type(content) ` before the `decode` and see whether it's `str` or `unicode`? Also, can you `print repr(content)` to see whether there's anything coming out in the first place? If it's empty, or a long string of control characters, you're not solving the right problem. – abarnert Aug 03 '14 at 14:37
  • The PDFs are fully selectable text with a watermark bitmap at the top right of each page. Does that interfere? Also `print type(content)` prints `` and `print repr(content)` prints `'u\n'`. Any clues? – Sarchophagi Aug 03 '14 at 17:09
  • Just so you know, I'm using Python 2.7, and I removed the `decode` part to perform the above tests. I've also commented out the original code line `content = " ".join(content.replace("\xa0", " ").strip().split())` to avoid `UnicodeDecodeError` errors, since I'm no longer using `decode`. – Sarchophagi Aug 03 '14 at 17:20
  • Tried [THIS](http://mensenhandel.nl/files/pdftest2.pdf) Google-found test PDF and it's printing fine. It's gotta be the encoding of my files :( Or can it be some kind of PDF lock? – Sarchophagi Aug 03 '14 at 19:32
  • OK, the problem isn't the encoding of the file; PyPDF is taking care of that automatically. You're breaking that by using `+ "\n"` and `" ".join` instead of `+ u"\n"` and `u" ".join`. If you don't understand why, read the Unicode HOWTO in the Python docs. But that isn't the real problem; you're not getting back any text. – abarnert Aug 03 '14 at 22:46
  • I got that already and understand why. But what about the real problem, any clue why I'm not getting any text back? Again, the PDFs are true selectable text, and I got the text successfully from a test PDF using the same code... – Sarchophagi Aug 04 '14 at 02:37
  • Well, the documentation for PyPDF says that it uses heuristics to extract the text. Heuristics, by their very nature, can never be 100% correct. Can you try different libraries and see what they provide? For example, PDFMiner is supposed to have more extensive heuristics, or, if you're on a platform that Adobe supports, you may be able to script their app. If PyPDF can't read the documents you want, then PyPDF can't read the documents you want, and unless you're hoping to contribute to the PyPDF project, I don't know what else you're hoping for. – abarnert Aug 04 '14 at 07:52
  • Tried PDFMiner and it works! – Sarchophagi Aug 04 '14 at 11:11
  • `pypdf` has received many updates in 2022. This + moving to Python3 likely fixes the issues mentioned here – Martin Thoma Feb 11 '23 at 09:46
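
The string-handling advice from the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions, not PyPDF's or PDFMiner's API: the helper names `to_text`, `normalize` and `has_extractable_text` are hypothetical, and `cp1252` as the fallback encoding is an assumption taken from the remark that "WinAnsi" usually means cp1252. The sketch decodes only when the value is still a byte string, does the whitespace cleanup with text literals, and flags the case where extraction returned nothing but whitespace.

```python
def to_text(value, encoding="cp1252"):
    """Return text, decoding only if `value` is still a byte string.

    `encoding` is an assumed fallback; text that is already decoded
    is returned unchanged, so it is never decoded twice.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def normalize(value, encoding="cp1252"):
    """Collapse whitespace (including non-breaking spaces) to single spaces."""
    text = to_text(value, encoding)
    # Text literals (u"...") keep every operand the same type, so Python 2
    # never attempts an implicit ASCII decode of a byte such as "\xa0".
    return u" ".join(text.replace(u"\xa0", u" ").split())


def has_extractable_text(value):
    """True if anything other than whitespace came back from extraction."""
    return bool(normalize(value))


# A page whose extraction yields only a newline has no usable text.
empty_page = u"\n"
page_has_text = has_extractable_text(empty_page)
```

If `has_extractable_text` is false for a page whose text can be selected in a viewer, the encoding is probably not the culprit, and trying a different extraction library, as suggested above, is the more promising route.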

0 Answers