Can't get text out of PDF file with PyPDF2

Question

I am trying to extract the text from a PDF file I downloaded, using PyPDF2. Here is my code:

import PyPDF2
reader = PyPDF2.PdfFileReader('download.pdf')  # open the file once and reuse the reader
if not reader.isEncrypted:
    reader.getPage(0).extractText()

This is the output:

'\n\n˘ˇ˘ˆ˙\n˝˛˚˜!\n\n\n\n#\nˇ˘ˆ˙ˆ˝˛˝\n˙˙˘ ˘ˆ"˝\n$!%˙(˝)˙*˜+,˝-.#/.(#0)0)/.1.+02345.\n˛˛ˇ/#.$/0/70/#.+322.32˙˘˛˘˘\n˛˘ 8˙˘9:˘ˆ;\n˛˘\n\n˝=\n˙˘˛\n.ˇ<9:˘ˇˇ%˘˛ˇ ˘˘<˘\n˝>"?˝˘$@<˘*ˆˆ˘˙˘A˘B˘˙˘˛ˇ!˛˘˙˘˛ˇ˘\n1C˙ˆ˘06˛˘8+˛9:˘D10+E˝ˆ˘8\n$˘˘9:˘˘1C˙ˆ˘+˘F˛˘D$1+FE˝˘˛˘˘<˘?˝\n////)*˘1˘˛ ?GG˜*HI\nD˘˙A˘E\nJ$\n˛\nDLE///M˛˝˛˙˘˛˘˛\n˛˘˛>"?\n˙˘˛\n˛\n/)M6;˝˛˙˘˛˘\n˛\n///˛\n\n'

When I open the file, its content displays fine, and another program converts the PDF to text without problems. The PDF is rendered by JavaScript on a web page; I don't know whether that makes a difference.

Hi, is the pdf a generated one or is it a scan of a printed page for example? — Baedsch, Oct 11 '18 at 13:57

score 2 · Answer 1 · answered Nov 02 '18 at 16:56

On Windows 7 with Python 3.6, I had the problem that PyPDF2 did not properly decode the text of some PDF files. My solution was to use pdfminer.six instead.

pip install pdfminer.six

To extract text from a PDF, you can use functions such as the one in this post: https://stackoverflow.com/a/42154976/9524424

Worked perfectly for me.
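
For reference, here is a minimal sketch of what this can look like with pdfminer.six's high-level API, assuming a release recent enough to ship `pdfminer.high_level.extract_text`. The helper name `pdf_to_text` is made up for illustration, and this is not the code from the linked post:

```python
from pathlib import Path


def pdf_to_text(pdf_path, page_numbers=None):
    """Return the text of a PDF file using pdfminer.six.

    page_numbers is an optional list of zero-based page indices;
    None means every page is extracted.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported here so that a missing file is reported before a
    # missing pdfminer.six installation.
    from pdfminer.high_level import extract_text

    return extract_text(str(path), page_numbers=page_numbers)
```

Calling `pdf_to_text('download.pdf', page_numbers=[0])` should then return the text of the first page only.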

Peter

This was helpful. Not sure why this question got downvoted, seems like a real problem and a real solution. – WJA Apr 08 '20 at 08:44

That was helpful for me as well. Pity that the question was not marked as answered. There are plenty of similar questions as well. – pedrez Oct 05 '21 at 14:07

score 0 · Answer 2 · answered Oct 11 '18 at 13:54

The following is taken from the documentation (https://pythonhosted.org/PyPDF2/PageObject.html)

extractText() Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated. Returns: a unicode string object.

So, it seems that the performance of this function depends on the pdf itself.
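
One way to notice this kind of failure programmatically is a rough sanity check on the returned string. This is only a heuristic sketch; the function name `looks_garbled` and the 0.5 threshold are my own assumptions, not part of PyPDF2:

```python
import unicodedata


def looks_garbled(text, min_readable_ratio=0.5):
    """Heuristically flag extracted text that is mostly symbols.

    Counts letters and digits among the non-whitespace characters.
    Empty output is also treated as a failed extraction.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True

    def is_readable(ch):
        category = unicodedata.category(ch)
        # "L*" categories are letters and "N*" are numbers. "Lm"
        # (modifier letters) is excluded because characters like
        # ˇ and ˆ show up in the broken output in the question.
        is_letter = category.startswith("L") and category != "Lm"
        return is_letter or category.startswith("N")

    readable = sum(1 for ch in visible if is_readable(ch))
    return readable / len(visible) < min_readable_ratio
```

If it returns True for a page, that is a hint to try a different extractor such as pdfminer.six.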

I don't think the question is about performance in that sense; it sounds to me like it's about not understanding why the text of the PDF file isn't being returned properly from the `extractText()` method. — martineau, Oct 12 '18 at 23:42
