I can use the poppler library to easily create an image from a pdf using:

pdftoppm -png myfile.pdf > myfile.png

I'm now trying to use the python-poppler library to do the same from within Python. After installing the lib (sudo apt-get install python-poppler) I can load in a pdf file using the following:

doc = poppler.document_new_from_file('file://'+urllib(inputF), password=None)

but I now want to load a pdf file from binary. I thought I could use the method poppler.document_new_from_data(), so I tried the following, which returns a type error:

>>> d = poppler.document_new_from_data(userDoc.binary)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: document_new_from_data() argument 1 must be string without null bytes, not Binary

I'm unsure what this means though. What "data" can be a "string without null bytes, not Binary"? I tried checking out the source of the method, but the source files (here) don't even contain a single .py file.

I tried converting the binary to base64, but that leads to an error saying TypeError: Required argument 'length' (pos 2) not found.

Any help would be welcome!

[EDIT] Thanks to the tip of @Vaulstein I now got a bit further:

s = binascii.a2b_base64(userDoc.binary)
r = poppler.document_new_from_data(s, len(s), password='')
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error (3): Illegal character <75> in hex string
Syntax Error (4): Illegal character <df> in hex string
Syntax Error (5): Illegal character <5d> in hex string
Syntax Error (6): Illegal character <28> in hex string
Syntax Error (7): Illegal character <6e> in hex string
Syntax Error (8): Illegal character <3f> in hex string
Syntax Error (9): Illegal character <ca> in hex string
Syntax Error (10): Illegal character <89> in hex string
Syntax Error (11): Illegal character <db> in hex string
>>> r = poppler.document_new_from_data(s, len(s), password='')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
GError: PDF document is damaged

But it still doesn't seem to be the correct encoding. Any other idea how I can do this?

kramer65
  • Have you tried `binascii.a2b_base64(data)` ? – Vaulstein May 13 '15 at 11:06
  • @Vaulstein - I just tried that, and that indeed brings me a bit further. I now get a `GError: PDF document is damaged`. I added a new part to the question. Any idea what could be wrong now? – kramer65 May 13 '15 at 14:52
  • You should check the link [poppler](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=668777). – Vaulstein May 14 '15 at 06:24
  • @Vaulstein - Thank you for that link. So I now understand that the problem is that "the very first object in such PDF is a stream starting with "<" character", and the solution should be "adding a dummy object to the PDF fixes the problem". Would you know how I can "add a dummy object to the PDF"? Any tips on that would be very welcome! – kramer65 May 14 '15 at 07:39
  • you can find the difference on the link between two documents [test1.tex](https://bugs.debian.org/cgi-bin/bugreport.cgi?msg=3;filename=test1.tex;att=2;bug=668777) and [test2.tex](https://bugs.debian.org/cgi-bin/bugreport.cgi?msg=3;filename=test2.tex;att=3;bug=668777) this is how the dummy object was added: `\immediate\pdfobj stream {}` – Vaulstein May 14 '15 at 09:15

1 Answer

The poppler_document_new_from_data call requires the whole binary data, including 0 bytes, to be passed as the first argument, as a char* (which typically is a str in Python 2). You discovered a bug in poppler-python. As @Vaulstein pointed out in a comment, it has been reported upstream but is unresolved.

As a workaround, either store the PDF to a file and use the ..new_from_file call, or use the gi.repository.Poppler module instead. (The module comes with PyGObject; see e.g. here for an example, and here's the documentation for poppler_document_new_from_data.)
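
A minimal sketch of the first workaround, assuming Python 3 and that the raw PDF bytes are already in hand. The helper name pdf_bytes_to_file_uri is hypothetical and not part of any poppler binding; it uses only the standard library to write the bytes to a temporary file and build the file:// URI that a ..new_from_file call expects.

```python
import os
import tempfile
from pathlib import Path


def pdf_bytes_to_file_uri(data: bytes) -> str:
    """Write raw PDF bytes to a temporary file and return its file:// URI.

    Hypothetical helper for illustration only; the caller is responsible
    for deleting the temporary file once poppler has loaded it.
    """
    # delete=False keeps the file on disk after the handle is closed,
    # so another library can open it by path afterwards.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(data)
    # as_uri() percent-encodes the path and prepends the file:// scheme.
    return Path(tmp.name).as_uri()
```

The returned URI should then be usable in place of the 'file://' + path string from the question, e.g. as the first argument to poppler.document_new_from_file. Since the bytes never pass through the binding as a string, null bytes in the PDF should not matter on this route.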

Phillip
  • @Philip: I think that the bug has already been reported but not resolved. It is present in the list of bugs. – Vaulstein May 18 '15 at 10:17