How to print a pdf file to stdout using python?

Question

A correct pdf file has been created by a script (whose output can't be directly written to stdout, unfortunately). Say the file's name is 'myfile.pdf'.

I want to print the exact pdf content to stdout. (No processing in between).

To test this, I have written this short read_pdf.py script:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

with open('myfile.pdf', mode='rb') as pdf_file:
    for line in pdf_file:
        print(str(line))

I use the 'rb' mode because reading the file in text mode raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 10: invalid continuation byte. So binary mode appears to be the only option.

Now of course the problem is that the output consists of b'blablabla' lines that cannot be used as a pdf file. To check it, I redirect read_pdf.py to a file and try to open it with a pdf viewer and of course it doesn't work:

$ ./read_pdf.py > test_output.pdf
$ evince test_output.pdf
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table

So, what is the right way to do it? I haven't looked at any dedicated PDF library because it doesn't seem necessary: I'd like to read and print the exact content without importing a PDF library for that.

chardet.detect(pdf_file.read()) couldn't help (it returned {'encoding': None, 'confidence': 0.0}).

EDIT:
* I'm looking for a solution for Python 3 on a Linux/Unix system, not Windows.
* I need to know how to do this in Python because it's actually part of a bigger project written entirely in Python.

Possible duplicate: http://stackoverflow.com/questions/2374427/python-2-x-write-binary-output-to-stdout – Robᵩ Jul 05 '16 at 18:41
@armandino because it's actually part of a bigger project entirely written in python – zezollo Jul 06 '16 at 05:27
@Robᵩ except this is for python3 and not about Windows. I will add these clarifications to the question. – zezollo Jul 06 '16 at 05:44

rll · Answer 1 · 2016-07-06T14:05:14.883

0

I think your problem is that you are reading line by line, so print() adds an extra newline after each line. I tried the following and it works perfectly on OSX:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

with open('myfile.pdf', mode='rb') as pdf_file:
    print(pdf_file.read())

For the sake of completeness: as noted by @zezollo, on Linux the output is still corrupted when the print function is used, so it is necessary to write directly to the binary buffer underlying stdout:

import sys

with open('myfile.pdf', mode='rb') as pdf_file:
    sys.stdout.buffer.write(pdf_file.read())
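
As a rough illustration of the difference between the two approaches, the sketch below uses an in-memory stand-in for stdout (an io.TextIOWrapper around an io.BytesIO) so the written bytes can be inspected. The payload and the variable names are made up for the example; this is not a real PDF.

```python
import io

payload = b"%PDF-1.4\n\xd0\xd4\n%%EOF\n"  # illustrative bytes, not a real PDF

# Stand-in for sys.stdout: a text wrapper around a binary buffer.
raw_a = io.BytesIO()
fake_stdout_a = io.TextIOWrapper(raw_a, encoding="utf-8")

# Writing to the underlying buffer passes the bytes through unchanged.
fake_stdout_a.buffer.write(payload)
fake_stdout_a.flush()
via_buffer = raw_a.getvalue()

# print() converts its argument to text first, so the repr (b'...') is
# what ends up in the stream, followed by a newline.
raw_b = io.BytesIO()
fake_stdout_b = io.TextIOWrapper(raw_b, encoding="utf-8")
print(payload, file=fake_stdout_b)
fake_stdout_b.flush()
via_print = raw_b.getvalue()

print(via_buffer == payload)  # True
print(via_print == payload)   # False
```

The same layering applies to the real sys.stdout in Python 3: sys.stdout is the text layer and sys.stdout.buffer is the binary layer beneath it.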

edited Jul 06 '16 at 14:05

answered Jul 05 '16 at 18:58


This is dead simple, and better than my attempts, but the output is still "enclosed" inside a `b' '`. The output cannot be read by the pdf viewer. So I have naively tried to print `str(pdf_file.read())[2:-1]` instead. This looks good, but cannot be read by the pdf viewer either. – zezollo Jul 05 '16 at 19:41
I was expecting to have the same behavior in OSX and Linux, but apparently there is some difference in the print implementation. Glad it helped. – rll Jul 06 '16 at 14:07
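
A minimal sketch of why the `b' '` wrapper appears in Python 3 and why slicing it off with `[2:-1]` does not recover the file. The sample bytes are invented for the example; only built-in behavior of str() on bytes objects is used.

```python
# A few bytes standing in for the start of a PDF file (illustrative only).
data = b"%PDF-1.4\n\xd0\xd4"

# str() on a bytes object returns its repr: the text b'...' in which
# non-printable bytes are spelled out as escape sequences such as \n or \xd0.
as_text = str(data)
print(as_text)

# Encoding that text back to bytes does not give the original content.
round_trip = as_text.encode("utf-8")
print(round_trip == data)  # False

# Dropping the leading b' and the trailing ' leaves the escape sequences
# in place, so the result still differs from the original bytes.
stripped = as_text[2:-1]
print(stripped.encode("utf-8") == data)  # False
print(len(data), len(stripped))
```

In other words, the text produced by str() is a description of the bytes, not the bytes themselves, which is why the viewer rejects it.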

score 0 · Answer 2 · answered Jul 06 '16 at 05:48

The answer is actually to use sys.stdout.buffer.write() instead of print(), and to pass it the whole content returned by pdf_file.read():

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import sys

with open('myfile.pdf', mode='rb') as pdf_file:
    sys.stdout.buffer.write(pdf_file.read())
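
For large files, a possible variation is to copy the file in chunks rather than loading it all with read(). The sketch below does this with shutil.copyfileobj from the standard library; the helper name copy_binary and the sample bytes are assumptions made for the example, and the self-check writes to an in-memory stream instead of stdout.

```python
import io
import os
import shutil
import sys
import tempfile


def copy_binary(src_path, out=None):
    """Copy the file at src_path to a binary stream, chunk by chunk."""
    if out is None:
        # Binary layer beneath the text-mode sys.stdout.
        out = sys.stdout.buffer
    with open(src_path, mode="rb") as src:
        shutil.copyfileobj(src, out)
    out.flush()


# Self-check using a temporary file and an in-memory sink.
sample = b"%PDF-1.4\n\xd0\xd4\n%%EOF\n"  # illustrative bytes, not a real PDF
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(sample)

sink = io.BytesIO()
copy_binary(tmp.name, sink)
os.remove(tmp.name)

print(sink.getvalue() == sample)  # True
```

In the read_pdf.py script this would be called as copy_binary('myfile.pdf'), and the redirection ./read_pdf.py > test_output.pdf would then be expected to produce a byte-for-byte copy.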
