A valid PDF file has been created by a script (whose output can't be written directly to stdout, unfortunately). Say the file's name is 'myfile.pdf'.
I want to write the exact PDF content to stdout, with no processing in between.
To test this, I have written this short script, read_pdf.py:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
with open('myfile.pdf', mode='rb') as pdf_file:
    for line in pdf_file:
        print(str(line))
I use the 'rb' mode because reading the file in text mode raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 10: invalid continuation byte. So binary mode looks like the only option left.
Now of course the problem is that the output consists of b'blablabla' lines, which cannot be used as a PDF file. To check this, I redirect the output of read_pdf.py to a file and try to open it with a PDF viewer, and of course it doesn't work:
$ ./read_pdf.py > test_output.pdf
$ evince test_output.pdf
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
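The b'...' wrapping can be reproduced in isolation, which suggests the viewer errors come from str() rather than from the file itself. A minimal sketch, assuming a made-up sample line (the bytes below are hypothetical, not taken from myfile.pdf):

```python
# Hypothetical sample line; real PDF lines usually hold arbitrary binary data.
line = b'%PDF-1.4\n'

# str() on a bytes object returns its repr (prefix, quotes, escapes),
# not the decoded content.
as_text = str(line)
print(as_text)

# What print(str(line)) would send to stdout, compared with the real bytes.
emitted = (as_text + '\n').encode('ascii')
print(emitted == line)
```

Every line written this way gains a b prefix, quotes and escape sequences, so the redirected file no longer matches the original byte for byte.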
So, what is the right way to do it? I haven't looked at any PDF-specific library because it doesn't seem necessary: I'd like to read the file and print its content unchanged, without importing a PDF library for that.
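One direction that seems to fit these constraints is to bypass the text layer and write the bytes to sys.stdout.buffer, the binary stream underneath sys.stdout in Python 3. The following is only a sketch: dump_binary is a hypothetical helper name, and it assumes sys.stdout has not been replaced by an object without a .buffer attribute.

```python
#!/usr/bin/env python3
import shutil
import sys


def dump_binary(path, out=None):
    """Copy the file at `path` to `out` byte for byte (hypothetical helper)."""
    if out is None:
        # Binary stream under the text-mode sys.stdout (assumed to exist).
        out = sys.stdout.buffer
    with open(path, mode='rb') as src:
        # Copies in chunks, with no decoding or newline translation.
        shutil.copyfileobj(src, out)
    out.flush()


# Intended use in read_pdf.py:
# dump_binary('myfile.pdf')
```

With that call in place, ./read_pdf.py > test_output.pdf should produce a file identical to myfile.pdf, which can be checked with cmp myfile.pdf test_output.pdf.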
chardet.detect(pdf_file.read()) didn't help (it returned {'encoding': None, 'confidence': 0.0}).
EDIT:
* I'm looking for a solution for Python 3 on a Linux/Unix system, not Windows.
* I need to know how to do this in Python because it's actually part of a bigger project written entirely in Python.