I am trying to re-save a PDF with Ghostscript (to correct errors that PyPDF2 can't handle). I'm calling Ghostscript with subprocess.check_output, and I want to pass the original PDF in on STDIN and get the new one back on STDOUT.

When I have Ghostscript save the PDF to a file and read that file back in, it works fine. When I read the output from STDOUT instead, PyPDF2 fails to parse it. I think this could be an encoding issue, but I don't want to encode anything to text; I just want binary data. Maybe there's something about encodings I don't understand.

How can I make the STDOUT data work like the file data?

import subprocess
from PyPDF2 import PdfFileReader
from io import BytesIO
import traceback

input_file_name = "SKMBT_42116071215160 (1).pdf"
output_file_name = 'saved2.pdf'
# input_file = open(input_file_name, "rb") # Moved below.

# Write to a file, then read the file back in. This works.
try:
    ps1 = subprocess.check_output(
        ('gs', '-o', output_file_name, '-sDEVICE=pdfwrite', '-dPDFSETTINGS=/prepress', input_file_name),
        # stdin=input_file # [edit] We pass in the file name, so this only confuses things.
    )
    # I use BytesIO() in this example only to make the examples parallel.
    # In the other example, I use BytesIO() because I can't pass a string to PdfFileReader().
    fakeFile1 = BytesIO()
    fakeFile1.write(open(output_file_name, "rb").read())
    inputpdf = PdfFileReader(fakeFile1)
    print inputpdf
except:
    traceback.print_exc()

print "---------"
# input_file.seek(0) # Added to address one comment. Removed while addressing another.
input_file = open(input_file_name, "rb")

# Export to STDOUT. This doesn't work.
try:
    ps2 = subprocess.check_output(
        ('gs', '-o', '-', '-sDEVICE=pdfwrite', '-dPDFSETTINGS=/prepress', '-'),
        stdin=input_file,
        # shell=True # Using shell produces the same error.
    )
    fakeFile2 = BytesIO()
    fakeFile2.write(ps2)
    inputpdf = PdfFileReader(fakeFile2)
    print inputpdf
except:
    traceback.print_exc()

Output:

   **** The file was produced by:
   **** >>>> KONICA MINOLTA bizhub 421 <<<<
<PyPDF2.pdf.PdfFileReader object at 0x101d1d550>
---------
   **** The file was produced by:
   **** >>>> KONICA MINOLTA bizhub 421 <<<<
Traceback (most recent call last):
  File "pdf_file_reader_test2.py", line 34, in <module>
    inputpdf = PdfFileReader(fakeFile2)
  File "/Library/Python/2.7/site-packages/PyPDF2/pdf.py", line 1065, in __init__
    self.read(stream)
  File "/Library/Python/2.7/site-packages/PyPDF2/pdf.py", line 1774, in read
    idnum, generation = self.readObjectHeader(stream)
  File "/Library/Python/2.7/site-packages/PyPDF2/pdf.py", line 1638, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: "7-8138-11f1-0000-59be60c931e0'"
Travis
  • On Windows, stdout needs to be configured as binary, like here: http://stackoverflow.com/questions/2374427/python-2-x-write-binary-output-to-stdout . Not sure it helps, but it's worth a try. – Jean-François Fabre Jul 13 '16 at 18:34
  • Worth mentioning, but I don't think that's the solution in this case. I'm using OS X, and I don't know of a similar setting that I could change. – Travis Jul 13 '16 at 18:37
  • Not sure, but is it normal that you don't rewind `input_file` between the two calls (the one that works and the one that doesn't)? – Jean-François Fabre Jul 13 '16 at 19:15
  • I found that it didn't matter (I also tried using two file objects just to see). I edited my code for good measure, though. – Travis Jul 13 '16 at 19:22
  • I figured PyPDF must be rewinding it. Really, though, there's no need to use `input_file` in the first example since the input is given as a filename instead of STDIN, so I removed it. (Still doesn't work.) – Travis Jul 13 '16 at 19:48
  • Makes sense. I would still try writing fakeFile1 and fakeFile2 to disk and comparing them byte by byte. – Jean-François Fabre Jul 13 '16 at 20:21
  • I didn't think that looking at binary files would mean anything to me, but I discovered there was a bunch of error output from Ghostscript mixed into the second file. See the answer I posted. – Travis Jul 13 '16 at 22:51

1 Answer

It turns out this has nothing to do with Python; it's a Ghostscript issue. As pointed out in the post "Prevent Ghostscript from writing errors to standard output", Ghostscript writes its error messages to stdout, where they get mixed into the PDF data being piped out and corrupt it.
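
Here is a minimal sketch of one way to keep those messages out of the piped PDF. It relies on two documented Ghostscript switches, `-q` (quiet) and `-sstdout=%stderr` (redirect PostScript-level stdout to stderr). The helper names `build_gs_command` and `resave_pdf` are my own, and I am assuming these two switches cover whatever messages this particular file triggers, which I have not verified.

```python
import subprocess


def build_gs_command(pdf_settings="/prepress"):
    """Build a Ghostscript argv that reads a PDF on stdin and writes one to stdout.

    Hypothetical helper; the switch list is an assumption about what is needed.
    """
    return [
        "gs",
        "-q",                # suppress routine startup and informational messages
        "-sstdout=%stderr",  # send PostScript-level stdout messages to stderr
        "-o", "-",           # write the output document to stdout
        "-sDEVICE=pdfwrite",
        "-dPDFSETTINGS=" + pdf_settings,
        "-",                 # read the input document from stdin
    ]


def resave_pdf(pdf_bytes):
    """Pipe pdf_bytes through Ghostscript and return the rewritten PDF bytes."""
    proc = subprocess.Popen(
        build_gs_command(),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,  # diagnostics are collected here, apart from the PDF
    )
    out, err = proc.communicate(pdf_bytes)
    if proc.returncode != 0:
        raise RuntimeError("gs failed: " + err.decode("utf-8", "replace"))
    return out
```

With this, something like `PdfFileReader(BytesIO(resave_pdf(open(input_file_name, "rb").read())))` would replace the second `try` block in the question. Keeping stderr on its own pipe means the warnings are still available for logging without touching the PDF bytes.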

Thanks to @Jean-François Fabre who suggested I look in the binary files.
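
For anyone who wants to repeat that check, here is a rough sketch of the kind of inspection that exposed the problem. Both function names are hypothetical. Note that two separate pdfwrite runs will normally differ somewhere anyway (for example in timestamps or document IDs), so the first differing offset is only a place to start looking, not proof of corruption.

```python
def first_difference(a, b):
    """Return the first offset at which two byte strings differ, or None if equal."""
    limit = min(len(a), len(b))
    for i in range(limit):
        # Slice instead of indexing so this behaves the same on Python 2 and 3.
        if a[i:i + 1] != b[i:i + 1]:
            return i
    # No mismatch in the shared prefix: equal, or one is a prefix of the other.
    return None if len(a) == len(b) else limit


def describe_pdf_bytes(data):
    """Report quick structural facts about bytes that are supposed to be a PDF."""
    header_at = data.find(b"%PDF-")
    return {
        "starts_with_header": header_at == 0,
        # Anything before the header is not part of the PDF (e.g. stray messages).
        "leading_junk": data[:header_at] if header_at > 0 else b"",
        "has_eof_marker": data.rfind(b"%%EOF") != -1,
    }
```

In the question's terms, `first_difference(fakeFile1.getvalue(), fakeFile2.getvalue())` gives an offset to open in a hex viewer. `describe_pdf_bytes` only catches junk before the header; messages injected in the middle of the stream, as seems to have happened here, need the byte-level look.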

Travis
  • Please mark this answer as accepted so that this question no longer comes up as unresolved. Maybe retitle the question, too? Thanks. – tripleee Jul 14 '16 at 08:44
  • When I do, it says: "You can accept your own answer tomorrow" – Travis Jul 14 '16 at 15:35