
I start with a PDF file that contains extra, unwanted content after the %%EOF marker in its binary data.

In the Python code below I try to remove this unwanted content by reading the PDF file, finding the last occurrence of %%EOF, and writing a new file that omits everything after it.

Code:

def main():
    check_and_correct_file()


def check_and_correct_file():
    # Read file, retrieve content and close file
    read_file = open('any_pdf_extra.pdf', encoding="latin-1", mode='rt')
    file = read_file.readlines()
    read_file.close()

    # Set starting values of variables
    eof_value = '%%EOF'
    line_count = 0
    last_eof = 0

    # Set last occurrence of %%EOF
    for line in file:
        if eof_value in line:
            last_eof = line_count

        line_count += 1

    # Write all content except for data after last occurrence of %%EOF
    write_file = open('any_pdf_fixing.pdf', encoding="latin-1", mode='wt')
    for i in range(0, last_eof + 1):
        write_file.write(file[i])
    write_file.close()

Binary code of original PDF file:

[image: binary view of the original PDF file]

Binary code of processed PDF file:

[image: binary view of the processed PDF file]

As shown in the images, the content after %%EOF is removed, but every LF (line feed) and every CR (carriage return) has been replaced by a CR LF pair, which makes the processed PDF file impossible to open.

Is there a solution for this problem? Or perhaps an alternative way to do this?
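
The symptom described above is consistent with Python's newline translation in text mode. The following is a minimal sketch of that effect, not the original script: the sample bytes and temporary file names are made up for illustration, and `newline="\r\n"` is passed explicitly to imitate what the default text mode does on Windows (an assumption about where the script was run).

```python
import os
import tempfile

# Hypothetical sample bytes: a lone CR, a lone LF and a CR LF pair.
original = b"a\rb\nc\r\n%%EOF\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.bin")
    dst = os.path.join(tmp, "out.bin")
    with open(src, "wb") as f:
        f.write(original)

    # Text mode with the default newline=None (universal newlines):
    # \r, \n and \r\n are all handed back to the program as \n.
    with open(src, encoding="latin-1", mode="rt") as f:
        lines = f.readlines()

    # newline="\r\n" imitates the default text mode on Windows,
    # where every \n is written out as \r\n.
    with open(dst, encoding="latin-1", mode="wt", newline="\r\n") as f:
        f.writelines(lines)

    with open(dst, "rb") as f:
        round_tripped = f.read()

print(lines)
print(round_tripped)
```

After the round trip every line ending in the sample has become CR LF, so byte positions shift even though no visible text changed. In a PDF, which refers to objects by byte offset, that kind of shift is enough to break the file.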

Joeri Verlooy
  • Hi, welcome to Stack Overflow! Make sure to provide your code as formatted text instead of an image. – scharette Jun 12 '18 at 14:50
  • Thanks @scharette ! I'll try to keep that advice in mind for future posts. :) – Joeri Verlooy Jun 12 '18 at 14:57
  • You still have time to edit. – scharette Jun 12 '18 at 14:58
  • PDF is a complex format consisting of internal tables and references, documented in [about 750+ pages (1.7)](https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf) and close to 1000 pages (2.0). You can't simply remove parts of the hex data and expect it to miraculously work. Use a PDF library. Why do you want to remove it at all?!? – Patrick Artner Jun 12 '18 at 15:03
  • @PatrickArtner It is only meant to remove a single line at the end, which should not affect the PDF file, since opening and saving the unprocessed PDF file in Adobe removes that line as well. The reason is that we have an internal tool that has problems opening these unprocessed PDF files. – Joeri Verlooy Jun 12 '18 at 15:11
  • The issue: You treat PDF, a binary format (yes, it does have most of its markers defined using words in ASCII encoding, but by nature it is binary), like a text format, and so you damage it. Instead of reading and writing in text mode, "line by line", read and write it in binary mode and process a byte array buffer of a few K at a time. And properly handle the case of the %%EOF being split between buffers. – mkl Jun 12 '18 at 15:36
  • [reading-binary-file-and-looping-over-each-byte](https://stackoverflow.com/questions/1035340/reading-binary-file-and-looping-over-each-byte) – Patrick Artner Jun 12 '18 at 16:16
  • I got this fixed by opening and saving the file without any alterations with the PyPDF library, which gave the file cleaner formatting and fixed the EOF issue. Thanks for all the answers. – Joeri Verlooy Jun 26 '18 at 09:13
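
The binary-mode approach suggested in the comments can be sketched roughly as follows. This is an illustration under stated assumptions, not the fix the asker ended up using: the helper names `truncate_after_last_eof` and `fix_file` are hypothetical, and the sketch reads the whole file into memory instead of working through small buffers, which sidesteps the split-marker case but assumes the file fits in RAM. It also assumes the unwanted trailing data does not itself contain `%%EOF`.

```python
EOF_MARKER = b"%%EOF"


def truncate_after_last_eof(data: bytes) -> bytes:
    """Return data up to the last %%EOF, keeping one trailing line ending if present."""
    pos = data.rfind(EOF_MARKER)
    if pos == -1:
        raise ValueError("no %%EOF marker found")
    end = pos + len(EOF_MARKER)
    # Keep the end-of-line sequence that directly follows the marker, byte for byte.
    if data[end:end + 2] == b"\r\n":
        end += 2
    elif data[end:end + 1] in (b"\r", b"\n"):
        end += 1
    return data[:end]


def fix_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, dropping everything after the last %%EOF."""
    # Binary mode ("rb"/"wb") means no decoding and no newline translation.
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(truncate_after_last_eof(data))
```

With the file names from the question, the call would be `fix_file('any_pdf_extra.pdf', 'any_pdf_fixing.pdf')`. Because the bytes before the cut are copied unchanged, existing CR, LF and CR LF sequences stay exactly as they were.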

0 Answers