I have a PDF file that contains extra, unwanted content after the final %%EOF marker in its binary data.
In my Python code I try to remove this content by reading the PDF file, finding the last occurrence of %%EOF, and writing a new file that omits everything after it.
Code:
def main():
    check_and_correct_file()


def check_and_correct_file():
    # Read file, retrieve content and close file
    read_file = open('any_pdf_extra.pdf', encoding="latin-1", mode='rt')
    file = read_file.readlines()
    read_file.close()

    # Set starting values of variables
    eof_value = '%%EOF'
    line_count = 0
    last_eof = 0

    # Find the line index of the last occurrence of %%EOF
    for line in file:
        if eof_value in line:
            last_eof = line_count
        line_count += 1

    # Write all content except for data after last occurrence of %%EOF
    write_file = open('any_pdf_fixing.pdf', encoding="latin-1", mode='wt')
    for i in range(0, last_eof + 1):
        write_file.write(file[i])
    write_file.close()


if __name__ == '__main__':
    main()
Binary code of original PDF file:
Binary code of processed PDF file:
As shown in the images, the content after %%EOF is removed, but every LF (line feed) and CR (carriage return) in the file has been replaced by a CR LF pair, so the processed PDF file can no longer be opened.
Is there a way to fix this? Or is there an alternative way to do it?
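
A likely cause is that the file is opened in text mode: with the default newline handling, Python translates CR and CR LF to LF when reading, and writes LF as the platform line separator (CR LF on Windows). One possible alternative, shown here only as a minimal sketch, is to work on the raw bytes instead. The function name `truncate_after_last_eof` and its parameters are made up for illustration, and the choice to keep a single end-of-line sequence after the marker is an assumption about what the trimmed file should look like.

```python
def truncate_after_last_eof(src_path, dst_path, marker=b"%%EOF"):
    """Copy src_path to dst_path, dropping everything after the last marker.

    Hypothetical helper: the name, parameters, and the decision to keep one
    trailing end-of-line sequence are assumptions, not part of the original code.
    """
    # Binary mode: no decoding and no newline translation takes place
    with open(src_path, "rb") as src:
        data = src.read()

    # Locate the last occurrence of the marker
    end = data.rfind(marker)
    if end == -1:
        raise ValueError("no %%EOF marker found")
    end += len(marker)

    # Keep the end-of-line sequence directly after the marker, if there is one
    if data[end:end + 2] == b"\r\n":
        end += 2
    elif data[end:end + 1] in (b"\r", b"\n"):
        end += 1

    # Write the kept bytes back unchanged
    with open(dst_path, "wb") as dst:
        dst.write(data[:end])
    return end
```

With the file names from the code above, this would be called as `truncate_after_last_eof('any_pdf_extra.pdf', 'any_pdf_fixing.pdf')`. Unlike the line-based version, it cuts directly after the marker rather than keeping the rest of the line that contains it. If staying in text mode is preferred, passing `newline=''` to both `open()` calls also disables the newline translation.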