I have taken the code from another thread here that uses the PyPDF2 library to parse and replace the text of a PDF. The example PDF given in that thread is parsed as a PyPDF2.generic.DecodedStreamObject. I am currently working with a PDF, provided by my company, that was created using Microsoft Word's Export to PDF feature. This one is parsed as a PyPDF2.generic.EncodedStreamObject. From exploration, the main difference is that there is what appears to be kerning in some places in the text.
This caused two problems for me with the sample code. Firstly, the check if len(contents) > 0: in main seems to get triggered erroneously, and the code then attempts to use the keys of the EncodedStreamObject dictionary instead of the EncodedStreamObject itself. To work around this, I commented out the if block and used the code in the else block for both cases.
The second problem was that what I assume are kerning markers broke up the text I was trying to replace. I noticed that kerning was not present in every line, so I assumed that the kerning markers were not strictly necessary and tried to see what the output would look like with them removed. The text was structured something like this: [(Thi)4(s)-1(is t)2(ext)]. In replace_text, I replaced the sample code's line replaced_line = line with replaced_line = "[(" + "".join(re.findall(r'\((.*?)\)', line)) + ")] TJ". This preserved the observed structure while allowing the text to be searched for replacements. I verified that this was actually replacing the text of the line.
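For reference, the flattening step described above can be sketched in isolation. This is a minimal sketch, not the code from the linked thread: the helper name flatten_tj is hypothetical, the sample line is the structure shown above with one space added so that the joined result reads naturally, and the regular expression is a slightly stricter variant that tolerates escaped parentheses. It still does not handle nested unescaped parentheses, hex strings, or fonts whose string bytes are not plain text.

```python
import re

# Matches one PDF literal string "( ... )", allowing backslash escapes such as \( and \).
# Assumption: no nested unescaped parentheses and no hex strings like <0041>.
_LITERAL = re.compile(r'\(((?:\\.|[^\\()])*)\)')


def flatten_tj(line):
    """Merge the string fragments of a TJ array into a single string operand.

    The numbers between fragments (glyph position adjustments) are dropped,
    so spacing in the rendered output may change.
    """
    fragments = _LITERAL.findall(line)
    return "[(" + "".join(fragments) + ")] TJ"


sample = "[(Thi)4(s )-1(is t)2(ext)] TJ"
flattened = flatten_tj(sample)
print(flattened)  # [(This is text)] TJ
```

Because the numeric adjustments are discarded, this only makes the text searchable; it does not guarantee that the rendered line looks the same afterwards.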
Neither of those changes prevented the code from executing. However, the output PDF seems to be completely unchanged, even though print statements show that the replaced line contains the new text. I initially assumed this was because of the if block in process_data that determines whether the object is Encoded or Decoded. However, I dug through the actual source code for this library, located here, and it seems that if the object is Encoded, it generates a Decoded version of itself, which is what the if block checks for. My only other idea is that the if block that I commented out in main wasn't erroneously catching my scenario, but was instead handling it incorrectly. I have no idea how I would fix it so that it handles it properly.
I feel like I'm incredibly close to solving this, but I'm at my wits' end as to what to do from here. I would ask the poster of the linked solution in a comment, but I do not have enough reputation to comment on SO. Does anyone have any leads on how to solve this problem? I don't particularly care what library or file format is used, but it must retain the formatting of the Word document I have been provided. I have already tried exporting to HTML, but that removes most of the formatting and also the header. I have also tried converting the .docx to PDF in Python, but that requires me to actually have Word installed on the machine, which is not a cross-platform solution. I also explored using RTF, but from what I found, the solution for that file type is to convert it to a .docx and then to PDF.
Here is the full code that I have so far:
import PyPDF2
import re


def replace_text(content, replacements=dict()):
    lines = content.splitlines()

    result = ""
    in_text = False

    for line in lines:
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = "[(" + "".join(re.findall(r'\((.*?)\)', line)) + ")] TJ"
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result


def process_data(obj, replacements):
    data = obj.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    if obj.decodedSelf is not None:
        obj.decodedSelf.setData(encoded_data)
    else:
        obj.setData(encoded_data)


pdf = PyPDF2.PdfFileReader("template.pdf")
# pdf = PyPDF2.PdfFileReader("sample.pdf")
writer = PyPDF2.PdfFileWriter()

replacements = {
    "some text": "replacement text"
}

for page in pdf.pages:
    contents = page.getContents()

    # if len(contents) > 0:
    #     for obj in contents:
    #         streamObj = obj.getObject()
    #         process_data(streamObj, replacements)
    # else:
    process_data(contents, replacements)

    writer.addPage(page)

with open("output.pdf", 'wb') as out_file:
    writer.write(out_file)
EDIT:
I've somewhat tracked down the source of my problems. The line obj.decodedSelf.setData(encoded_data) does not seem to actually set the data. After that line, I added:
print(encoded_data[:2000])
print("----------------------")
print(obj.getData()[:2000])
The output of the first print statement was different from that of the second, which definitely should not be the case. To test whether this was really true, I replaced every single line with [()], which I know to be valid as there are many lines that are already that. For the life of me, though, I can't figure out why this function call fails to make any lasting changes.
EDIT 2:
I have further identified the problem. In the source code for EncodedStreamObject, the getData method returns self.decodedSelf.getData() if self.decodedSelf is truthy. HOWEVER, after doing obj.decodedSelf.setData(encoded_data), if I do print(bool(obj.decodedSelf)), it prints False. This means that when the EncodedStreamObject is accessed to be written out to the PDF, it re-parses the old PDF data and overwrites the self.decodedSelf object! Short of going in and fixing the source code, I'm not sure how I would solve this problem.
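One possible explanation, offered as an assumption rather than a confirmed diagnosis: if the decoded stream object is a dict subclass (as the stream classes in the PyPDF2 source appear to be), then it is falsy whenever it holds no dictionary keys, regardless of the stream data it carries. A check like if self.decodedSelf: would then treat the cached object as missing and decode the original bytes again. The sketch below reproduces that behaviour with hypothetical stand-in classes (FakeDecodedStream, FakeEncodedStream); it does not use PyPDF2 itself.

```python
class FakeDecodedStream(dict):
    """Hypothetical stand-in for a dict-based decoded stream object."""

    def __init__(self, data=b""):
        super().__init__()  # no dictionary keys, so bool(self) is False
        self._data = data

    def getData(self):
        return self._data

    def setData(self, data):
        self._data = data


class FakeEncodedStream(dict):
    """Hypothetical stand-in that caches its decoded form behind a truthiness check."""

    def __init__(self, raw):
        super().__init__()
        self._raw = raw
        self.decodedSelf = None

    def getData(self):
        if self.decodedSelf:  # truthiness check: an empty dict counts as False
            return self.decodedSelf.getData()
        # Cache is missing or falsy: rebuild it from the original bytes.
        self.decodedSelf = FakeDecodedStream(self._raw)
        return self.decodedSelf.getData()


stream = FakeEncodedStream(b"old text")
stream.getData()                          # creates the cached decoded object
stream.decodedSelf.setData(b"new text")   # edit the cached copy
cache_is_truthy = bool(stream.decodedSelf)
after_edit = stream.getData()             # the cache is rebuilt, the edit is lost
print(cache_is_truthy, after_edit)
```

Under this assumption, a comparison against None (if self.decodedSelf is not None:) rather than a truthiness check would keep the edited data.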
EDIT 3:
I have managed to convince the library to use the decoded version that has the replacements! Inserting the line page[PyPDF2.pdf.NameObject("/Contents")] = contents.decodedSelf before writer.addPage(page) forces the page to have the updated contents. Unfortunately, my previous assumption about the text kerning was incorrect. After I replaced things, some of my text mysteriously disappeared from the PDF. I assume this is because the format is incorrect somehow.
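Put together, the workaround looks roughly like the sketch below. It assumes the PyPDF2 1.x method names used in the code above, a single content stream per page, and content that decodes as UTF-8; the function name rewrite_pdf and its transform parameter are hypothetical. It only addresses the problem of the edits being dropped, not the disappearing text.

```python
def rewrite_pdf(src_path, dst_path, transform):
    """Apply transform (str -> str) to each page's content stream and save the result.

    Assumes PyPDF2 1.x method names, as used in the code above.
    """
    import PyPDF2  # imported lazily so this sketch can be loaded without PyPDF2
    from PyPDF2.generic import NameObject

    reader = PyPDF2.PdfFileReader(src_path)
    writer = PyPDF2.PdfFileWriter()

    for page in reader.pages:
        contents = page.getContents()
        # For an encoded stream, getData() also populates contents.decodedSelf.
        new_data = transform(contents.getData().decode("utf-8")).encode("utf-8")

        decoded = getattr(contents, "decodedSelf", None)
        if decoded is not None:
            decoded.setData(new_data)
            # Point the page at the edited decoded stream so the writer
            # does not go back to the original encoded bytes.
            page[NameObject("/Contents")] = decoded
        else:
            contents.setData(new_data)

        writer.addPage(page)

    with open(dst_path, "wb") as out_file:
        writer.write(out_file)
```

With the code above, a call might look like rewrite_pdf("template.pdf", "output.pdf", lambda text: replace_text(text, replacements)).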
FINAL EDIT:
I figured I'd put this here in case anyone else stumbles across this. I never did manage to get it to work as expected. Instead, I moved to a solution that mimics the PDF with HTML/CSS. If you add the following style tag to your HTML, you can get it to print more like how you'd expect a PDF to print:
<style type="text/css" media="print">
    @page {
        size: auto;
        margin: 0;
    }
</style>
I'd recommend this solution for anyone looking to do what I was doing. There are Python libraries to convert HTML to PDF, but they do not support HTML5 and CSS3 (notably, they do not support CSS flex or grid). You can just print the HTML page to PDF from any browser to accomplish the same thing. This definitely doesn't answer the question, so I felt it best to leave it as an edit. If anyone manages to complete what I have attempted, please post an answer for me and any others.
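For anyone taking the HTML route, the text replacement itself becomes plain templating. The following is a minimal sketch using only the standard library; the template markup and the placeholder names ($title, $body) are made up for illustration and would need to be replaced by markup that mimics the real document.

```python
from html import escape
from string import Template

# Hypothetical template: the placeholders $title and $body are illustrative only.
PAGE = Template("""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style type="text/css" media="print">
    @page {
        size: auto;
        margin: 0;
    }
</style>
</head>
<body>
<h1>$title</h1>
<p>$body</p>
</body>
</html>
""")


def render_page(values):
    """Fill the placeholders with HTML-escaped values.

    substitute() raises KeyError if a placeholder has no value.
    """
    return PAGE.substitute({key: escape(value) for key, value in values.items()})


page_html = render_page({"title": "Some heading", "body": "replacement text"})
```

The resulting string can be written to an .html file, opened in a browser, and printed to PDF from there.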