How to replace text in a PDF using Python?

Question

I have taken the code from another thread here that uses the library PyPDF2 to parse and replace the text of a PDF. The given example PDF in the thread is parsed as a PyPDF2.generic.DecodedStreamObject. I am currently working with a PDF that the company has provided me that was created using Microsoft Word's Export to PDF feature. This generates a PyPDF2.generic.EncodedStreamObject. From exploration, the main difference is that there is what appears to be kerning in some places in the text.

This caused two problems for me with the sample code. Firstly, the line if len(contents) > 0: in main seems to get erroneously triggered and attempts to use the key of the EncodedStreamObject dictionary instead of the EncodedStreamObject itself. To work around this, I commented out the if block and used the code in the else block for both cases.
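
A possible reason the `len(contents) > 0` check misfires, sketched with plain Python stand-ins rather than real PyPDF2 objects: assuming the library's stream objects subclass `dict` and its array objects subclass `list`, a single stream with any dictionary entries has a non-zero length, and iterating over it yields its keys. The helper name `iter_content_streams` is hypothetical, and in a real PDF the array items would be indirect references that still need `getObject()`.

```python
def iter_content_streams(contents):
    """Return the page's content streams as a list, for a single stream or an array."""
    if isinstance(contents, list):
        # Array case: several content streams, one per list item.
        return list(contents)
    # Single-stream case: a dict-like object, so it must not be iterated directly.
    return [contents]


# Stand-ins (assumptions): a dict plays the role of one stream, a list plays an array.
single_stream = {"/Filter": "/FlateDecode", "/Length": 123}
stream_array = [{"/Length": 10}, {"/Length": 20}]

# Both have len() > 0, so a length check cannot tell the two cases apart.
both_non_empty = len(single_stream) > 0 and len(stream_array) > 0

# Iterating over the single stream yields its keys, not stream objects.
keys_seen = list(single_stream)

streams_from_single = iter_content_streams(single_stream)
streams_from_array = iter_content_streams(stream_array)
```

Checking the type instead of the length keeps the single-stream case out of the loop, which is the behaviour the commented-out if block appears to have been aiming for.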

The second problem was that what I assume are kerning markings broke up the text I was trying to replace. I noticed that kerning was not in every line, so I assumed that the kerning markers were not strictly necessary and tried to see what the output would look like with them removed. The text was structured something like so: [(Thi)4(s)-1(is t)2(ext)]. In replace_text, I replaced the sample code's line replaced_line = line with replaced_line = "[(" + "".join(re.findall(r'\((.*?)\)', line)) + ")] TJ". This preserved the observed structure while allowing the text to be searched for replacements. I verified this was actually replacing the text of the line.
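
The merging step described above can be tried on its own, without a PDF. This is a minimal sketch, assuming the text operands are literal strings in parentheses (hex strings such as `<0041>` are not handled); the name `merge_tj_array` is hypothetical. Unlike the `.*?` pattern, this regex tolerates escaped parentheses inside a string. Dropping the numeric adjustments changes glyph spacing, so the rendered text may shift.

```python
import re

# One PDF literal string such as (Thi): allows backslash escapes like \( and \),
# but not unescaped nested parentheses.
LITERAL_STRING = re.compile(r'\(((?:\\.|[^\\()])*)\)')


def merge_tj_array(line):
    """Join the string pieces of a TJ array into one string, dropping the numbers."""
    pieces = LITERAL_STRING.findall(line)
    return "[(" + "".join(pieces) + ")] TJ"


# The pieces are concatenated verbatim, so no spaces are added between them.
merged = merge_tj_array("[(Thi)4(s)-1(is t)2(ext)] TJ")
print(merged)  # [(Thisis text)] TJ

# An escaped closing parenthesis stays inside its string.
merged_escaped = merge_tj_array(r"[(a\))2(b)] TJ")
```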

Neither of those changes prevented the code from executing; however, the output PDF seems to be completely unchanged, even though print statements confirm that the replaced line contains the new text. I initially assumed this was because of the if block in process_data that determines whether the object is Encoded or Decoded. However, I dug through the actual source code for this library located here, and it seems that if the object is Encoded, it generates a Decoded version of itself, which the if block reflects. My only other idea is that the if block that I commented out in main wasn't erroneously catching my scenario, but was instead handling it incorrectly. I have no idea how I would fix it so that it handles it properly.

I feel like I'm incredibly close to solving this, but I'm at my wits' end as to what to do from here. I would ask the poster of the linked solution in a comment, but I do not have enough reputation to comment on SO. Does anyone have any leads on how to solve this problem? I don't particularly care what library or file format is used, but it must retain the formatting of the Word document I have been provided. I have already tried exporting to HTML, but that removes most of the formatting and also the header. I have also tried converting the .docx to PDF in Python, but that requires me to actually have Word installed on the machine, which is not a cross-platform solution. I also explored using RTF, but from what I found the solution for that file type is to convert it to a .docx and then to PDF.

Here is the full code that I have so far:


import PyPDF2
import re


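def replace_text(content, replacements=None):
    # Avoid a mutable default argument; treat None as "no replacements".
    replacements = replacements or {}
    lines = content.splitlines()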

    result = ""
    in_text = False

    for line in lines:
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = "[(" + "".join(re.findall(r'\((.*?)\)', line)) + ")] TJ"
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result


def process_data(obj, replacements):
    data = obj.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    if obj.decodedSelf is not None:
        obj.decodedSelf.setData(encoded_data)
    else:
        obj.setData(encoded_data)


pdf = PyPDF2.PdfFileReader("template.pdf")
# pdf = PyPDF2.PdfFileReader("sample.pdf")
writer = PyPDF2.PdfFileWriter()

replacements = {
    "some text": "replacement text"
}

for page in pdf.pages:

    contents = page.getContents()

    # if len(contents) > 0:
    #   for obj in contents:
    #       streamObj = obj.getObject()
    #       process_data(streamObj, replacements)
    # else:
    process_data(contents, replacements)

    writer.addPage(page)

with open("output.pdf", 'wb') as out_file:
    writer.write(out_file)

EDIT:

I've somewhat tracked down the source of my problems. The line obj.decodedSelf.setData(encoded_data) seems to not actually set the data properly. After that line, I added

print(encoded_data[:2000])
print("----------------------")
print(obj.getData()[:2000])

The first print statement was different from the second print statement, which definitely should not be the case. To really test whether this was true, I replaced every single line with [()], which I know to be valid as there are many lines that are already that. For the life of me, though, I can't figure out why this function call fails to make any lasting changes.

EDIT 2:

I have further identified the problem. In the source code for an EncodedStreamObject, the getData method returns self.decodedSelf.getData() if self.decodedSelf is truthy. HOWEVER, after doing obj.decodedSelf.setData(encoded_data), if I do print(bool(obj.decodedSelf)), it prints False. This means that when the EncodedStreamObject is accessed to be written out to the PDF, it re-parses the old PDF and overwrites the self.decodedSelf object! Short of going in and fixing the source code, I'm not sure how I would solve this problem.
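
A possible explanation for the False result, offered as an assumption rather than something confirmed here: if the decoded stream object is a `dict` subclass that keeps its payload in a separate `_data` attribute, then an instance with no dictionary entries is falsy even when it holds data. The class `FakeDecodedStream` below is a hypothetical stand-in, not PyPDF2's actual class.

```python
class FakeDecodedStream(dict):
    """Hypothetical stand-in: a dict of stream attributes plus a separate _data payload."""

    def __init__(self):
        super().__init__()
        self._data = b""

    def setData(self, data):
        self._data = data


decoded = FakeDecodedStream()
decoded.setData(b"[(replacement text)] TJ")

has_data = decoded._data != b""      # the payload was stored
is_truthy = bool(decoded)            # an empty dict is falsy, regardless of _data
is_present = decoded is not None     # an identity check is unaffected by emptiness

print(has_data, is_truthy, is_present)
```

Under that assumption, a truthiness test on the cached object would discard it and decode the original stream again, while a check written as `is not None` would keep the modified data.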

EDIT 3:

I have managed to convince the library to use the decoded version that has the replacements! By inserting the line page[PyPDF2.pdf.NameObject("/Contents")] = contents.decodedSelf before writer.addPage(page), it forces the page to have the updated contents. Unfortunately, my previous assumption about the text kerning was incorrect. After I replaced things, some of my text mysteriously disappeared from the PDF. I assume this is because the format is incorrect somehow.

FINAL EDIT:

I figured I'd put this here in case anyone else stumbles across this. I never did manage to get it to work as expected. I instead moved to a solution that mimics the PDF with HTML/CSS. If you add the following style tag to your HTML, you can get it to print more like how you'd expect a PDF to print:

<style type="text/css" media="print">
    @page {
        size: auto;
        margin: 0;
    }
</style>

I'd recommend this solution for anyone looking to do what I was doing. There are Python libraries to convert HTML to PDF, but they do not support HTML5 and CSS3 (notably they do not support CSS flex or grid). You can just print the HTML page to PDF from any browser to accomplish the same thing. It definitely doesn't answer the question, so I felt it best to leave it as an edit. If anyone manages to complete what I have attempted, please post an answer for me and any others.
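
With the HTML/CSS route, the text replacement itself becomes ordinary string templating. The following is a minimal sketch using only the standard library; the template layout, the placeholder names `title` and `body`, and the function `render_page` are hypothetical and stand in for whatever markup mimics the original document.

```python
import html
from string import Template

# The print style from the post, embedded unchanged.
PRINT_STYLE = """<style type="text/css" media="print">
    @page {
        size: auto;
        margin: 0;
    }
</style>"""

# Hypothetical page layout; $title and $body are the replaceable fields.
PAGE_TEMPLATE = Template("""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
$print_style
</head>
<body>
<h1>$title</h1>
<p>$body</p>
</body>
</html>
""")


def render_page(title, body):
    """Fill the template, escaping the inserted text so it cannot break the markup."""
    return PAGE_TEMPLATE.substitute(
        print_style=PRINT_STYLE,
        title=html.escape(title),
        body=html.escape(body),
    )


page = render_page("Report for A & B", "replacement text")
```

The resulting string can be written to an .html file and printed to PDF from a browser, as described above.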

You are manipulating `contents`, but then doing `writer.addPage(page)`. Do you have any guarantee that what `getContents` returns is not just a COPY of the contents? I don't know that, but it would explain your results. — Tim Roberts, Feb 26 '21 at 21:29
I have verified that writer.addPage(page) works given the sample PDF that is of type DecodedStreamObject. I have also verified that `obj.setData(encoded_data)` changes the _data field in the object. However, `obj.decodedSelf.setData(encoded_data)` does not change the _data field in the decodedSelf object. Manipulating the _data field does not seem to leave lasting changes either, which is what throws me off guard the most. — Zenith2198, Feb 26 '21 at 21:43
*"After I replaced things, some of my text mysteriously disappeared from the PDF."* - you might want to read [this answer](https://stackoverflow.com/a/60655298/1729265), in particular the section on "Subset fonts", to understand this — mkl, Feb 27 '21 at 08:05
I think your solution only works for PDFs that are not compressed. You could decompress the PDF first, run this replacement code, and then compress the PDF again. — justicepenny, Sep 19 '21 at 23:52
EDIT 3 "page[NameObject("/Contents")] = contents.decodedSelf" before "writer.addPage(page)" worked for me! — Vladimir Simoes da Luz Junior, Sep 30 '21 at 16:25
