
I am trying to modify text in a PDF file. The text can appear as an operand of a Tj or BDC operator in the page's content stream. I find the correct operations, and if I read them back directly after changing them, they show the updated values.

But if I pass the complete page to PdfFileWriter, the change is lost. I suspect I am updating a copy and not the real object: I checked the id() and it was different. Does anyone have an idea how to fix this?

from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import TextStringObject, NameObject, ContentStream
from PyPDF2.utils import b_

reader = PdfFileReader("some.pdf")
writer = PdfFileWriter()

for page_idx in range(0, 1):

    # Get the current page and its contents
    page = reader.getPage(page_idx)

    content_object = page["/Contents"].getObject()
    content = ContentStream(content_object, reader)

    for operands, operator in content.operations:

        if operator == b_("BDC"):

            operands[1][NameObject("/Contents")] = TextStringObject("xyz")

        if operator == b_("Tj"):

            operands[0] = TextStringObject("xyz")

    writer.addPage(page)


# Write the stream
with open("output.pdf", "wb") as fp:
    writer.write(fp)
Martin Thoma
Joe
  • Does `for operands, operator in ` give you a copy from `content`? – stovfl Sep 25 '18 at 15:08
  • Probably, but I am not 100 % sure. This is also what I thought of. But I haven't found a direct way to address objects directly. – Joe Sep 25 '18 at 15:38
  • Can't find `.getObject()` in [PyPDF2.pdf](https://pythonhosted.org/PyPDF2/PageObject.html#PyPDF2.pdf.PageObject). I couldn't understand why you reread from `source`: `content = ContentStream(content_object, source)`. I think at this point you lose the previous `page`, but then call `output.addPage(page)`. – stovfl Sep 25 '18 at 16:15
  • I had a look at the source at github and the type of `page[NameObject('/Contents')]` is `PyPDF2.generic.EncodedStreamObject`, this means its `.getObject()` is from `EncodedStreamObject > StreamObject > DictionaryObject > PdfObject`. So the method called in the end is [this one.](https://github.com/mstamy2/PyPDF2/blob/18a2627adac13124d4122c8b92aaa863ccfb8c29/PyPDF2/generic.py#L103) – Joe Sep 26 '18 at 05:56
  • Just checked the `id()` and the call to `.getObject()` is not needed, it is the same one. – Joe Sep 26 '18 at 05:59
  • Your hint seems to point in the right direction. `content = ContentStream(content_object, source)` does not seem to be the problem. I ran that twice and `content` has the same `id()`. But when I loop over `content.operations`, the `id` of `operands` and `operator` are different. – Joe Sep 26 '18 at 06:31
  • The closest example I found: [a way to rename field name](https://github.com/mstamy2/PyPDF2/issues/407). There is an `.update(...)` function, also not in the docs. Relevant: [how-to-write-to-variable](https://stackoverflow.com/questions/26529269/how-to-write-to-variable-instead-of-to-file-in-python). Another module: [PyMuPDF](https://stackoverflow.com/questions/50306870/place-a-vertical-or-rotated-text-in-a-pdf-with-python) – stovfl Sep 26 '18 at 07:17
  • Ok, thanks. Will look into that. I also asked the question on [github issues](https://github.com/mstamy2/PyPDF2/issues/459). – Joe Sep 26 '18 at 07:45
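
The copy-versus-original question raised in these comments can be illustrated without PyPDF2. The following is a toy sketch, not PyPDF2's actual implementation: `ToyStream` and `ToyContentStream` are hypothetical stand-ins, and the sketch assumes the parsed wrapper builds its own list of operations from the raw bytes. Under that assumption, the loop does mutate the parsed object in place, but the page keeps referencing the raw stream until the parsed object is assigned back.

```python
class ToyStream:
    """Hypothetical stand-in for the raw stream stored on the page."""

    def __init__(self, data):
        self.data = data


class ToyContentStream:
    """Hypothetical stand-in for a parsed view of a stream."""

    def __init__(self, stream):
        # Parsing builds a brand-new list; the raw stream is only read.
        self.operations = []
        for line in stream.data.splitlines():
            operand, operator = line.rsplit(b" ", 1)
            self.operations.append(([operand], operator))

    @property
    def data(self):
        # Serialize the (possibly modified) operations back to bytes.
        return b"\n".join(
            b" ".join(operands + [operator])
            for operands, operator in self.operations
        )


page = {"/Contents": ToyStream(b"(hello) Tj\n(world) Tj")}
parsed = ToyContentStream(page["/Contents"])

for operands, operator in parsed.operations:
    if operator == b"Tj":
        operands[0] = b"(xyz)"  # mutates the list held by `parsed`

# The page still points at the raw stream, so it has not changed.
before_reassign = page["/Contents"].data

# Assigning the parsed object back makes the page see the edit.
page["/Contents"] = parsed
after_reassign = page["/Contents"].data
```

In this toy model the loop variables are not copies: `operands` is the same list stored in `parsed.operations`. The edit is lost only because nothing links `parsed` back to the page.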

1 Answer


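The solution is to assign the modified ContentStream back to the page before passing the page to the PdfFileWriter. The ContentStream is a separate object parsed from the page's stream, so the page keeps pointing at the original, unmodified stream until /Contents is reassigned: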

page[NameObject('/Contents')] = content
writer.addPage(page)

I found the solution reading this and this.
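
Putting the question's loop and the reassignment together gives something like the sketch below. It assumes Python 3 and the PyPDF2 1.x API used in the question (`PdfFileReader`, `getPage`, `addPage`); these names were removed in PyPDF2 3.0. The function name `replace_text_on_pages` and its parameters are made up for illustration, and the `isinstance` guard on the BDC operand is an added precaution, since that operand is not always a dictionary.

```python
def replace_text_on_pages(src_path, dst_path, new_text="xyz"):
    """Hypothetical helper: overwrite Tj/BDC text and save a new PDF."""
    # Imported here so the module loads even where PyPDF2 is missing.
    # Assumes the PyPDF2 1.x API, as in the question.
    from PyPDF2 import PdfFileReader, PdfFileWriter
    from PyPDF2.generic import (
        ContentStream,
        DictionaryObject,
        NameObject,
        TextStringObject,
    )

    reader = PdfFileReader(src_path)
    writer = PdfFileWriter()

    for page_idx in range(reader.getNumPages()):
        page = reader.getPage(page_idx)

        # Pages without a content stream are copied unchanged.
        if "/Contents" not in page:
            writer.addPage(page)
            continue

        # Parse the page's stream into a separate ContentStream object.
        content = ContentStream(page["/Contents"].getObject(), reader)

        for operands, operator in content.operations:
            if operator == b"Tj":
                operands[0] = TextStringObject(new_text)
            elif operator == b"BDC" and isinstance(operands[1], DictionaryObject):
                operands[1][NameObject("/Contents")] = TextStringObject(new_text)

        # The step from the answer: point the page at the modified stream.
        page[NameObject("/Contents")] = content
        writer.addPage(page)

    with open(dst_path, "wb") as fp:
        writer.write(fp)
```

A call such as `replace_text_on_pages("some.pdf", "output.pdf")` would mirror the question's script, except that it processes every page rather than only the first.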

  • Great! Next issue I bumped against is using strings instead of `PyPDF2.pdf.TextStringObject`. It looks OK `repr`-wise, but then raises `AttributeError` (missing `writeToStream` or something) when trying to save the resulting PDF. – Tomasz Gandor Nov 02 '19 at 15:35
  • You could use [createStringObject](https://github.com/mstamy2/PyPDF2/blob/master/PyPDF2/generic.py#L281). – Joe Nov 02 '19 at 17:30
  • Hey man, I am having trouble with this as well. I had a working find-and-replace function that is now non-working due to this issue. I wrote up a question, and KJ pointed me in your direction. Can you take a look? https://stackoverflow.com/questions/72451312/how-does-one-read-a-pdf-into-memory-using-pypdf2 – Chris May 31 '22 at 19:51
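
On the plain-string issue raised in the comments above: a raw Python `str` has no `writeToStream` method, which is what the writer calls on each object when saving. A minimal sketch of the suggested workaround, assuming PyPDF2 1.x where `createStringObject` lives in `PyPDF2.generic` (the wrapper name `to_pdf_string` is hypothetical):

```python
def to_pdf_string(value):
    """Hypothetical wrapper: turn a Python string into a PDF string object."""
    # Imported here so the module loads even where PyPDF2 is missing.
    from PyPDF2.generic import createStringObject

    # createStringObject returns a PDF string object that the writer
    # can serialize, unlike a bare Python str.
    return createStringObject(value)


# Intended use inside the loop from the question:
#     operands[0] = to_pdf_string("xyz")
```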