How to replace text in hidden text layer of pdf?

Question

I need to remove sensitive information from a PDF, in both the image layer and the text layer. I got halfway to the target result using the fitz library (PyMuPDF). This is the code I use, in simplified form.

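import fitz  # PyMuPDF

# path (input file) and color (RGB fill tuple) are defined elsewhere
phrase_to_redact = 'example'
document = fitz.open(path)
for page in document:
  # snake_case method names; older PyMuPDF releases spelled these searchFor / addRedactAnnot
  rects = page.search_for(phrase_to_redact)
  for rect in rects:
    page.add_redact_annot(rect, fill=color)
  page.apply_redactions()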

This code gives me a PDF in which the phrase I want to censor is covered by a filled rectangle. When I select text that includes the covered part and paste it into Notepad, I get the copied piece without the censored word (without the part hidden behind the rectangle). What I would like to achieve is that when the text is copied, there are neutral characters of the length of the deleted word in place of that word. The fitz library also lets me insert another phrase in place of the censored word. Then the code should look like this.

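import fitz  # PyMuPDF

# path (input file) and color (RGB fill tuple) are defined elsewhere
phrase_to_redact = 'example'
document = fitz.open(path)
for page in document:
  rects = page.search_for(phrase_to_redact)
  for rect in rects:
    # text= is drawn inside the redaction rectangle once the redaction is applied
    page.add_redact_annot(rect, text='example_phrase', fill=color)
  page.apply_redactions()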
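
The replacement string for the text argument could be built to match the "neutral characters of the same length" idea. A minimal sketch, assuming a hypothetical helper neutral_mask (not part of PyMuPDF) and an arbitrary fill character. The optional fixed_length parameter is also an assumption, meant for cases where the original word length should not be revealed:

```python
def neutral_mask(phrase, fill_char="x", fixed_length=None):
    """Build a string of neutral characters to stand in for a redacted phrase.

    Hypothetical helper: by default the mask has the same length as the phrase;
    pass fixed_length to use one constant length for every redaction instead.
    """
    length = len(phrase) if fixed_length is None else fixed_length
    return fill_char * length


print(neutral_mask("example"))
print(neutral_mask("yes", fixed_length=5))
```

The result could then be passed as text=neutral_mask(phrase_to_redact) in the call above. Whether the inserted text actually fits depends on the size of the rectangle and the font size.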

With the text argument, the new phrase appears visually in place of the censored word in the PDF, but when I copy a fragment containing the new word, the gap left by the censored word is still empty. To copy the newly inserted word I have to select only that word. I checked how the blocks on the page look after such an edit with this code.

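import fitz  # PyMuPDF

document = fitz.open(path)
for page in document:
  # "dict" output lists text blocks with their lines, spans and bounding boxes
  blocks = page.get_text("dict")["blocks"]
  print(blocks)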

I noticed that the censored word is deleted from the lines in the blocks, and new blocks containing the new phrase are appended to the end of the block list. So the blocks are ordered not by visual position but by the order in which they were added. When I extract text from the whole page, the newly inserted phrases appear at the very end, and it is not clear which word each one replaces.
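
For programmatic extraction only (this does not change what a PDF viewer copies), the spans can be re-sorted by position instead of block order. A minimal sketch, assuming blocks shaped like the "dict" output above, where each span carries "text" and a "bbox" of (x0, y0, x1, y1). The function name text_in_reading_order and the y_tolerance value are hypothetical choices, and the row grouping is a simple heuristic that will not handle multi-column layouts:

```python
def text_in_reading_order(blocks, y_tolerance=3.0):
    """Rebuild page text from spans sorted by position rather than block order."""
    spans = []
    for block in blocks:
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                if span["text"].strip():
                    spans.append(span)

    # Sort top to bottom, then left to right, using each span's top-left corner.
    spans.sort(key=lambda s: (s["bbox"][1], s["bbox"][0]))

    # Group spans whose top edges are within y_tolerance of the row's first span.
    rows = []
    for span in spans:
        top = span["bbox"][1]
        if rows and abs(top - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append(span)
        else:
            rows.append((top, [span]))

    lines = []
    for _, row in rows:
        row.sort(key=lambda s: s["bbox"][0])  # left to right within a row
        lines.append(" ".join(s["text"].strip() for s in row))
    return "\n".join(lines)
```

It could be called as text_in_reading_order(page.get_text("dict")["blocks"]). A replacement phrase appended as the last block would then be placed back on the row where it is drawn.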

Is there any way to replace text in a PDF so that the new text sits in place of the old one when copying and pasting from the edited page? I have searched the internet, but everything I find is about editing the image layer, which does not give the copy behavior I want.

Just a side note: *"What I would like to achieve is that when the text is copied, there are neutral characters of the length of the deleted word in place of that word."* - Are you sure you want that? Consider for example a table with a column with only "yes" and "no" entries: If redaction would leave behind that column blurred but with neutral text two or three characters long to copy in each cell, you would immediately know the original, former content... — mkl, Sep 13 '21 at 15:01
Have you seen this discussion? https://github.com/pymupdf/PyMuPDF/issues/434 — Brian Z, Apr 23 '22 at 03:29
