7

I have a PDF file from which I need to delete certain text and then add new text after some of the existing text. I'm trying to use the PyMuPDF library (fitz). I can open the file and search for the text, but I could not find out how to delete it or how to add new text. Could you please help me delete the found text and add new text after the existing text? The choice of library is not important; PyPDF2 or others are fine. A sample PDF file with a description is attached.

import fitz
  
doc = fitz.open(MyFilePath)
page = doc[0]
  
text1 = "ANA"
# search_for is the current method name; older PyMuPDF releases spell it searchFor
text_instances1 = page.search_for(text1)

# found text should be deleted …

text_to_add = "Text"
text2 = "TAIL NO."
text_instances2 = page.search_for(text2)
  
# should be added "text_to_add" after found text "text2"
  
doc.save(OutputFilePath, garbage=4, deflate=True, clean=True)


S.I.J
a_shvechkov

2 Answers

0

The library doesn't officially support adding or deleting text in a PDF document. However, a recorded issue describes a workaround for this. You can see the answer here from the author of the library on how to get around this using a Text Modification method.

It also worries me that the documentation for the library seems to be unavailable. I'm not sure whether this is permanent, but if so, you should consider using a different library. See the answers here on the best alternative library - Add text to Existing PDF using Python

AzyCrw4282
  • 2
    Beware that that workaround explicitly is only *for text coded in ASCII* or Latin. If you eventually get arbitrary input documents, you cannot count on that, even if the text only used characters from the ASCII range. – mkl Jul 09 '20 at 05:02
  • See the section on [Extracting text from PDFs](https://pikepdf.readthedocs.io/en/latest/topics/content_streams.html#extracting-text-from-pdfs) of the [`pikepdf`](https://github.com/pikepdf/pikepdf) documentation for some background on what one might encounter. – Stefan Schmidt Jul 01 '23 at 22:03
0

disclaimer: I am the author of borb, the library used in this answer

Replacing text in a PDF is hard (as you have no doubt found out). The problem is that a PDF contains (in the worst case) only the rendering instructions needed to put content on the page.

Your document might contain (in pseudo-code):

  1. go to 80, 700
  2. set the active font to Helvetica, size 12
  3. render the characters "Hell"
  4. move to 120, 700
  5. render the characters "o"
  6. move to 130, 700
  7. render the characters "World"

As you can see, there is no concept of "words". Letters can be rendered wherever they happen to be needed. Spaces don't need to be included either: the software that creates a PDF can simply tell the renderer to move the cursor along the x-axis.
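To make this concrete, here is a minimal sketch of how a text extractor might stitch such positioned runs back into words. It does not use borb or any PDF library. The `TextRun` class, the `group_into_words` function, the run widths and the gap tolerances are all assumptions made up for illustration; real extractors derive widths from font metrics.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TextRun:
    x: float      # left edge where the run starts
    y: float      # baseline of the run
    width: float  # rendered width (assumed here; normally taken from font metrics)
    text: str


def group_into_words(runs: List[TextRun],
                     gap_tolerance: float = 1.0,
                     line_tolerance: float = 0.5) -> List[str]:
    """Merge runs that touch on the same baseline; a larger gap starts a new word."""
    words: List[str] = []
    current: Optional[Tuple[str, float, float]] = None  # (text, right edge, baseline)
    # read top to bottom, then left to right
    for run in sorted(runs, key=lambda r: (-r.y, r.x)):
        if (current is not None
                and abs(run.y - current[2]) <= line_tolerance
                and run.x - current[1] <= gap_tolerance):
            # the run continues the current word
            current = (current[0] + run.text, run.x + run.width, current[2])
        else:
            if current is not None:
                words.append(current[0])
            current = (run.text, run.x + run.width, run.y)
    if current is not None:
        words.append(current[0])
    return words


# the three runs from the pseudo-code above, with hypothetical widths
runs = [
    TextRun(80, 700, 40, "Hell"),
    TextRun(120, 700, 7, "o"),
    TextRun(130, 700, 50, "World"),
]
words = group_into_words(runs)
print(words)
```

With these assumed widths, "o" starts exactly where "Hell" ends, so the two are merged, while the gap before "World" is wide enough to count as a space.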

In order to replace text, you first need to find it.

from borb.pdf import PDF
from borb.toolkit import RegularExpressionTextExtraction

# RegularExpressionTextExtraction implements EventListener
# EventListener processes rendering events
# you can pass a regular expression to RegularExpressionTextExtraction
# and it will keep track of where that content appears
l: RegularExpressionTextExtraction = RegularExpressionTextExtraction("ANA")

# now we need to load the PDF
with open("input.pdf", "rb") as fh:
    PDF.loads(fh, [l])

# Now we can access the locations of the match(es).
# I am only going to use the first one, but feel free
# to update my code to take into account all matches
#
# A match can have multiple bounding boxes
# for instance if the regular expression could be matched over
# multiple lines of text.
print(l.get_matches_for_page(0)[0].get_bounding_boxes()[0])

Next step is to remove content at a given location. For this we can use redaction. Redaction erases content in a PDF.

from decimal import Decimal
import typing

from borb.pdf import PDF
from borb.pdf import Document
from borb.pdf import Page
from borb.pdf.canvas.geometry.rectangle import Rectangle
from borb.pdf.canvas.layout.annotation.redact_annotation import RedactAnnotation

# open the PDF
doc: typing.Optional[Document] = None
with open("input.pdf", "rb") as fh:
    doc = PDF.loads(fh)

# get the first page
# maybe you'll need to modify this to apply it to all pages
# keep that in mind
page: Page = doc.get_page(0)

# add the redaction annotation
# Rectangle takes (x, y, width, height), e.g. a bounding box reported by the search step
page.add_annotation(
    RedactAnnotation(
        Rectangle(Decimal(405), Decimal(721), Decimal(40), Decimal(8))
    )
)

# apply redaction annotations
page.apply_redact_annotations()

# now we can store the PDF again
with open("input_002.pdf", "wb") as out_file_handle:
    PDF.dumps(out_file_handle, doc)
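
Conceptually, applying a redaction means dropping every piece of content that overlaps the redaction rectangle. The following is only a sketch of that idea in plain Python, not how borb implements it: the `Box` type, the `overlaps` and `redact` helpers, and the second run's coordinates are hypothetical. A real implementation works on individual glyphs and rewrites the page's content stream.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    x: float       # lower-left x
    y: float       # lower-left y
    width: float
    height: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes share any area."""
    return (a.x < b.x + b.width and b.x < a.x + a.width
            and a.y < b.y + b.height and b.y < a.y + a.height)


def redact(runs: List[Tuple[Box, str]], area: Box) -> List[Tuple[Box, str]]:
    """Keep only the text runs that do not touch the redacted area."""
    return [(box, text) for box, text in runs if not overlaps(box, area)]


# hypothetical page content: one run inside the redaction rectangle, one outside
runs = [
    (Box(405, 721, 40, 8), "ANA"),
    (Box(59, 721, 60, 8), "TAIL NO."),
]
remaining = redact(runs, Box(405, 721, 40, 8))
print([text for _, text in remaining])
```

Only the run outside the rectangle survives, which is why the rectangle should match the bounding box of the text you want gone as closely as possible.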

Lastly, we need to put some content back in the PDF, at the location that we previously removed content from.

from decimal import Decimal
import typing

from borb.pdf import PDF
from borb.pdf import Document
from borb.pdf import Page
from borb.pdf import Paragraph
from borb.pdf.canvas.geometry.rectangle import Rectangle

# load the redacted PDF written by the previous step
doc: typing.Optional[Document] = None
with open("input_002.pdf", "rb") as fh:
    doc = PDF.loads(fh)

# add a Paragraph at an absolute location
# fmt: off
r: Rectangle = Rectangle(
    Decimal(59),                # x: 0 + page_margin
    Decimal(848 - 84 - 100),    # y: page_height - page_margin - height_of_textbox
    Decimal(595 - 59 * 2),      # width: page_width - 2 * page_margin
    Decimal(100),               # height
    )
# fmt: on

# the next line of code uses absolute positioning
page: Page = doc.get_page(0)
Paragraph("Hello World!").paint(page, r)

# store the PDF
with open("output.pdf", "wb") as fh:
    PDF.dumps(fh, doc)
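
The rectangle arithmetic in the snippet above relies on PDF's default coordinate system, whose origin is the bottom-left corner of the page, so the y value is measured upwards. As a small sketch, the same calculation can be written as a helper; `top_anchored_box` and its parameter names are hypothetical and not part of borb.

```python
from decimal import Decimal
from typing import Tuple


def top_anchored_box(page_width: int, page_height: int,
                     margin_x: int, margin_y: int,
                     box_height: int) -> Tuple[Decimal, Decimal, Decimal, Decimal]:
    """Return (x, y, width, height) for a text box placed below the top margin."""
    x = Decimal(margin_x)                             # start at the left margin
    y = Decimal(page_height - margin_y - box_height)  # lower edge of the box
    width = Decimal(page_width - 2 * margin_x)        # leave a margin on both sides
    return (x, y, width, Decimal(box_height))


# the same numbers as in the snippet above
box = top_anchored_box(595, 848, 59, 84, 100)
print(box)
```

The four values can be passed to `Rectangle` in the same order. To place the new text where content was redacted, use the bounding box of the match instead.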

borb is an open source, pure Python PDF library that creates, modifies and reads PDF documents. You can download it using:

pip install borb

Alternatively, you can build from source by forking/downloading the GitHub repository.

Joris Schellekens