11

I have code that hides parts of a PDF by covering them with a white polygon, but the text is still there: if you Ctrl-F, you can still find it.

My goal is to actually remove the text from the PDF itself. Using pdfminer I managed to extract the text from the PDF, but I don't know if it's possible to actually "replace" the text with, say, just some empty spaces. Is such a thing possible using Python? Extracting it isn't enough; I need the text to be removed from the PDF.

Wallace
  • with the specific tools, of course it is possible! look at this link I found on a short googling... https://www.binpress.com/manipulate-pdf-python/ – Cut7er Sep 15 '18 at 17:06
  • To quote from @Ryan's deleted answer (leaving out the advertisement part): *Yes, this is usually called Redaction, and involves completely removing text/graphics from the PDF file.* Redaction of PDFs is not trivial, so I don't know whether there are any free Python redaction tools. – mkl Aug 13 '19 at 11:36
  • Honestly guys, in the end, after about 5 hours of trying different methods via Python, I realized the smarter thing to do was to just use Adobe to redact the text. I appreciate whoever put up the bounty, but I'm way past the point of actually checking whether these new solutions will work. – Wallace Aug 14 '19 at 03:06
  • hi, @Wallace can you share any references how you achieved your goal programmatically with adobe redact? – Sandip Kumar Aug 14 '19 at 09:02
  • Hey, sorry for the late reply, this was about a year ago so I don't have any references. But Adobe (premium version) has a feature that can redact given text in one location on every single page of the document; I used that feature. I actually forgot the exact name of the Adobe software I used because it was on a different laptop. I'm sorry. – Wallace Aug 22 '19 at 20:49

4 Answers

1

I used pdf-redactor in one of my projects and it works pretty well.

Here is an example of how to redact Social Security Numbers from the text layer.
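For readers without the linked example, here is a minimal sketch of what such a filter might look like. It assumes pdf-redactor's `RedactorOptions` exposes `content_filters`, `input_stream` and `output_stream` as in the project's example script at the time of writing; the simplified SSN pattern, the `mask_ssn` helper and the `redact_file` wrapper are illustrative names, not part of the library.

```python
import re

# Simplified pattern for SSNs written as 123-45-6789 (illustrative only;
# the library's own example uses a much stricter expression).
SSN_PATTERN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")
SSN_MASK = "XXX-XX-XXXX"


def mask_ssn(text):
    """Replace every SSN-shaped substring with a fixed mask."""
    return SSN_PATTERN.sub(SSN_MASK, text)


def redact_file(in_path, out_path):
    """Hypothetical wrapper: run pdf-redactor on files instead of stdin/stdout."""
    import pdf_redactor  # third-party; imported here so mask_ssn works without it

    options = pdf_redactor.RedactorOptions()
    # Each content filter is a (compiled regex, replacement function) pair
    # applied to the page's text layer.
    options.content_filters = [(SSN_PATTERN, lambda match: SSN_MASK)]
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        # Assumption: these attributes default to stdin/stdout and can be overridden.
        options.input_stream = src
        options.output_stream = dst
        pdf_redactor.redactor(options)


print(mask_ssn("SSN: 123-45-6789"))
```

The regex can be checked on plain strings with `mask_ssn` before pointing `redact_file` at a real document.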

Dragos Vasile
  • How to use this library directly from python (without going over stdin and stdout?) – Stiefel Mar 12 '21 at 15:34
  • You can import the module in a script with `import pdf_redactor` . Check this example: https://github.com/JoshData/pdf-redactor/blob/primary/example.py – Dragos Vasile Mar 12 '21 at 15:40
  • But it then still uses stdin from the command line. I solved it adding another function to pdf_redactor that accepts an input and output filename. – Stiefel Mar 12 '21 at 16:02
  • However, it does not work for my pdf. It creates a new pdf, but the text is not replaced. I checked with a simple sample (http://www.africau.edu/images/default/sample.pdf) and there it worked. – Stiefel Mar 12 '21 at 16:03
  • Also, using qpdf --stream-data=uncompress did not help. I know the pdf file was created from a simple MS Word file, so it should not be too exotic. – Stiefel Mar 12 '21 at 16:18
0

This is somewhat memory-intensive, but you can copy everything in the PDF except the part you are removing, then overwrite the file with the new version, which no longer contains that part. You can do this with PyPDF by retrieving a page's content stream and finding and removing the relevant parts.

PyPDF docs https://pythonhosted.org/PyPDF2/PageObject.html?highlight=getcontents#PyPDF2.pdf.PageObject.getContents;

PDF standard https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf pg 78, pg 81;
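To give an idea of what "finding and removing the relevant parts" means at the content-stream level, here is a toy sketch using only the standard library. It assumes you already have the decoded (uncompressed) content stream as bytes, and that the text is drawn with a simple literal string followed by the `Tj` operator. Real PDFs often use `TJ` arrays, hex strings, or custom font encodings, which this does not handle. The function name and sample stream are made up for illustration.

```python
import re

# A literal-string operand followed by the Tj (show text) operator.
# Handles backslash escapes, but not nested unescaped parentheses.
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def blank_matching_text(stream, needle):
    """Replace every `(...) Tj` whose string contains `needle` with `() Tj`."""
    def repl(match):
        if needle in match.group(1):
            return b"() Tj"  # draw an empty string instead
        return match.group(0)  # leave other text untouched
    return SHOW_TEXT.sub(repl, stream)


# Made-up content stream: two lines of text in font /F1.
sample = b"BT /F1 12 Tf 72 720 Td (Invoice 42) Tj 0 -14 Td (Total due) Tj ET"
cleaned = blank_matching_text(sample, b"Invoice")
print(cleaned)
```

The cleaned bytes would then have to be written back as the page's content stream with whatever PDF library you use; that step is library-specific and not shown here.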

Xander Bielby
0

I know I am late, but for future readers, here is a workaround I found to resolve this using PyMuPDF. This solution successfully deletes text from the PDF.

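import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # placeholder name for the source PDF
page = doc.load_page(0)

# find every occurrence of the text on the page
draft = page.search_for("Invoice")

# mark each hit for redaction
for rect in draft:
    page.add_redact_annot(rect)

# apply all redactions once, after the loop, leaving images untouched
page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)

# then save the doc to a new PDF:
doc.save("new.pdf", garbage=3, deflate=True)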
Eternal
-2

Is such a thing possible? Yes, although it is not recommended. In my opinion, your best bet is to open and read your existing file, convert it to an editable format, remove whatever text you don't want, and then convert it back.

However, you could extract the data and remove it from memory by using:

import PyPDF2 

# creating a pdf file object 
pdfFileObj = open('example.pdf', 'rb') 

# creating a pdf reader object 
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 

# printing number of pages in pdf file 
print(pdfReader.numPages) 

# creating a page object 
pageObj = pdfReader.getPage(0) 

# extracting text from page 
print(pageObj.extractText()) 

# closing the pdf file object 
pdfFileObj.close() 

Line by line, this program would:

pdfFileObj = open('example.pdf', 'rb') Open the example.pdf and save the file object as pdfFileObj.

pdfReader = PyPDF2.PdfFileReader(pdfFileObj) Create a PdfFileReader object, passing it the PDF file object, to get a PDF reader object.

print(pdfReader.numPages) Give the number of pages.

pageObj = pdfReader.getPage(0) Create an object of the PageObject class. The PDF reader object has a function getPage() which takes a page number (starting from index 0) as an argument and returns the page object.

print(pageObj.extractText()) Extract text from the PDF page.

pdfFileObj.close() Close the PDF file object.

The replacement text would simply be "", as you want to remove all instances / cases of a certain piece of text.