3

I have a collection of PDFs that consist of scanned images, which have then been OCR'd. The text is still displayed "graphically" - in other words, the scanned image of the text is still present - and the OCR'd text is "behind the image". This allows the documents to be searched, the text copied, etc.

Due to a nasty (and now resolved) bug in OS X, some of the OCR'd text is corrupted. I'd therefore like to remove the text from the PDF and re-OCR the document. For many non-trivial reasons, I don't want to go down the "re-print the document to a PDF" route: I'd prefer to repair the document in place as much as possible.

As I can't find a PDF utility that will do what I'm asking, and I have a bit of coding experience, I've decided to roll up my sleeves and try to knock together a bit of .NET (C#) code to remove the text.

I've looked at iTextSharp, and I can open a sample document, but where I'm getting stuck is finding (and therefore, removing) just the text in a document. I've looked at various different PDF spec documents and I'm quickly getting lost, and all the examples I've seen for iTextSharp deal with adding objects, graphics or text to a document.

To summarise, all I want to do is find all the blocks of text and remove them, whilst leaving the graphic (originally JPG) images alone. Can anyone tell me what object types I should be looking for, and what hierarchy I should be iterating through, to achieve this?

KenD
  • 2
    See this post: http://stackoverflow.com/a/12687519/231316. It is 100% possible, although iTextSharp doesn't have any direct built-in helpers to do it. You'll need to be familiar with the PDF syntax and basically walk the document, picking and choosing what you want to keep and/or get rid of. I've done similar ones, but as Bruno pointed out there are many edge cases that you need to be aware of. If your text is always on a dedicated OCG "layer" then you might also be able to use http://stackoverflow.com/a/17718641/231316 – Chris Haas Nov 24 '13 at 17:14

3 Answers

4

Adapting the approach from "How to find and replace text in a existing PDF file with PDFTK (or other command line application)", I was able to delete the rendered text by using pdftk and sed. This is surely not fully general, but it was a quick hack for my needs.

I ended up with:

pdftk my_input.pdf output - uncompress | sed -e 's/\[.*\]TJ/()Tj/' -e 's/(.*)Tj/()Tj/' | pdftk - output my_output.pdf compress

This uncompresses the streams into plain text, where I find uses of (blah)Tj and [blah]TJ and replace each with an empty string, then converts back to compressed binary. The original unedited input is a valid PDF file, but the edited text in the middle of the pipeline is not, so the final pdftk pass also does some magic to fix up the output so that it is valid again. This will not work with extended characters without some new patterns.
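To see what the two substitutions do without touching a real file, the sketch below runs them over a hand-written fragment of an uncompressed content stream. The fragment and the function name strip_text_ops are made up for illustration. Both replacements are written as ()Tj here, because TJ expects an array operand rather than a string.

```shell
# Hypothetical helper: blank out the operands of the Tj and TJ
# text-showing operators, line by line.
strip_text_ops() {
  sed -e 's/\[.*\]TJ/()Tj/' -e 's/(.*)Tj/()Tj/'
}

# Made-up fragment of an uncompressed page content stream:
# one text object (BT ... ET) followed by an image drawn with Do.
sample='BT
/F1 12 Tf
(Hello world)Tj
[(Hel)-20(lo)]TJ
ET
q 612 0 0 792 0 0 cm /Im0 Do Q'

printf '%s\n' "$sample" | strip_text_ops
```

The two text-showing lines come out as ()Tj, while the font selection and the image line pass through unchanged. Real streams may put whitespace between the operand and the operator, or split an operand across lines, and these patterns would miss both cases.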

robmacl
  • Unfortunately, I get the error `sed: RE error: illegal byte sequence Error: Failed to open PDF file: - Errors encountered. No output created. Done. Input errors, so no output created.`. This is on OS X Mavericks – KenD Mar 22 '14 at 09:37
  • You can try the individual parts of the processing pipe to narrow the problem down. Looks like the initial uncompress of the PDF failed. Likely either your PDF is broken or it uses some feature not supported by pdftk. – robmacl Jul 14 '14 at 21:45
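Running the stages separately, as suggested in the comment above, might look like the sketch below. The function name strip_text_staged and the intermediate file names are hypothetical. The "illegal byte sequence" message is printed by sed rather than pdftk, and on OS X it typically appears when sed meets bytes that are not valid in the current locale's encoding; LC_ALL=C is a commonly used workaround and is included here as an assumption, not as a verified fix for this particular file.

```shell
# Hypothetical wrapper: run the pipeline one stage at a time and keep the
# intermediate files, so that a failing stage can be inspected.
strip_text_staged() {
  src=$1
  dst=$2
  tmp=$(mktemp -d) || return 1

  # Stage 1: uncompress the streams into a plain-text-editable file.
  pdftk "$src" output "$tmp/plain.pdf" uncompress ||
    { echo "stage 1 (uncompress) failed" >&2; return 1; }

  # Stage 2: blank the Tj/TJ operands. LC_ALL=C (an assumption, see above)
  # makes sed treat the input as raw bytes instead of locale-encoded text.
  LC_ALL=C sed -e 's/\[.*\]TJ/()Tj/' -e 's/(.*)Tj/()Tj/' \
    "$tmp/plain.pdf" > "$tmp/edited.pdf" ||
    { echo "stage 2 (sed) failed" >&2; return 2; }

  # Stage 3: recompress and let pdftk repair the edited file.
  pdftk "$tmp/edited.pdf" output "$dst" compress ||
    { echo "stage 3 (compress) failed" >&2; return 3; }

  echo "intermediate files kept in $tmp" >&2
}
```

Calling strip_text_staged my_input.pdf my_output.pdf then reports which stage failed, and the files left in the temporary directory show how far the processing got.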
2

A long answer can be seen on: https://unix.stackexchange.com/questions/171940/how-can-i-convert-a-scanned-pdf-with-ocred-text-to-one-without-ocred-text#answer-181644

My short straight-to-go answer is this:

This is my first answer here, after spending so much time on this site looking for answers. I'm using Ubuntu 18.04, and I had OCR'd a PDF file. It looked fine, but it still had the images: apparently the OCR tool I (and perhaps you too) used has the purpose of adding a layer of text so you can search for text within the file. The binaries necessary for this answer are at https://github.com/coherentgraphics/cpdf-binaries

So, after I OCRd the file, I used the cpdf binaries with the following command:

cpdf -draft ./MySourcePDF.pdf -o MyFinalPdf.pdf

... from the documentation:

"The -draft option removes bitmap (photographic) images from a file, so that it can be printed with less ink. Optionally, the -boxes option can be added, filling the spaces left blank with a crossed box denoting where the image was. This is not guaranteed to be fully visible in all cases (the bitmap may be have been partially covered by vector objects or clipped in the original). For example:

cpdf -draft -boxes in.pdf -o out.pdf..."

So I didn't use the -boxes option. After that, I just opened the file with LibreOffice Draw and exported it as PDF. You can actually do a lot more in there. I hope this saves somebody from going through what I did today: 8 hours trying to fix an OCR'd PDF file for the person I share my life with...

I also tried opening the PDF directly in LibreOffice, but the process's resource usage climbs so high that the PC becomes unusable.

Kathiresan Murugan
1

Printing the PDF in Apple Preview appears to remove the OCR text as a side effect. Throw in AppleScript and you've got an automated solution.

pdb
  • To quote from the question: "I don't want to go down the "re-print the document to a PDF" route"... – mkl Apr 17 '18 at 03:28