I have a collection of PDFs that consist of scanned images, which have then been OCR'd. The text is still displayed "graphically" - in other words, the scanned image of the text is still visible - and the OCR'd text sits "behind the image". This allows the documents to be searched, the text copied, etc.
Due to a nasty (and now resolved) bug in OS X, some of the OCR'd text is corrupted. I'd therefore like to remove the text from the PDF and re-OCR the document. For many non-trivial reasons, I don't want to go down the "re-print the document to a PDF" route: I'd prefer to repair the document in place as much as possible.
As I can't find a PDF utility that will do what I'm asking, and I have a bit of coding experience, I've decided to roll up my sleeves and try to knock together a bit of .NET (C#) code to remove the text.
I've looked at iTextSharp, and I can open a sample document, but where I'm getting stuck is finding (and therefore removing) just the text in a document. I've looked at various PDF spec documents and I'm quickly getting lost, and all the examples I've seen for iTextSharp deal with adding objects, graphics or text to a document.
To summarise, all I want to do is find all the blocks of text and remove them, whilst leaving the scanned (originally JPG) images alone. Can anyone tell me what object types I should be looking for, and what hierarchy I should be iterating through, to achieve this?