
Simple question I hope - I have a pdf and want to detect the co-ordinates of specific word(s) or placeholder text. I then intend to use itextsharp to stamp a replacement bit of text on top at the co-ordinates found.

Can anyone recommend anything please?

Thanks

dbc
Ash
  • Do you know the PDF is text-searchable? – Kevin May 18 '21 at 15:39
  • https://stackoverflow.com/questions/2375674/itextsharp-how-to-get-the-position-of-word-on-a-page – Kevin May 18 '21 at 15:47
  • Thank you @Kevin - although you linked to an answer about the older iText 5. You can do the same with the current iText 7. The code might be slightly different but I don't have an example at hand. – Amedee Van Gasse May 19 '21 at 08:47
  • Hi, yes I know you can do text searches, but most solutions I've seen are not 'accurate' enough and sometimes give co-ords of the start of the sentence the search text is in, due to the way text is stored in chunks in pdfs. I don't think I've looked at this specific post in the past so I will review it, but I've burned a lot of time with itextsharp lately trying different methods out and thought it might be time to buy a commercial solution instead, if such a thing exists. – Ash May 19 '21 at 08:50
  • *"but most solutions I've seen are not 'accurate' enough and sometimes give co-ords of the start of the sentence the search text is in due to the way text is stored in chunks in pdfs."* - if that is important to you, why didn't you mention that in your question. E.g. at first glance iText only gives you the chunks but if you look at the API again, you'll find methods to return coordinates of each glyph. – mkl May 27 '21 at 11:09
  • Well, I did say 'specific words'? - however, I will look at itextsharp again. Thanks. – Ash May 28 '21 at 08:04

1 Answer

As noted in the comments, iText can perform such a task. There may be better solutions, but I doubt it. The issue you mention, i.e. "[itextsharp] sometimes give co-ords of the start of the sentence the search text is in", arises because glyphs are sometimes so close together that their boxes overlap, so I don't see how it could be handled the way you want.

So you can do the following:

  • extend LocationTextExtractionStrategy class and override eventOccurred, for example, as follows:

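     @Override
     public void eventOccurred(IEventData data, EventType type) {
         if (type.equals(EventType.RENDER_TEXT)) {
             TextRenderInfo renderInfo = (TextRenderInfo) data;
             // Obtain all the necessary information from renderInfo, for example
             // the baseline of the whole chunk:
             LineSegment segment = renderInfo.getBaseline();
             // renderInfo.getCharacterRenderInfos() returns one TextRenderInfo
             // per glyph if chunk-level coordinates are too coarse.
             // ...
         }
     }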
    
  • pass an instance of such an extended class to PdfTextExtractor.getTextFromPage as follows:

    PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1), new ExtendedLocationTextExtractionStrategy());
    
  • each time text is encountered while the page is processed, eventOccurred is triggered with a RENDER_TEXT event.

There are some difficulties with such a solution, of course, because the text you want to find and write over could be present in the PDF not as "Text", but as "T", "ex", "t", or even "t", "x", "e", "T". However, since you use iText, you may want to take advantage of one of its products, pdfSweep. This product aims to completely remove unwanted content from a PDF, with that content specified either as locations (which are what you want to obtain, so that is not an option) or as regexes.
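To illustrate the split-chunk problem, here is a minimal sketch in plain Java that does not use iText at all. `SplitWordLocator`, `Chunk` and `Span` are hypothetical names invented for this example. It assumes the chunks have already been sorted into reading order on a single line (so it does not cover the "t", "x", "e", "T" case by itself) and that characters are evenly spaced within a chunk, which is only an approximation of real glyph widths.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of iText: locates a word split across text chunks. */
final class SplitWordLocator {

    /** One rendered chunk: its text, the left edge of its baseline and its width. */
    static final class Chunk {
        final String text;
        final double x;
        final double width;

        Chunk(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Horizontal extent of one match, in the same units as the chunk coordinates. */
    static final class Span {
        final double left;
        final double right;

        Span(double left, double right) {
            this.left = left;
            this.right = right;
        }
    }

    /**
     * Concatenates the chunk texts, searches the result for the word, and maps
     * each match back to approximate horizontal bounds.
     */
    static List<Span> locate(List<Chunk> chunks, String word) {
        StringBuilder text = new StringBuilder();
        List<double[]> charBounds = new ArrayList<>(); // {left, right} per character
        for (Chunk c : chunks) {
            int n = c.text.length();
            for (int i = 0; i < n; i++) {
                // Assumption: every character in a chunk has the same width.
                double step = c.width / n;
                charBounds.add(new double[] {c.x + i * step, c.x + (i + 1) * step});
            }
            text.append(c.text);
        }

        List<Span> result = new ArrayList<>();
        if (word.isEmpty()) {
            return result;
        }
        int from = 0;
        int idx;
        while ((idx = text.indexOf(word, from)) >= 0) {
            double left = charBounds.get(idx)[0];
            double right = charBounds.get(idx + word.length() - 1)[1];
            result.add(new Span(left, right));
            from = idx + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // "Text" is stored as three separate chunks after the chunk "Say ".
        List<Chunk> chunks = Arrays.asList(
                new Chunk("Say ", 0, 40),
                new Chunk("T", 40, 10),
                new Chunk("ex", 50, 20),
                new Chunk("t", 70, 10));
        for (Span s : locate(chunks, "Text")) {
            System.out.println("left=" + s.left + ", right=" + s.right);
        }
    }
}
```

With real per-glyph positions, the even-spacing approximation could be replaced by the actual glyph boxes.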

This is how to create such a regex strategy. It finds all "Dolor" and "dolor" instances in the document and completely removes them from all the streams, so that they are neither visible in a PDF viewer nor found in the underlying PDF objects:

RegexBasedCleanupStrategy strategy = new RegexBasedCleanupStrategy("(D|d)olor").setRedactionColor(ColorConstants.GREEN);
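Before running the cleanup, it can help to check what the pattern actually matches. The sketch below uses only `java.util.regex` and assumes pdfSweep interprets the pattern with the same semantics, which I have not verified here. `DolorRegexDemo` and the sample sentence are made up for illustration. Note that "(D|d)olor" also matches inside longer words such as "dolore"; adding word boundaries (`\b`) restricts it to whole words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo class: lists what a pattern matches in a sample string. */
final class DolorRegexDemo {

    /** Returns each match as "startIndex:matchedText". */
    static List<String> matches(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.start() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "Lorem ipsum dolor sit amet. Dolor! DOLOR dolore";
        // The pattern from the answer: case-sensitive apart from the first letter.
        System.out.println(matches("(D|d)olor", sample));
        // Same pattern restricted to whole words.
        System.out.println(matches("\\b(D|d)olor\\b", sample));
    }
}
```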

This is how to use it:

PdfAutoSweep autoSweep = new PdfAutoSweep(strategy);
autoSweep.cleanUp(pdf); // a PdfDocument instance

And this is how to write some text at the location where the removed text used to be:

for (IPdfTextLocation location : strategy.getResultantLocations()) {
    Rectangle rect = location.getRectangle();
    // do something, for example, write some text
}
Uladzimir Asipchuk
  • This looks like a very complete answer - thanks. I'll try it out now. – Ash May 28 '21 at 08:05
  • After a lot of testing and trying out different approaches, I've found that e-iceblue's SpirePDF seems to be the best bet for this. – Ash Jun 29 '21 at 10:48