
I have a task where I need to extract text that lies behind images and was produced by OCR from the image itself. This text is transparent. The problem is that one image has text behind it that is not OCR text: it is ordinary, non-transparent text. How can I differentiate between the text I need (transparent) and the text I do not need (non-transparent)?

Here is a representative PDF file: https://easyupload.io/rbo333 The image OCR text should be extracted from pages 2, 3, and 12, but text is also extracted from page 4. On page 4 there is no OCR text behind the images, only regular text under the image. I need to filter that out somehow, as I only need the OCR text.

  • You might be interested in [this answer](https://stackoverflow.com/a/20179928/1729265), it shows a proof of concept of a text stripper class ignoring text covered by images. Beware, that answer is quite old, so some details may meanwhile have changed in PDFBox. If you cannot make that `VisibleTextStripper` work for you, please come back and share a representative example PDF for us to test with. – mkl Jul 13 '21 at 14:39
  • @mkl Thank you, I looked into it and it looks like, and correct me if I am wrong, it determines based on coordinates whether the character is in the image. Now that would be good, but I already extract characters that are in the image, as I need to extract searchable text from OCR-ed images. So I need to somehow differentiate between regular text which is "just behind" the image and text which is "on" the image and OCR-d. I edited my question with a link to the representative pdf. – Szőke Attila Jul 14 '21 at 06:51
  • You implicitly differentiate between those cases. Text "behind the image" is drawn before the image is drawn. Thus, that class has found that text when processing the image and removes it. Text "on the image" is drawn after the image. Thus, that class will find it after processing the image and doesn't remove it. – mkl Jul 14 '21 at 07:17
  • Please clarify. At first I thought you only wanted text which is not covered by an image. That's what the old class in the referenced answer extracts. But you now appear not to want that. Maybe it's the other way around and you want only the text covered by an image? Or only the text which is drawn on some image? – mkl Jul 14 '21 at 07:31
  • @mkl Well, there are images which have OCR-ed, searchable text on them. Now I need to extract only that text from the document, which works fine. But there is an image on the fourth page, which has regular text behind it, which is not OCR-ed. I need to differentiate that regular text from the OCR-ed text somehow. I had the impression that the regular text is "behind" the image and the OCR-ed text is "in front" of the image but I am not sure anymore. – Szőke Attila Jul 14 '21 at 07:43
  • @mkl I edited my question as I now understand the task a little better. Hopefully it is clearer now. Sorry for being cryptic; I did not fully understand the problem before. – Szőke Attila Jul 15 '21 at 10:12
  • *"text which are behind images and have been OCR-ed from the image itself. This text is transparent."* - This text is not necessarily _behind the image_, e.g. on page 2 it's in front of the image. As it's transparent, though, you cannot see it. – mkl Jul 15 '21 at 13:59
  • It is fairly easy to extract only text which is in the image area (either behind or in front) but you cannot be sure whether it is OCR'ed text or not. Maybe there first was the text and later the image was added to hide that text... Would that suffice? – mkl Jul 15 '21 at 14:08
  • @mkl I don't think it would in every scenario. I don't know the precise ordering, and the solution should be as general as possible so that it works with every PDF. Interesting: on the second page, the text is on top and transparent? Can that transparency be detected with PDFBox somehow? Could you tell me how it was made transparent? I tried checking character colors, but that gives inconsistent results. – Szőke Attila Jul 16 '21 at 07:44
  • On the second page the text is transparent by using a font in which all glyphs are empty. This might be detectable using the font subproject of PDFBox but I've never dived into that stuff. This also explains why colors give an inconsistent impression - empty glyphs always look the same, no matter what color is selected... – mkl Jul 16 '21 at 09:17
  • @mkl I think I got it, thank you for suggesting the font stuff, that was it. I found a way to detect the character RenderingMode, and OCR image text is RenderingMode.NEITHER, so it can be filtered based on that. I will write an answer about it. – Szőke Attila Jul 20 '21 at 06:29

1 Answer

So the images have transparent text either in front of them or behind them. I thought that meant the text had no color, but @mkl pointed out that it may have a color and simply use empty glyphs. The PDF specification also allows text to have a color even when it is invisible. To be truly invisible, the characters need to be rendered with neither the stroking nor the non-stroking color.

PDFBox has a RenderingMode enum (`org.apache.pdfbox.pdmodel.graphics.state.RenderingMode`) for exactly this purpose, and its NEITHER value denotes text that is neither filled nor stroked, i.e. invisible. I could read it for each character with the help of this answer.
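For orientation, the PDF specification defines eight text rendering modes (operator `Tr`, values 0 to 7). The stand-alone sketch below models them with a hypothetical `TextRenderMode` enum; it is not PDFBox's `RenderingMode` class, only an illustration of which modes paint nothing on the page.

```java
public class TextRenderModeDemo {

    // Hypothetical model of the PDF text rendering modes (Tr operator, values 0-7).
    public enum TextRenderMode {
        FILL(0, true, false),
        STROKE(1, false, true),
        FILL_STROKE(2, true, true),
        NEITHER(3, false, false),
        FILL_CLIP(4, true, false),
        STROKE_CLIP(5, false, true),
        FILL_STROKE_CLIP(6, true, true),
        NEITHER_CLIP(7, false, false);

        private final int tr;
        private final boolean fill;
        private final boolean stroke;

        TextRenderMode(int tr, boolean fill, boolean stroke) {
            this.tr = tr;
            this.fill = fill;
            this.stroke = stroke;
        }

        // Text is invisible when it is neither filled nor stroked,
        // whatever stroking or non-stroking color is currently selected.
        public boolean isInvisible() {
            return !fill && !stroke;
        }

        // Maps the numeric Tr operand to the matching mode.
        public static TextRenderMode fromTr(int tr) {
            for (TextRenderMode mode : values()) {
                if (mode.tr == tr) {
                    return mode;
                }
            }
            throw new IllegalArgumentException("Unknown text rendering mode: " + tr);
        }
    }

    public static void main(String[] args) {
        for (TextRenderMode mode : TextRenderMode.values()) {
            System.out.println(mode + " invisible=" + mode.isInvisible());
        }
    }
}
```

Only modes 3 and 7 come out as invisible in this model. The answer filters on NEITHER alone; whether mode 7 (which only adds the text to the clipping path) should be treated the same way is a judgment call for the documents at hand.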

The solution code looks like this.

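// Assumed declaration: a field of the PDFTextStripper subclass, one entry per character.
private final Map<TextPosition, RenderingMode> characterRenderingModes = new HashMap<>();

@Override
protected void processTextPosition(TextPosition character) {
    // Record the rendering mode of the graphics state this character is drawn with.
    characterRenderingModes.put(character, getGraphicsState().getTextState().getRenderingMode());
    super.processTextPosition(character);
}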

This is an overridden method of the PDFTextStripper class. It is called for every character on the page(s) and records each character's RenderingMode. Later, when needed, I look up the RenderingMode in the map for the characters I need to examine.
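The lookup step described above could be sketched as follows. This is a stand-alone illustration, not the original code: `Glyph` and `Mode` are hypothetical stand-ins for PDFBox's `TextPosition` and `RenderingMode`, and `keepInvisible` is an assumed helper name. It also assumes the map preserves the order in which characters were recorded (for example, a `LinkedHashMap`).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OcrTextFilter {

    // Hypothetical stand-in for PDFBox's RenderingMode (subset of values).
    public enum Mode { FILL, STROKE, FILL_STROKE, NEITHER }

    // Hypothetical stand-in for PDFBox's TextPosition: one extracted character.
    public static final class Glyph {
        private final String unicode;

        public Glyph(String unicode) {
            this.unicode = unicode;
        }

        public String getUnicode() {
            return unicode;
        }
    }

    // Keeps only characters recorded with Mode.NEITHER, in map iteration order.
    public static String keepInvisible(Map<Glyph, Mode> modes) {
        StringBuilder text = new StringBuilder();
        for (Map.Entry<Glyph, Mode> entry : modes.entrySet()) {
            if (entry.getValue() == Mode.NEITHER) {
                text.append(entry.getKey().getUnicode());
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the result follows reading order.
        Map<Glyph, Mode> modes = new LinkedHashMap<>();
        modes.put(new Glyph("O"), Mode.NEITHER); // invisible OCR layer
        modes.put(new Glyph("x"), Mode.FILL);    // regular visible text
        modes.put(new Glyph("K"), Mode.NEITHER); // invisible OCR layer
        System.out.println(keepInvisible(modes)); // prints "OK"
    }
}
```

In a real stripper the same check would run against the recorded map, with the page-4 style regular text dropping out because it is drawn in a filling mode.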

  • Please be aware that inspecting the **RenderingMode** alone works for some documents but doesn't work for others. PDF is a very versatile format and often allows multiple ways to achieve the same effect. – mkl Jul 20 '21 at 07:09
  • @mkl Yes you are right, it does not cover all the possibilities. I had multiple pdfs to try it with and this method had the best results, but if something new turns up I will be sure to update the answer. – Szőke Attila Jul 21 '21 at 07:18