I have a task where I need to extract text that sits behind images and was produced by OCR-ing those images. This OCR text is transparent. The problem is that one image has text behind it that was not produced by OCR: it is just normal, non-transparent text. How can I differentiate between the text I need (transparent) and the text I don't need (non-transparent)?
Here is a representative PDF file: https://easyupload.io/rbo333 OCR text behind images should be extracted on pages 2, 3, and 12, but text is also being extracted on page 4. On page 4 there is no OCR text behind the images, only regular text under an image. I need to filter that out somehow, as I only need the OCR text.
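
To make the distinction concrete, here is a minimal sketch of the kind of check I have in mind. It assumes the "transparent" OCR layer is written with PDF text rendering mode 3 (the `3 Tr` operator, which means neither fill nor stroke), as OCR layers often are; I have not confirmed that this is how the file above encodes it. The function name `classify_text` is made up for illustration, and it expects an already decompressed page content stream as a string, which would have to come from whatever PDF library is in use. The tokenizer is deliberately simplified (no nested parentheses in strings, hex strings are skipped).

```python
import re

# Simplified tokens: literal strings, hex strings, array brackets, everything else.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>|[\[\]]|[^\s\[\]()<>]+")
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")
SHOW_OPS = {"Tj", "TJ", "'", '"'}  # text-showing operators


def classify_text(stream):
    """Return (raw_string, is_invisible) for each text-showing operator.

    Assumption: invisible OCR text is marked by text rendering mode 3.
    """
    mode = 0       # default rendering mode: fill (visible)
    saved = []     # modes saved by q and restored by Q
    operands = []  # operands collected since the previous operator
    result = []
    for tok in TOKEN.findall(stream):
        is_operand = (
            tok[0] in "(<[]" or tok.startswith("/") or NUMBER.fullmatch(tok)
        )
        if is_operand:
            operands.append(tok)
            continue
        # Anything else is treated as an operator.
        if tok == "Tr" and operands:
            mode = int(float(operands[-1]))
        elif tok == "q":
            saved.append(mode)
        elif tok == "Q" and saved:
            mode = saved.pop()
        elif tok in SHOW_OPS:
            # Join the literal strings (also covers the array form used by TJ).
            text = "".join(o[1:-1] for o in operands if o[0] == "(")
            result.append((text, mode == 3))
        operands = []
    return result


if __name__ == "__main__":
    sample = (
        "BT /F1 12 Tf 3 Tr (hidden ocr) Tj ET "
        "q BT 0 Tr (visible) Tj ET Q "
        "BT [(still ) -20 (hidden)] TJ ET"
    )
    for text, invisible in classify_text(sample):
        print(invisible, repr(text))
```

If that assumption holds, keeping only the entries flagged as invisible would drop the regular text on page 4, since that text is drawn normally and is merely covered by the image.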