I have tested the code provided in this thread. It finds all text elements that lie inside an image's bounding box. But how can you distinguish between text behind the image and text on top of the image?

Mark Rotteveel
Orit
  • Have you checked [this older answer](https://stackoverflow.com/a/20179928/1729265)? It provides a proof-of-concept for text extraction that ignores text covered by bitmap images. – mkl Oct 25 '21 at 09:01
  • Yes, I had to port it and replace `context.processSubStream` with `context.showForm`. Unfortunately, it did not lead to the desired results. Maybe I did not return the correct value from `OperatorProcessor.getName()`. What should the value be? I understand you used `"Do"`, but it did not work for me. – Orit Oct 25 '21 at 09:30
  • Sorry, I have re-checked my code and cleaned it up, and now it works! Thank you @mkl! I will paste the updated, ported code in an answer below. – Orit Oct 25 '21 at 09:42
  • *"I had to port it"* - yes, considering the age of that answer, it may well have been a PDFBox 1.x answer. - *"now it works !!!!"* - Great! – mkl Oct 25 '21 at 10:06
  • @mkl The code works for PDF pages with `pdPage.getCropBox().getLowerLeftY() == 0` and `getLowerLeftX() == 0`. Any help with this [old thread comment](https://stackoverflow.com/questions/19809813/how-to-check-if-a-text-is-transparent-with-pdfbox) would be much appreciated: `For a generic solution you have to change this test to something that checks whether the 1x1 square transformed by the Matrix ctm = getGraphicsState().getCurrentTransformationMatrix() overlaps the character box ... ` – Orit Oct 29 '21 at 11:23

1 Answer

Below is the code from the older answer mentioned above, ported to PDFBox 2.0.24. The main changes are:

  • Added the `getName()` method.
  • Replaced `context.processSubStream` with `context.showForm`.
  • Replaced `PDXObjectForm` and `PDXObjectImage` with the new class names `PDFormXObject` and `PDImageXObject`.
  • Replaced `drawer.getResources().getXObjects()` with `drawer.getResources().getXObjectNames()`; the XObjects are now iterated using the names that `getXObjectNames()` returns.
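
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.util.Matrix;

// Handles the "Do" (draw XObject) operator.
// PDFVisibleTextStripper is the text stripper class from the older answer linked above.
public final class CoveredText extends OperatorProcessor
{
    @Override
    public void process(Operator operator, List<COSBase> operands) throws IOException{
        PDFVisibleTextStripper drawer = (PDFVisibleTextStripper)context;
        // Note: this visits every XObject in the current resources on each "Do",
        // not only the XObject named by the operand.
        for (COSName objectName: drawer.getResources().getXObjectNames()) {
            PDXObject xobject = drawer.getResources().getXObject(objectName);
            if ( xobject == null )
            {
                System.out.println("CoveredText.process Can't find the XObject for '"+objectName.getName()+"'");
            }
            else if( xobject instanceof PDImageXObject )
            {
                // Bitmap image: let the stripper hide the text it covers.
                System.out.println("CoveredText.process " + objectName.getName()+" is a PDImageXObject");
                drawer.hide(objectName.getName());
            }
            else if(xobject instanceof PDFormXObject)
            {
                // Form XObject: process its content stream, which may draw images itself.
                PDFormXObject form = (PDFormXObject)xobject;
                System.out.println("CoveredText.process " + objectName.getName()+" is a PDFormXObject at location " + form.getBBox().toString());
                Matrix matrix = form.getMatrix();
                if (matrix != null)
                {
                    // Caution: showForm() may already apply the form matrix itself;
                    // if it does, this step applies the matrix twice.
                    Matrix xobjectCTM = matrix.multiply( context.getGraphicsState().getCurrentTransformationMatrix());
                    context.getGraphicsState().setCurrentTransformationMatrix(xobjectCTM);
                }
                context.showForm(form);
            }
        }
    }

    // The operator name this processor is registered for.
    @Override
    public String getName() {
        return "Do";
    }
}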
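
The distinction between text behind an image and text on top of it seems to come down to paint order: content drawn later covers content drawn earlier. The following is a minimal sketch of that idea in plain Java, without PDFBox. The class `PaintOrderSketch` and its methods `showText`, `drawImage` and `visibleText` are hypothetical names, and the rule that an image hides only text lying completely inside its box is an assumption about what `drawer.hide(...)` does, not something confirmed in this thread.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a text stripper that tracks visibility by paint order.
final class PaintOrderSketch {
    // A piece of text together with its bounding box.
    private static final class TextChunk {
        final String text;
        final Rectangle2D box;
        TextChunk(String text, Rectangle2D box) {
            this.text = text;
            this.box = box;
        }
    }

    private final List<TextChunk> visible = new ArrayList<>();

    // Text is painted on top of everything drawn so far, so it starts out visible.
    void showText(String text, Rectangle2D box) {
        visible.add(new TextChunk(text, box));
    }

    // An image covers text that was painted BEFORE it and lies inside its box.
    // Text shown after this call is not affected.
    void drawImage(Rectangle2D imageBox) {
        visible.removeIf(chunk -> imageBox.contains(chunk.box));
    }

    List<String> visibleText() {
        List<String> result = new ArrayList<>();
        for (TextChunk chunk : visible) {
            result.add(chunk.text);
        }
        return result;
    }

    public static void main(String[] args) {
        PaintOrderSketch page = new PaintOrderSketch();
        page.showText("behind", new Rectangle2D.Double(10, 10, 50, 10));
        page.drawImage(new Rectangle2D.Double(0, 0, 100, 100));
        page.showText("on top", new Rectangle2D.Double(10, 30, 50, 10));
        System.out.println(page.visibleText()); // prints [on top]
    }
}
```

Both text chunks lie inside the image box, but only the one shown before the image is dropped. This only works if the operations are handled in content stream order.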
Orit
  • Can you explain what you changed? Where did the issue come from, and how did you fix it? – Elikill58 Oct 25 '21 at 11:46
  • This code works for PDF pages with `pdPage.getCropBox().getLowerLeftY() == 0` and `getLowerLeftX() == 0`. Any help with this old thread comment would be much appreciated: `For a generic solution you have to change this test to something that checks whether the 1x1 square transformed by the Matrix ctm = getGraphicsState().getCurrentTransformationMatrix() overlaps the character box ... ` – Orit Oct 29 '21 at 11:19
  • The location problem was solved by comparing the TextPosition to the images' bounding boxes, both in user space units. I have another issue with the following page: https://drive.google.com/file/d/14qy_GPS3dzXI-meJiCKkvqwUb59Q1yWk/view?usp=sharing I cannot get the string "ANNUAL REPORT 2018", which is printed behind the image in the top right corner, to be detected as hidden (= covered), nor the string "Destination2050" to be detected as visible (= on top of the image). Any help? – Orit Nov 15 '21 at 10:02
  • I have removed the PDF sample from my cloud folder. It is available here: https://s25.q4cdn.com/680186029/files/doc_financials/ar-interactive/2018-interactive/ar/images/Xcel_Energy-AR2018.pdf The problem is on page 14 (zero-based counting). – Orit Nov 23 '21 at 07:20
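
The test quoted in the comments (the 1x1 square transformed by the CTM, checked for overlap with the character box) can be sketched with `java.awt.geom` alone. This is only an illustration under assumptions: `ImageAreaSketch`, `imageArea` and `covers` are hypothetical names, the CTM is assumed to be available as an `AffineTransform` (PDFBox 2.x's `Matrix` appears to offer `createAffineTransform()` for this), and the character box is assumed to be expressed in the same coordinate space as the CTM. A mismatch in coordinate space would be consistent with the crop box restriction mentioned above.

```java
import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

// Hypothetical helper: the area an image occupies and an overlap test against it.
final class ImageAreaSketch {
    // An image is drawn into the unit square; the CTM maps that square
    // to the area the image occupies on the page.
    static Shape imageArea(AffineTransform ctm) {
        return ctm.createTransformedShape(new Rectangle2D.Double(0, 0, 1, 1));
    }

    // True if the transformed unit square overlaps the character box.
    // Assumes both are given in the same coordinate space.
    static boolean covers(AffineTransform ctm, Rectangle2D charBox) {
        return imageArea(ctm).intersects(charBox);
    }

    public static void main(String[] args) {
        // Scale the unit square to 200 x 100 and move its lower left corner to (300, 500).
        AffineTransform ctm = new AffineTransform(200, 0, 0, 100, 300, 500);
        System.out.println(covers(ctm, new Rectangle2D.Double(320, 520, 10, 12)));
        System.out.println(covers(ctm, new Rectangle2D.Double(10, 10, 10, 12)));
    }
}
```

Because the transformed shape is tested rather than a fixed rectangle at the origin, the check does not depend on where the image or the page boxes are located, and it also handles rotated or skewed images.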