How to use PDFBox to extract all text on a page that is NOT behind an image?

Question

I need to extract all text on a page that is not behind an image, OCR style.

So far, I use PrintImageLocations to get the image locations and translate them from image coordinates to character coordinates. Then I use a modified version of PDFTextStripperByArea to get the text that is not inside any image location.
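The coordinate translation mentioned above could look roughly like the following. This is only a sketch under stated assumptions: the page is unrotated, its crop box starts at (0, 0), and the image's current transformation matrix is already available as a `java.awt.geom.AffineTransform`. The names `ImageRegionSketch` and `toTopLeftRegion` are made up for illustration and are not part of PDFBox.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

class ImageRegionSketch {

    /**
     * Converts an image placement into a rectangle in top-left based coordinates
     * (y == 0 at the top), the convention the stripper regions use.
     * Assumption: unrotated page whose crop box origin is (0, 0).
     */
    static Rectangle2D toTopLeftRegion(AffineTransform imageCtm, double pageHeight) {
        // A PDF image is painted into the unit square, which the CTM maps onto the page.
        Rectangle2D unitSquare = new Rectangle2D.Double(0, 0, 1, 1);
        Rectangle2D pdfBounds = imageCtm.createTransformedShape(unitSquare).getBounds2D();

        // Flip the y axis: the top edge in PDF space becomes the y origin of the region.
        double top = pageHeight - pdfBounds.getMaxY();
        return new Rectangle2D.Double(pdfBounds.getX(), top,
                pdfBounds.getWidth(), pdfBounds.getHeight());
    }

    public static void main(String[] args) {
        // Hypothetical image: 200 x 100 units, lower-left corner at (50, 600),
        // on a page that is 792 units high.
        AffineTransform ctm = new AffineTransform(200, 0, 0, 100, 50, 600);
        Rectangle2D region = toTopLeftRegion(ctm, 792);
        System.out.println(region.getX() + ", " + region.getY() + ", "
                + region.getWidth() + ", " + region.getHeight());
    }
}
```

Taking the bounding box of the transformed unit square also covers rotated or skewed images, at the cost of a region that can be larger than the image itself.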

It works, but is there a simpler, one-pass way to get the text that is not behind an image?

Here is my modified version of PDFTextStripperByArea, which also collects the text that falls outside all of the regions added:

package tester;

import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
/**
 * This will extract text from specified regions in the PDF and, separately,
 * the text that lies outside every region. Modified from PDFTextStripperByArea.
 *
 * @author Ben Litchfield
 */
public class PDFTextStripperByAreaAndExcluded_original extends PDFTextStripper
{
    private final ArrayList<List<TextPosition>> excludedCharacterList = new ArrayList<List<TextPosition>>();
    private StringWriter excludedText = new StringWriter();
    private final List<String> regions = new ArrayList<String>();
    private final Map<String, Rectangle2D> regionArea = new HashMap<String, Rectangle2D>();
    private final Map<String, ArrayList<List<TextPosition>>> regionCharacterList
            = new HashMap<String, ArrayList<List<TextPosition>>>();
    private final Map<String, StringWriter> regionText = new HashMap<String, StringWriter>();
    /**
     * Constructor.
     * @throws IOException If there is an error loading properties.
     */
    public PDFTextStripperByAreaAndExcluded_original() throws IOException
    {
        super.setShouldSeparateByBeads(false);
    }
    /**
     * This method does nothing in this derived class, because beads and regions are incompatible. Beads are
     * ignored when stripping by area.
     *
     * @param aShouldSeparateByBeads The new grouping of beads.
     */
    @Override
    public final void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
    {
    }
    /**
     * Add a new region to group text by.
     *
     * @param regionName The name of the region.
     * @param rect The rectangle area to retrieve the text from. The y-coordinates are java
     * coordinates (y == 0 is top), not PDF coordinates (y == 0 is bottom).
     */
    public void addRegion( String regionName, Rectangle2D rect )
    {
        regions.add( regionName );
        regionArea.put( regionName, rect );
    }
    /**
     * Delete a region to group text by. If the region does not exist, this method does nothing.
     *
     * @param regionName The name of the region to delete.
     */
    public void removeRegion(String regionName)
    {
        regions.remove(regionName);
        regionArea.remove(regionName);
    }
    
    /**
     * Get the list of regions that have been setup.
     *
     * @return A list of java.lang.String objects to identify the region names.
     */
    public List<String> getRegions()
    {
        return regions;
    }
    /**
     * Get the text for the region, this should be called after extractRegions().
     *
     * @param regionName The name of the region to get the text from.
     * @return The text that was identified in that region.
     */
    public String getTextForRegion( String regionName )
    {
        StringWriter text = regionText.get( regionName );
        return text.toString();
    }
    /**
     * Get the text excluded from all regions, this should be called after extractRegions().
     *
     * @return The text that was identified as not in any region.
     */
    public String getTextExcluded( )
    {
        return excludedText.toString();
    }
    /**
     * Process the page to extract the region text.
     *
     * @param page The page to extract the regions from.
     * @throws IOException If there is an error while extracting text.
     */
    public void extractRegions( PDPage page ) throws IOException
    {
        setStartPage(getCurrentPageNo());
        setEndPage(getCurrentPageNo());
        // Reset the excluded-text state so this class can be reused
        // without the list of article lists growing on every call.
        excludedCharacterList.clear();
        excludedCharacterList.add( new ArrayList<TextPosition>() );
        excludedText = new StringWriter();
        
        for (String regionName : regions)
        {
            // Reset the stored text for the region so this class
            // can be reused.
            ArrayList<List<TextPosition>> regionCharactersByArticle = new ArrayList<List<TextPosition>>();
            regionCharactersByArticle.add( new ArrayList<TextPosition>() );
            regionCharacterList.put( regionName, regionCharactersByArticle );
            regionText.put( regionName, new StringWriter() );
        }
        
        if( page.hasContents() )
        {
            processPage( page );
        }
    }
    
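    /**
     * Routes each character to every region whose rectangle contains its position,
     * or to the excluded list when no region contains it.
     */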
    @Override
    protected void processTextPosition(TextPosition text)
    {
        boolean included = false;
        
        for (Map.Entry<String, Rectangle2D> regionAreaEntry : regionArea.entrySet())
        {
            Rectangle2D rect = regionAreaEntry.getValue();
            if (rect.contains(text.getX(), text.getY()))
            {
                included = true;
                charactersByArticle = regionCharacterList.get(regionAreaEntry.getKey());
                super.processTextPosition(text);
            }
        }
        
        if(!included) {
            charactersByArticle = excludedCharacterList;
            super.processTextPosition(text);
        }
    }
    
    /**
     * Writes the text collected for each region to that region's writer, then
     * writes the excluded text to its own writer.
     *
     * @throws IOException If there is an error writing the text.
     */
    @Override
    protected void writePage() throws IOException
    {
        for (String region : regionArea.keySet())
        {
            charactersByArticle = regionCharacterList.get( region );
            output = regionText.get( region );
            super.writePage();
        }
        
        charactersByArticle = excludedCharacterList;
        output = excludedText;
        super.writePage();
    }
}

One option would be to port the solution from [this old answer](https://stackoverflow.com/a/20179928/1729265) to the current pdfbox. — mkl, Mar 12 '21 at 23:40
Thanks mkl. I'm using pdfbox 2.0. I've got most of the code converted, I think. But at the end of the Do process method you have something I don't know how to convert. Any suggestions? `context.processSubStream( context.getCurrentPage(), pdResources, formContentstream );` — Jay, Mar 15 '21 at 22:17
My first guess: See the 2.x org.apache.pdfbox.contentstream.operator.DrawObject.process(Operator, List) - depending on whether the form is a transparency group or not use `context.showTransparencyGroup` or `context.showForm`. — mkl, Mar 16 '21 at 11:01
@Jay can you please explain how you differentiate text that is above the image (visible) from text that is covered by the image (invisible)? Thanks — Orit, Oct 20 '21 at 12:38
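On telling visible text from covered text: in a PDF content stream, objects painted later are drawn on top of objects painted earlier, so one possible rule is that an image hides only the text that was shown before it. The sketch below illustrates that rule in plain Java, without PDFBox. It is not the approach from the thread, just a minimal model under several assumptions: it ignores transparency, image masks and clipping, and it tests only the glyph origin against the image bounds. `PaintOrderSketch`, `onText` and `onImage` are hypothetical names standing in for the callbacks a one-pass stripper would receive.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class PaintOrderSketch {

    /** Hypothetical stand-in for a text position, in top-left based coordinates. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    private final List<Glyph> visible = new ArrayList<>();

    /** Called for each piece of text, in content stream order. */
    void onText(String text, double x, double y) {
        visible.add(new Glyph(text, x, y));
    }

    /** Called for each image, in content stream order: drops text painted earlier under it. */
    void onImage(Rectangle2D bounds) {
        visible.removeIf(g -> bounds.contains(g.x, g.y));
    }

    /** Returns the surviving text pieces, separated by single spaces. */
    String visibleText() {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : visible) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PaintOrderSketch page = new PaintOrderSketch();
        page.onText("hidden", 60, 100);                          // painted before the image
        page.onImage(new Rectangle2D.Double(50, 92, 200, 100));  // covers the text above
        page.onText("caption", 60, 110);                         // painted after, stays visible
        page.onText("body", 60, 300);                            // outside the image
        System.out.println(page.visibleText());                  // prints "caption body"
    }
}
```

Because covered text is removed as soon as the image arrives, the page is handled in a single pass and no separate list of image regions has to be built first.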
