I'm using PDFBox's PDFTextStripper for text extraction. I also need the color information for each character, ideally inside the writeString method. I found this solution for PDFBox 1.8 (it can easily be converted to 2.0), but it only gives the character color, and I also need the background color behind each character. I added handlers for all of the fill operators (CloseFillNonZeroAndStrokePath, CloseFillEvenOddAndStrokePath, FillNonZeroAndStrokePath, FillEvenOddAndStrokePath, LegacyFillNonZeroRule, FillNonZeroRule and FillEvenOddRule, as suggested in this topic), and inside each of them I read the non-stroking color from the graphics state:
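// Inner class of my PDFTextStripper subclass; linePath, fillColor and
// deleteCharsInPath() belong to that enclosing class.
public final class FillEvenOddRule extends OperatorProcessor {

    @Override
    public void process(Operator operator, List<COSBase> operands) throws IOException {
        linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
        deleteCharsInPath();
        linePath.reset();

        // Remember the color used by this fill for the characters that follow.
        PDGraphicsState gs = getGraphicsState();
        PDColor nonStrokingColor = gs.getNonStrokingColor();
        fillColor = nonStrokingColor.toRGB();
    }

    @Override
    public String getName() {
        return "f*"; // fill path using the even-odd rule
    }
}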
Then, in processTextPosition, I read this fillColor and store it in a map for each character. My assumption was that the content stream is processed sequentially, so after a fill operator completes, every character that subsequently reaches processTextPosition should have that fillColor. However, this is not the case, and all characters end up with the wrong color. In the file I'm trying to process, every second row has a blue fill. I would like to get that blue color for each character in such a row, and white for each character in a white row. Is this possible with PDFBox?
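To make the goal concrete, here is a minimal sketch of the lookup I have in mind, written in plain Java with no PDFBox calls. It assumes each fill could be recorded as a rectangle together with its RGB color, in painting order, and that a character position is available in the same coordinate space. BackgroundLookup, recordFill and backgroundAt are hypothetical names, and how to obtain those rectangles and coordinates from PDFBox is exactly the part I am missing.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds the fill color behind a point. Not a PDFBox class. */
final class BackgroundLookup {

    /** Assumed page background when no fill covers the point. */
    public static final int WHITE = 0xFFFFFF;

    // Filled areas and their colors, stored in painting order (index i pairs up).
    private final List<Rectangle2D> areas = new ArrayList<>();
    private final List<Integer> colors = new ArrayList<>();

    /** Records a filled area; later calls are treated as painted on top of earlier ones. */
    public void recordFill(Rectangle2D area, int rgb) {
        areas.add(area);
        colors.add(rgb);
    }

    /** Returns the color of the most recently painted area containing (x, y), or WHITE. */
    public int backgroundAt(double x, double y) {
        for (int i = areas.size() - 1; i >= 0; i--) {
            if (areas.get(i).contains(x, y)) {
                return colors.get(i);
            }
        }
        return WHITE;
    }

    public static void main(String[] args) {
        BackgroundLookup lookup = new BackgroundLookup();
        // Made-up geometry: one blue row between y = 20 and y = 30.
        lookup.recordFill(new Rectangle2D.Double(0, 20, 200, 10), 0x0000FF);

        // A character in the blue row and a character in an unfilled row.
        System.out.println(Integer.toHexString(lookup.backgroundAt(50, 25)));
        System.out.println(Integer.toHexString(lookup.backgroundAt(50, 35)));
    }
}
```

The point of the sketch is that the background would be decided by where a character sits rather than by which fill operator ran last, since a fill can appear anywhere in the content stream relative to the text it ends up behind.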