The OP clarified his question in a comment:
"I'm wondering how to write a parser like PdfTextExtractor or something else. I was expecting something like BaseParser or so but found nothing. So I missed my way about it."
If you are looking for something like an editing framework, you can use the PdfContentStreamEditor presented in this answer.
Based on the PdfContentStreamEditor you can edit the content stream of the PDF pages like this:
    PdfReader pdfReader = new PdfReader(resource);
    PdfStamper pdfStamper = new PdfStamper(pdfReader, result);
    PdfContentStreamEditor editor = new PdfContentStreamEditor()
    {
        @Override
        protected void write(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException
        {
            String operatorString = operator.toString();

            if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            {
                // Text is about to be drawn: if the fill color is black, switch to green first.
                if (currentlyReplacedBlack == null)
                {
                    BaseColor currentFillColor = gs().getFillColor();
                    if (BaseColor.BLACK.equals(currentFillColor))
                    {
                        currentlyReplacedBlack = currentFillColor;
                        // Note that the operand list ends with the operator itself.
                        super.write(processor, new PdfLiteral("rg"), Arrays.asList(new PdfNumber(0), new PdfNumber(1), new PdfNumber(0), new PdfLiteral("rg")));
                    }
                }
            }
            else if (currentlyReplacedBlack != null)
            {
                // First non-text instruction after replaced text: restore black
                // in the color space the original color used.
                if (currentlyReplacedBlack instanceof CMYKColor)
                {
                    super.write(processor, new PdfLiteral("k"), Arrays.asList(new PdfNumber(0), new PdfNumber(0), new PdfNumber(0), new PdfNumber(1), new PdfLiteral("k")));
                }
                else if (currentlyReplacedBlack instanceof GrayColor)
                {
                    super.write(processor, new PdfLiteral("g"), Arrays.asList(new PdfNumber(0), new PdfLiteral("g")));
                }
                else
                {
                    super.write(processor, new PdfLiteral("rg"), Arrays.asList(new PdfNumber(0), new PdfNumber(0), new PdfNumber(0), new PdfLiteral("rg")));
                }
                currentlyReplacedBlack = null;
            }

            // Finally write the original instruction unchanged.
            super.write(processor, operator, operands);
        }

        // The black color currently overridden with green, or null if none is.
        BaseColor currentlyReplacedBlack = null;

        final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    };

    // Apply the editor to every page (page numbers are 1-based).
    for (int i = 1; i <= pdfReader.getNumberOfPages(); i++)
    {
        editor.editPage(pdfStamper, i);
    }

    pdfStamper.close();
(ChangeTextColor.java, test testChangeBlackTextToGreenDocument)
In PdfContentStreamEditor, the method write is called for each instruction in the content stream and writes it back. By overriding this method and forwarding partially different instructions to the superclass write, one can edit the stream.
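To make the mechanism concrete, here is a toy model of that hook that runs without iText. It is only a sketch: ToyStreamEditor and GreenTextEditor are hypothetical names made up for illustration, instructions are plain strings rather than PdfObject instances, and the only thing borrowed from the code above is the convention that the operand list ends with the operator itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-in for the editor (hypothetical, NOT the iText or PdfContentStreamEditor API).
class ToyStreamEditor {
    private final List<String> output = new ArrayList<>();

    public static void main(String[] args) {
        List<List<String>> page = Arrays.asList(
                Arrays.asList("BT"),
                Arrays.asList("(Hi)", "Tj"),
                Arrays.asList("ET"));
        for (String line : new GreenTextEditor().edit(page)) {
            System.out.println(line);
        }
    }

    // Called once per instruction; the default implementation copies it unchanged.
    protected void write(String operator, List<String> operands) {
        output.add(String.join(" ", operands));
    }

    // Feeds every instruction through write() and returns the resulting stream lines.
    List<String> edit(List<List<String>> instructions) {
        output.clear();
        for (List<String> instruction : instructions) {
            // As in the answer's code, the last element is the operator.
            String operator = instruction.get(instruction.size() - 1);
            write(operator, instruction);
        }
        return new ArrayList<>(output);
    }
}

// Overrides the hook: emits an extra "0 1 0 rg" before each Tj, then forwards the original.
class GreenTextEditor extends ToyStreamEditor {
    @Override
    protected void write(String operator, List<String> operands) {
        if ("Tj".equals(operator)) {
            super.write("rg", Arrays.asList("0", "1", "0", "rg"));
        }
        super.write(operator, operands);
    }
}
```

Run on the three-instruction input in main, this yields the lines BT, 0 1 0 rg, (Hi) Tj and ET: every instruction passes through write, and the subclass decides what actually reaches the output.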
The editor shown above demonstrates how to change the color of text of a given color. In this case, black text is changed to green.
Beware, this is merely a proof-of-concept, not a final and complete solution. In particular:
- Text is considered to be black if for its color the expression BaseColor.BLACK.equals(color) is true; as equality among BaseColor and its descendant classes is not completely well-defined, this might lead to some false positives.
- PdfContentStreamEditor only inspects and edits the content stream of the page itself, not the content streams of displayed form xobjects or patterns; thus, some text may not be found.
Improving the class to properly detect black color and to recursively traverse and edit the content streams of used patterns and xobjects remains as an exercise for the reader.
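As a possible starting point for the first item, one could decide per color space instead of relying on equals. The following is a minimal sketch, independent of iText: BlackFillDetector and isBlackFill are hypothetical names, the color components are assumed to have already been extracted as doubles, and only the device fill operators g, rg and k are handled. Treating only 0 0 0 1 as CMYK black is an assumption of this sketch, and colors set via sc or scn in other color spaces are not covered at all.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of iText: decides whether a fill-color
// instruction selects black, based on its operator and numeric components.
final class BlackFillDetector {
    private static final double EPSILON = 1e-6;

    public static void main(String[] args) {
        System.out.println(isBlackFill("g", Arrays.asList(0.0)));
        System.out.println(isBlackFill("rg", Arrays.asList(0.0, 1.0, 0.0)));
        System.out.println(isBlackFill("k", Arrays.asList(0.0, 0.0, 0.0, 1.0)));
    }

    static boolean isBlackFill(String operator, List<Double> components) {
        switch (operator) {
            case "g":  // DeviceGray: 0 is black, 1 is white
                return components.size() == 1 && isNear(components.get(0), 0);
            case "rg": // DeviceRGB: all three components are 0
                return components.size() == 3 && allNear(components, 0);
            case "k":  // DeviceCMYK: assumption, only 0 0 0 1 counts as black
                return components.size() == 4
                        && allNear(components.subList(0, 3), 0)
                        && isNear(components.get(3), 1);
            default:   // any other operator is not treated as setting a black fill
                return false;
        }
    }

    private static boolean isNear(double value, double target) {
        return Math.abs(value - target) < EPSILON;
    }

    private static boolean allNear(List<Double> values, double target) {
        for (double value : values) {
            if (!isNear(value, target)) {
                return false;
            }
        }
        return true;
    }
}
```

In an editor like the one above, such a check would have to be fed from the color-setting instructions as they pass through write, or be rewritten against the components of the current fill color; either wiring is left open here.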