
Currently I am evaluating PDFs using iText. While doing so I am facing problems with the artificial bold and artificial outline styles. Can anybody please help me find out whether artificial bold style and artificial outline style in a PDF document can be checked using the iText API? Please find the sample PDF below:

https://docs.google.com/file/d/0BzaBYVk1XnP_SGRqRDBwTG8tVUE/edit?pli=1

Shreyos Adikari
  • The manner in which these artificial styles are created has already been explained in [this answer](http://stackoverflow.com/questions/20878170/how-to-determine-artificial-bold-style-artificial-italic-style-and-artificial-o/20924898#20924898) to the parallel question, which focused on using PDFBox instead of iText. – mkl Jan 17 '14 at 07:57

1 Answer


iText does allow you to recognize the artificial bold and outline styles used in your document.

How these styles are created

The document with the artificial styles was already the subject of an earlier question, "How to determine artificial bold style ,artificial italic style and artificial outline style of a text using PDFBOX"; my answer there describes in more detail how those styles are created in your document.

A short summary:

  • Artificial bold text is created by first drawing the letter in regular mode (filling the letter area) and then drawing it again in outline mode (stroking a line along the letter border), both in black, CMYK 0, 0, 0, 1; this leaves the impression of a thicker letter.

  • Artificial outline text is created by first drawing the letter in regular mode in white, CMYK 0, 0, 0, 0 (filling the letter area), and then drawing it again in outline mode in black, CMYK 0, 0, 0, 1 (stroking a line along the letter border); this leaves the impression of an outlined black-on-white letter.

How to recognize these styles

Knowing how those styles are created, you can try and recognize these creation patterns using iText. iText fortunately forwards the required information in its parsing events.

The text extraction strategies included with iText do not recognize those artificial styles out of the box: there are many different ways to create such styles, and style extraction is not a focus of iText text extraction at all. Thus, you have to create your own text extraction strategy.

You can get a first impression of the events which are forwarded to your strategy by using the following test render listener (text extraction strategies are special render listeners with an additional method for requesting the text collected by the listener; the sample listener here outputs to stdout; thus, that extra method is not needed):

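import com.itextpdf.text.BaseColor;
import com.itextpdf.text.pdf.CMYKColor;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

// Prints position, rendering mode, colors and text of every text rendering event.
class StyleAnalyzer implements RenderListener
{
    // Only text events are of interest here; the other callbacks do nothing.
    public void beginTextBlock()                        { }
    public void endTextBlock()                          { }
    public void renderImage(ImageRenderInfo renderInfo) { }

    public void renderText(TextRenderInfo renderInfo)
    {
        System.out.printf("%s - %d - %s - %s - %s\n",
            renderInfo.getBaseline().getStartPoint(),
            renderInfo.getTextRenderMode(),
            toString(renderInfo.getFillColor()),
            toString(renderInfo.getStrokeColor()),
            renderInfo.getText());
    }

    // Shows CMYK colors by their components; other colors (or null) use String.valueOf.
    String toString(BaseColor color)
    {
        if (color instanceof CMYKColor)
        {
            CMYKColor cmyk = (CMYKColor) color;
            // The decimal separator in the output follows the default locale.
            return String.format("CMYK[%3.1f %3.1f %3.1f %3.1f]",
                cmyk.getCyan(), cmyk.getMagenta(), cmyk.getYellow(), cmyk.getBlack());
        }
        return String.valueOf(color);
    }
}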

You can use it like this:

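import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;

PdfReader reader = new PdfReader("artificial text.pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);

System.out.println("start point - rendering mode - fill color - stroke color - text\n");
parser.processContent(1, new StyleAnalyzer()); // analyze page 1
reader.close();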

For the artificial bold text you get outputs like the following (slightly re-formatted) for the word "This":

start point - rendering mode - fill color - stroke color - text
 66.36,729.86,1.0 - 0 - CMYK[0,0 0,0 0,0 1,0] - null                  - T
 66.36,729.86,1.0 - 1 - CMYK[0,0 0,0 0,0 1,0] - CMYK[0,0 0,0 0,0 1,0] - T
 81.11,729.86,1.0 - 0 - CMYK[0,0 0,0 0,0 1,0] - null                  - h
 81.11,729.86,1.0 - 1 - CMYK[0,0 0,0 0,0 1,0] - CMYK[0,0 0,0 0,0 1,0] - h
 96.11,729.86,1.0 - 0 - CMYK[0,0 0,0 0,0 1,0] - null                  - i
 96.11,729.86,1.0 - 1 - CMYK[0,0 0,0 0,0 1,0] - CMYK[0,0 0,0 0,0 1,0] - i
104.86,729.86,1.0 - 0 - CMYK[0,0 0,0 0,0 1,0] - null                  - s
104.86,729.86,1.0 - 1 - CMYK[0,0 0,0 0,0 1,0] - CMYK[0,0 0,0 0,0 1,0] - s

So for each letter you get two text rendering calls, both at the same position and for the same text, the first in rendering mode 0 (fill), the second in mode 1 (stroke), the relevant color always being black in CMYK.

For the artificial outline text you get outputs like the following (slightly re-formatted) for the word "This":

     66.0,661.75,1.0 - 0 - CMYK[0,0 0,0 0,0 0,0] - null                  - T
     66.0,661.75,1.0 - 1 - CMYK[0,0 0,0 0,0 0,0] - CMYK[0,0 0,0 0,0 1,0] - T
     79.0,661.75,1.0 - 0 - CMYK[0,0 0,0 0,0 0,0] - null                  - h
     79.0,661.75,1.0 - 1 - CMYK[0,0 0,0 0,0 0,0] - CMYK[0,0 0,0 0,0 1,0] - h
     92.5,661.75,1.0 - 0 - CMYK[0,0 0,0 0,0 0,0] - null                  - i
     92.5,661.75,1.0 - 1 - CMYK[0,0 0,0 0,0 0,0] - CMYK[0,0 0,0 0,0 1,0] - i
    99.25,661.75,1.0 - 0 - CMYK[0,0 0,0 0,0 0,0] - null                  - s
99.250015,661.75,1.0 - 1 - CMYK[0,0 0,0 0,0 0,0] - CMYK[0,0 0,0 0,0 1,0] - s

So for each letter you get two text rendering calls, both at (nearly) the same position and for the same text, the first in rendering mode 0 (fill) with white in CMYK, the second in mode 1 (stroke) with black in CMYK.

Thus, in your render listener you will have to look for such text rendering call patterns.
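
The pattern check itself can be sketched independently of iText. The following is only a minimal illustration under assumptions: RenderCall is a hypothetical stand-in for the values a renderText call exposes (it is not an iText class), colors are plain CMYK component arrays, and the position tolerance of 0.01 is an arbitrary choice.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the values one renderText call exposes. */
final class RenderCall {
    final float x, y;      // start point of the baseline
    final int mode;        // text rendering mode: 0 = fill, 1 = stroke
    final float[] fill;    // CMYK fill color components, or null if unset
    final float[] stroke;  // CMYK stroke color components, or null if unset
    final String text;

    RenderCall(float x, float y, int mode, float[] fill, float[] stroke, String text) {
        this.x = x; this.y = y; this.mode = mode;
        this.fill = fill; this.stroke = stroke; this.text = text;
    }
}

final class ArtificialStyleClassifier {
    enum Style { NONE, ARTIFICIAL_BOLD, ARTIFICIAL_OUTLINE }

    static final float[] BLACK = {0f, 0f, 0f, 1f};
    static final float[] WHITE = {0f, 0f, 0f, 0f};
    static final float TOLERANCE = 0.01f; // assumed value; tune for your documents

    /** Classifies two consecutive rendering calls according to the patterns above. */
    static Style classify(RenderCall first, RenderCall second) {
        boolean sameSpot = Math.abs(first.x - second.x) < TOLERANCE
                && Math.abs(first.y - second.y) < TOLERANCE;
        boolean fillThenStroke = first.mode == 0 && second.mode == 1;
        if (!sameSpot || !fillThenStroke || !first.text.equals(second.text)
                || !Arrays.equals(second.stroke, BLACK)) {
            return Style.NONE;
        }
        if (Arrays.equals(first.fill, BLACK)) return Style.ARTIFICIAL_BOLD;
        if (Arrays.equals(first.fill, WHITE)) return Style.ARTIFICIAL_OUTLINE;
        return Style.NONE;
    }
}
```

In a real listener you would fill such objects from TextRenderInfo in renderText. Other documents may create these styles with different colors or color spaces, so the exact color comparison is specific to this sample.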

You might want to start by copying the LocationTextExtractionStrategy, extending its TextChunk helper class with rendering mode and color information, and filling those fields accordingly when creating instances in renderText.

As soon as all page events have been digested, you can glue the chunks together in a method that works similarly to LocationTextExtractionStrategy.getResultantText(TextChunkFilter). In addition to what that existing implementation does, though, you also have to check for chunks at the same (or nearly the same, see the final outline 's' above) position. If they contain the same text and their rendering modes and associated colors show the pattern from above, you have artificial bold or outline text and can treat it as you see fit.
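
The check for chunks at nearly the same position could look roughly like the sketch below. It is not the iText implementation: PositionedChunk and ChunkMerger are hypothetical names, the ordering is a simplification (lines by rounded y from top to bottom, then x from left to right, unrotated text only), and the tolerance is an assumed value. The rendering mode and color check is left out here to keep the focus on the positions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical, simplified text chunk: baseline start point plus text. */
final class PositionedChunk {
    final float x, y;
    final String text;
    boolean doubled; // set when a second chunk was drawn on top of this one

    PositionedChunk(float x, float y, String text) {
        this.x = x; this.y = y; this.text = text;
    }
}

final class ChunkMerger {
    static final float TOLERANCE = 0.01f; // assumed value

    /** Sorts chunks into reading order and folds chunks drawn twice into one. */
    static List<PositionedChunk> mergeDuplicates(List<PositionedChunk> chunks) {
        List<PositionedChunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingInt((PositionedChunk c) -> -Math.round(c.y))
                .thenComparingDouble(c -> c.x));

        List<PositionedChunk> result = new ArrayList<>();
        for (PositionedChunk chunk : sorted) {
            PositionedChunk last = result.isEmpty() ? null : result.get(result.size() - 1);
            boolean duplicate = last != null
                    && last.text.equals(chunk.text)
                    && Math.abs(last.x - chunk.x) < TOLERANCE
                    && Math.abs(last.y - chunk.y) < TOLERANCE;
            if (duplicate) {
                last.doubled = true; // candidate for artificial bold or outline
            } else {
                result.add(chunk);
            }
        }
        return result;
    }

    /** Concatenates the text of the merged chunks. */
    static String text(List<PositionedChunk> merged) {
        StringBuilder sb = new StringBuilder();
        for (PositionedChunk chunk : merged) sb.append(chunk.text);
        return sb.toString();
    }
}
```

Chunks flagged as doubled are only candidates; whether they are artificial bold or outline text still depends on their rendering modes and colors.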

BTW, while iText forwards the information required for recognizing these artificial styles, it does not directly allow access to the transformation matrix in its TextRenderInfo objects. You need this matrix, though, to recognize the artificial italics style as explained in my answer to the PDFBox-related question. As of version 5.4.5 the matrix is present in TextRenderInfo as a private member textToUserSpaceTransformMatrix. Using reflection, therefore, you can also access that member (if no security manager forbids it) and recognize artificial italics, too.
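
Reading a private member by reflection can be sketched as follows. To stay self-contained, the example uses a hypothetical stand-in class instead of TextRenderInfo, and the float array is only a placeholder for the matrix type; the helper name readPrivateField is likewise made up for illustration.

```java
import java.lang.reflect.Field;

/** Hypothetical stand-in for a class that hides its matrix in a private member. */
final class StandInRenderInfo {
    private final float[] textToUserSpaceTransformMatrix;

    StandInRenderInfo(float[] matrix) {
        this.textToUserSpaceTransformMatrix = matrix;
    }
}

final class PrivateFieldReader {
    /** Returns the value of the named field declared in the target's own class. */
    static Object readPrivateField(Object target, String fieldName) {
        try {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true); // may be refused by a security manager
            return field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read field " + fieldName, e);
        }
    }
}
```

With iText the same helper would be called with the TextRenderInfo instance and the member name textToUserSpaceTransformMatrix. Because the member is private, its name and type may differ in other iText versions.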

mkl