Using iTextSharp, how can I determine if a parsed chunk of text is both bolded and underlined?

Details:
I'm trying to parse PDF files in C#, looking specifically for text that is both bolded and underlined. Using iTextSharp, I can derive from LocationTextExtractionStrategy and get the text, the location, the font, etc. from the iTextSharp.text.pdf.parser.TextRenderInfo object passed to the overridden RenderText method.
However, determining whether the text is bold and/or underlined from the TextRenderInfo object has not been straightforward.

  • I tried to use TextRenderInfo.GetFont() to find the font properties, but was unsuccessful
  • I can currently determine if the text is Bold or not, by accessing the private Graphics State field on the TextRenderInfo object and checking its .Font.PostscriptFontName property for the word "Bold" (Ugly, but appears to work.)
  • Biggest issue: I haven't found anything to determine if the text is underlined. How can I determine this?

Here is my current attempt:

        // Requires: using System.Reflection; using iTextSharp.text.pdf.parser;
        // The graphics state is a private field of TextRenderInfo, so it is read via reflection.
        private static readonly FieldInfo _gsField = typeof(TextRenderInfo).GetField("gs",
            BindingFlags.NonPublic | BindingFlags.Instance);

        // Automatically called for each chunk of text in the PDF
        public override void RenderText(TextRenderInfo renderInfo)
        {
            base.RenderText(renderInfo);
            // UNDONE: Need to determine if text is underlined. How?

            // NOTE: renderInfo.GetFont().FontWeight does not contain any actual information
            var gs = (GraphicsState)_gsField.GetValue(renderInfo);
            var textChunkInfo = new TextChunkInfo(renderInfo);
            _allLocations.Add(textChunkInfo);

            // Treat the chunk as bold if its PostScript font name says so
            if (gs.Font.PostscriptFontName.Contains("Bold"))
                FoundItems.Add(new TextChunkInfo(renderInfo));

            // Record each distinct line height once
            if (!_lineHeights.Contains(textChunkInfo.LineHeight))
                _lineHeights.Add(textChunkInfo.LineHeight);
        }

Full source code of current attempt at: GitHub Repository (Two examples (example.pdf and example2.pdf) are included with text similar to what I'll be searching through.)

SvdSinner
  • Possible duplicate of [What are the ways of checking if piece of text in PDF documernt is bold using iTextSharp](https://stackoverflow.com/questions/28065269/what-are-the-ways-of-checking-if-piece-of-text-in-pdf-documernt-is-bold-using-it) – Ryanas Mar 29 '18 at 20:16
  • @Ryan that is not really a duplicate, that question and answer focuses on a specific pdf for which the bold recognition of the op failed. This question is about the more generic case. – mkl Mar 30 '18 at 12:11
  • @RyanSingh additionally, that is about detecting Bold text, which I already can do (although that link shows a better way). It does not cover underlined text, which is the part I have no answer for – SvdSinner Mar 30 '18 at 13:01

2 Answers

  • I tried to use TextRenderInfo.GetFont() to find the font properties, but was unsuccessful

  • I can currently determine if the text is Bold or not, by accessing the private Graphics State field on the TextRenderInfo object and checking its .Font.PostscriptFontName property for the word "Bold" (Ugly, but appears to work.)

I don't quite understand this differentiation. TextRenderInfo.GetFont() is exactly the same as the Font property of the private Graphics State field of TextRenderInfo.

That being said, though, this is indeed one of the major ways to determine boldness.

Bold writing in PDFs is achieved either using

  • explicitly bold fonts (which is the better way); in this case one can try to determine whether or not the fonts are bold by

    • looking at the font name: it may contain a substring "bold" or something similar;

    • looking at some optional properties of the font, e.g. font weight, but beware, they are optional...

    • inspecting the embedded font file if applicable.

    None of these methods is foolproof;

  • the same font as for non-bold text, but with special techniques applied to make it appear bold (a.k.a. poor man's bold), e.g.

    • not only filling the glyph contours but also drawing a thicker line along it for a bold impression,

    • drawing the glyph twice, the second time slightly displaced, also for a bold impression.
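
A minimal sketch of the two cheapest checks from the list above, assuming you pass in the PostScript font name and the text rendering mode of a chunk. The class and method names are made up for illustration; the values 2 and 6 are the "fill, then stroke" text rendering modes from the PDF specification. It covers neither the double-drawing variant nor inspection of an embedded font file, and like every heuristic here it can be wrong.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp.
public static class BoldHeuristics
{
    // PDF text rendering modes: 2 = fill then stroke, 6 = fill, stroke and add to clipping path.
    private const int FillThenStroke = 2;
    private const int FillThenStrokeClip = 6;

    // Heuristic 1: the PostScript font name mentions "bold" (any casing).
    public static bool NameSuggestsBold(string postscriptFontName)
    {
        if (string.IsNullOrEmpty(postscriptFontName)) return false;

        // Subset fonts carry a tag prefix such as "ABCDEF+"; strip it so the
        // random tag letters cannot accidentally match.
        int plus = postscriptFontName.IndexOf('+');
        string name = plus >= 0 ? postscriptFontName.Substring(plus + 1) : postscriptFontName;

        return name.IndexOf("bold", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Heuristic 2: "poor man's bold" by filling the glyphs and also stroking their outline.
    public static bool RenderModeSuggestsBold(int textRenderMode)
    {
        return textRenderMode == FillThenStroke || textRenderMode == FillThenStrokeClip;
    }

    public static bool LooksBold(string postscriptFontName, int textRenderMode)
    {
        return NameSuggestsBold(postscriptFontName) || RenderModeSuggestsBold(textRenderMode);
    }
}
```

Inside RenderText this might be called as `BoldHeuristics.LooksBold(renderInfo.GetFont().PostscriptFontName, renderInfo.GetTextRenderMode())`, assuming your iTextSharp version provides `GetTextRenderMode()`.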

Underlined writing in PDFs is usually achieved by explicitly drawing a line or a very thin rectangle under the text. You can try to detect such lines by implementing IExtRenderListener, parsing the page in question with it to determine the line locations, and then matching them with the text positions during text extraction. Both can also be done in a single pass, but beware: the underlines need not be drawn before the text or even shortly thereafter; the PDF producer may first draw all the text and only then draw all the underlines. Furthermore, I've also come across a funny construction: very short (e.g. 1pt) vertical lines with a very large line width (e.g. 50pt), which effectively look like horizontal lines...
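
To illustrate the matching step, here is a rough sketch that treats every drawn mark as a box and asks whether it sits just under a piece of text. Everything in it is an assumption for illustration: the `Box` type, the method names, the tolerances, and the convention that the text box's `Bottom` is the baseline. Converting a stroked segment into the box it covers also handles the short-but-very-wide vertical line mentioned above (line caps are ignored).

```csharp
using System;

// Axis-aligned box in page coordinates (y grows upwards). Hypothetical type.
public struct Box
{
    public readonly float Left, Bottom, Right, Top;

    public Box(float left, float bottom, float right, float top)
    {
        Left = left; Bottom = bottom; Right = right; Top = top;
    }

    public float Width { get { return Right - Left; } }
    public float Height { get { return Top - Bottom; } }
}

public static class UnderlineMatcher
{
    // The area covered by a stroked segment: its length along the segment,
    // its line width across it.
    public static Box FromStrokedSegment(float x1, float y1, float x2, float y2, float lineWidth)
    {
        float half = lineWidth / 2f;
        bool roughlyHorizontal = Math.Abs(y1 - y2) <= Math.Abs(x1 - x2);
        if (roughlyHorizontal)
            return new Box(Math.Min(x1, x2), Math.Min(y1, y2) - half,
                           Math.Max(x1, x2), Math.Max(y1, y2) + half);
        return new Box(Math.Min(x1, x2) - half, Math.Min(y1, y2),
                       Math.Max(x1, x2) + half, Math.Max(y1, y2));
    }

    // text.Bottom is assumed to be the baseline of the text chunk.
    public static bool IsUnderlineOf(Box text, Box mark, float maxThickness = 2f, float maxGap = 3f)
    {
        bool thinAndWide = mark.Height <= maxThickness && mark.Width > mark.Height;

        // The mark must start at or slightly below the baseline.
        bool justBelow = mark.Top <= text.Bottom && text.Bottom - mark.Top <= maxGap;

        // The mark must cover most of the text horizontally.
        float overlap = Math.Min(text.Right, mark.Right) - Math.Max(text.Left, mark.Left);
        bool covers = overlap >= 0.8f * text.Width;

        return thinAndWide && justBelow && covers;
    }
}
```

Because underlines may be drawn long after the text, the safer design is to collect all text boxes and all marks for a page first and run the comparison once the page is finished.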

IExtRenderListener extends the IRenderListener with three new methods, ModifyPath, RenderPath, and ClipPath. Whenever some path is drawn, be it a single line, a rectangle, or some very complex path, you'll first get a number of ModifyPath calls (at least one)

    /**
     * Called when the current path is being modified. E.g. new segment is being added,
     * new subpath is being started etc.
     *
     * @param renderInfo Contains information about the path segment being added to the current path.
     */
    void ModifyPath(PathConstructionRenderInfo renderInfo);

defining the lines and curves the path consists of, then at most one ClipPath call

    /**
     * Called when the current path should be set as a new clipping path.
     *
     * @param rule Either {@link PathPaintingRenderInfo#EVEN_ODD_RULE} or {@link PathPaintingRenderInfo#NONZERO_WINDING_RULE}
     */
    void ClipPath(int rule);

(if and only if the path shall serve as clip path for the following drawing operations), and finally exactly one RenderPath call

    /**
     * Called when the current path should be rendered.
     *
     * @param renderInfo Contains information about the current path which should be rendered.
     * @return The path which can be used as a new clipping path.
     */
    Path RenderPath(PathPaintingRenderInfo renderInfo);

defining how that path shall be drawn (any combination of filling its interior and stroking the path itself).

I.e. for recognizing underlines, you'll have to collect the path pieces provided via ModifyPath and decide whether they might describe one or more underlines as soon as the RenderPath call comes.
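
The collect-then-decide flow could look roughly like the sketch below. It is deliberately independent of iTextSharp so that it stands on its own: `PathOp`, `PathPiece` and `PathCollector` are hypothetical stand-ins, and you would forward the operation and operands from your ModifyPath implementation and the stroke/fill information from RenderPath. Coordinates are assumed to already be in page space, and the thickness threshold is an arbitrary guess.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the operations reported via ModifyPath (not iTextSharp's constants).
public enum PathOp { MoveTo, LineTo, Rect, Other }

public sealed class PathPiece
{
    public readonly bool IsRect;
    public readonly float X1, Y1, X2, Y2; // line: endpoints; rectangle: opposite corners

    public PathPiece(bool isRect, float x1, float y1, float x2, float y2)
    {
        IsRect = isRect; X1 = x1; Y1 = y1; X2 = x2; Y2 = y2;
    }
}

public sealed class PathCollector
{
    private const float MaxThickness = 2f;
    private readonly List<PathPiece> _current = new List<PathPiece>();
    private float _penX, _penY;
    private bool _hasPen;

    public readonly List<PathPiece> Candidates = new List<PathPiece>();

    // Forward each ModifyPath call here: remember the pieces of the current path.
    public void OnModifyPath(PathOp op, IList<float> data)
    {
        switch (op)
        {
            case PathOp.MoveTo:
                _penX = data[0]; _penY = data[1]; _hasPen = true;
                break;
            case PathOp.LineTo:
                if (_hasPen) _current.Add(new PathPiece(false, _penX, _penY, data[0], data[1]));
                _penX = data[0]; _penY = data[1]; _hasPen = true;
                break;
            case PathOp.Rect: // operands: x, y, width, height
                _current.Add(new PathPiece(true, data[0], data[1], data[0] + data[2], data[1] + data[3]));
                break;
        }
    }

    // Forward the RenderPath call here: decide which pieces may be underlines,
    // then start over, because RenderPath always ends the current path.
    public void OnRenderPath(bool stroke, bool fill)
    {
        foreach (PathPiece p in _current)
        {
            float w = Math.Abs(p.X2 - p.X1), h = Math.Abs(p.Y2 - p.Y1);
            bool painted = p.IsRect ? (stroke || fill) : stroke;
            if (painted && h <= MaxThickness && w > h) Candidates.Add(p);
        }
        _current.Clear();
        _hasPen = false;
    }
}
```

A path that is neither stroked nor filled (for example one used only for clipping) contributes no candidates. Curves and the wide-vertical-stroke construction are ignored in this sketch.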

Theoretically underlines could also be created differently, e.g. using a bitmap image, but I'm not aware of pdf producers doing so.

By the way, in your example PDF the underlines appear to be drawn consistently using a MoveTo to the line's starting point, a LineTo to its end, and then a Stroke to simply stroke the path. Thus, for each underline you'll get two ModifyPath calls (one with operation value MOVETO, one with LINETO) followed by one RenderPath call (with operation STROKE).

mkl
  • While I seem to see the underline lines in the IExtRenderListener.PaintPath method, I can't figure out where they are drawn. How can I get the location of the line from the PathPaintingRenderInfo object passed to the method? – SvdSinner Mar 30 '18 at 18:03
  • I added a short explanation of the `IExtRenderListener` interface to my answer. Essentially you first get path sections via its `ModifyPath` and the following `RenderPath` (I assume you meant that method when you wrote PaintPath) tells you how these sections are drawn. – mkl Mar 30 '18 at 20:11
  • I ran into problems with the underline algorithm with documents like Example2.pdf (link added to the question). It follows the same pattern of a MOVETO followed by a LINETO. However, the MOVETO is always to (0,0) and the LINETO is always to ({negative number}, 0). Additionally, the render info objects have a non-standard transformation matrix, where value[Matrix.I31] is close to the right margin's X coordinate and value[Matrix.I32] is between 750 and 780. Is it possible to get the absolute coordinates of the underlines? – SvdSinner Apr 03 '18 at 20:26
  • Yes. Simply multiply the MoveTo and LineTo coordinates by the current transformation matrix. – mkl Apr 04 '18 at 07:06
  • I just inspected your Example2.pdf. As mentioned above ***Underlined** writing in PDFs is usually achieved by explicitly drawing a line or a very thin rectangle under the text.* You don't get MOVETO and LINETO for the underlines in your second document because in there the underlines are created using thin rectangles, i.e. you get a `ModifyPath` call with operation `RECT`. – mkl Apr 04 '18 at 08:15
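
The coordinate conversion discussed in the comments above can be sketched as follows. The `Ctm` struct is a hypothetical stand-in holding the six relevant matrix entries; with iTextSharp's Matrix I would expect them to correspond to I11, I12, I21, I22, I31 and I32 in that order, but treat that mapping as an assumption to verify. Points are transformed as row vectors, and a RECT (x, y, width, height) is converted by transforming all four corners and taking their bounding box.

```csharp
using System;

// PDF transformation matrix [a b 0; c d 0; e f 1], applied to row vectors [x y 1].
public struct Ctm
{
    public readonly float A, B, C, D, E, F;

    public Ctm(float a, float b, float c, float d, float e, float f)
    {
        A = a; B = b; C = c; D = d; E = e; F = f;
    }

    // Returns { x', y' } in page coordinates.
    public float[] Apply(float x, float y)
    {
        return new[] { A * x + C * y + E, B * x + D * y + F };
    }
}

public static class PathCoordinates
{
    // MOVETO and LINETO each carry one point; returns { x1, y1, x2, y2 }.
    public static float[] LineToPage(Ctm ctm, float x1, float y1, float x2, float y2)
    {
        float[] p = ctm.Apply(x1, y1);
        float[] q = ctm.Apply(x2, y2);
        return new[] { p[0], p[1], q[0], q[1] };
    }

    // RECT carries x, y, width, height; returns { left, bottom, right, top }.
    public static float[] RectToPage(Ctm ctm, float x, float y, float width, float height)
    {
        float[][] corners =
        {
            ctm.Apply(x, y), ctm.Apply(x + width, y),
            ctm.Apply(x, y + height), ctm.Apply(x + width, y + height)
        };
        float left = float.MaxValue, bottom = float.MaxValue;
        float right = float.MinValue, top = float.MinValue;
        foreach (float[] c in corners)
        {
            left = Math.Min(left, c[0]); right = Math.Max(right, c[0]);
            bottom = Math.Min(bottom, c[1]); top = Math.Max(top, c[1]);
        }
        return new[] { left, bottom, right, top };
    }
}
```

With a pure translation such as `new Ctm(1, 0, 0, 1, 540, 760)` (made-up values in the range described above), a line from (0, 0) to (-100, 0) ends up running from x = 540 back to x = 440 at y = 760.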

The Docotic.Pdf library exposes font properties that return true or false for this. In C#:

    bool FONT_ITALIC = data.Font.Italic;
    bool FONT_UNDERLINE = data.Font.Underline;

Check the value of FONT_ITALIC / FONT_UNDERLINE.

I have tried this myself, but it did not always return the correct value.

Any suggestions are welcome.

pai