
Goal: extract a value from a specific location inside a PDF page. In GemBox.Pdf, I can extract text elements including their bounds and content, but:

Problem: a text element can have a complex structure, with each glyph being positioned using individual settings.

Consider this common example of a page header:

Billing Info                        Date:   02/02/2022

Company Ltd.                Order Number:    0123456789
123 Main Street                     Name:   Smith, John              

Let's say I want to get the order number (0123456789) from the document, knowing its precise position on the page. In practice, the entire line is often one single text element with the content SO CompanyOrder Number:0123456789, and all positioning and spacing is done via offsets and indices only. I can get the bounds and text of the entire line, but I need the bounds (and value) of each character/glyph, so that I can combine them into "words" (= character sequences separated by whitespace or large offsets).
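To illustrate the grouping step I have in mind, here is a minimal sketch. It is plain C# and uses no GemBox.Pdf or iTextSharp types: the Glyph struct, the WordGrouper class and the maxGap threshold are hypothetical names made up for this example, and it assumes the glyphs of one line are already sorted left to right.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical glyph record (not a GemBox.Pdf or iTextSharp type):
// one character plus its horizontal extent on the page.
public readonly struct Glyph
{
    public Glyph(char value, double left, double right)
    {
        Value = value; Left = left; Right = right;
    }
    public char Value { get; }
    public double Left { get; }
    public double Right { get; }
}

public static class WordGrouper
{
    // Splits the glyphs of one line (assumed sorted left to right) into words.
    // A word ends at a whitespace glyph or when the horizontal gap to the
    // next glyph is larger than maxGap.
    public static List<(string Text, double Left, double Right)> Group(
        IEnumerable<Glyph> glyphs, double maxGap)
    {
        var words = new List<(string Text, double Left, double Right)>();
        var text = new StringBuilder();
        double left = 0, right = 0;

        // Emits the word collected so far, if any.
        void Flush()
        {
            if (text.Length == 0) return;
            words.Add((text.ToString(), left, right));
            text.Clear();
        }

        foreach (var g in glyphs)
        {
            if (char.IsWhiteSpace(g.Value)) { Flush(); continue; }
            if (text.Length > 0 && g.Left - right > maxGap) Flush();
            if (text.Length == 0) left = g.Left;
            text.Append(g.Value);
            right = g.Right;
        }
        Flush();
        return words;
    }
}
```

With per-glyph bounds available, picking the order number would then come down to finding the word whose Left/Right range covers the known position.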

I know this is definitely possible in other libraries, but this question is specific to GemBox. It seems to me that all the necessary implementations should already be there; just not much of it is exposed in the API.

In iTextSharp, I can get the bounds for each single glyph like this:

// itextsharp 5.2.1.0

public class GlyphExtractionStrategy : LocationTextExtractionStrategy
{
    public override void RenderText(TextRenderInfo renderInfo)
    {
        var segment = renderInfo.GetBaseline();
        var chunk = new TextChunk(
            renderInfo.GetText(),
            segment.GetStartPoint(),
            segment.GetEndPoint(),
            renderInfo.GetSingleSpaceWidth(),
            renderInfo.GetAscentLine(),
            renderInfo.GetDescentLine()
        );
        // glyph infos
        var glyph = chunk.Text;
        var left = chunk.StartLocation[0];
        var top = chunk.StartLocation[1];
        var right = chunk.EndLocation[0];
        var bottom = chunk.EndLocation[1];
    }
}

var reader = new PdfReader(bytes);
var strategy = new GlyphExtractionStrategy();
PdfTextExtractor.GetTextFromPage(reader, pageNumber: 1, strategy);
reader.Close();

Is this possible in GemBox? If so, that would be helpful, because we already have the code to combine the glyphs into "words".

Currently, I can partly work around this using regex, but this is not always possible and is also far too technical for end users to configure.

marsze
  • Do you have any reference for how exactly this is done with iTextSharp? Perhaps something similar can be done in GemBox.Pdf as well. – Mario Z Mar 19 '22 at 01:18
  • @KJ I am aware of this. And it doesn't need to be exact. As you see in the example, the offset would be quite large, so using some simple logic and reasonable thresholds would be enough for our use cases. But for that to work, I would need all that information, which I think GemBox doesn't provide at this time(?) I am already using a similar logic to merge adjacent text elements that should belong to the same word. – marsze Mar 19 '22 at 09:07
  • @marsze you have the location and the width of all text elements and also the font name and size, so the information is available, right? But you mentioned that iTextSharp has this feature; can you perhaps send a link that demonstrates it? – Mario Z Mar 19 '22 at 09:39
  • @MarioZ I added an example. Basically, I can get the bounds of each single glyph. That information should already be available in GemBox internally. If I could get access to that somehow, that would already be helpful. (I changed the question title to that.) – marsze Mar 19 '22 at 11:32

1 Answer

Try the latest NuGet package, in which we added the PdfTextContent.GetGlyphOffsets method:

Install-Package GemBox.Pdf -Version 17.0.1128-hotfix

Here is how you can use it:

using (var document = PdfDocument.Load("input.pdf"))
{
    var page = document.Pages[0];
    var enumerator = page.Content.Elements.All(page.Transform).GetEnumerator();

    while (enumerator.MoveNext())
    {
        if (enumerator.Current.ElementType != PdfContentElementType.Text)
            continue;

        var textElement = (PdfTextContent)enumerator.Current;
        var text = textElement.ToString();

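        // Skip text elements that do not contain the label.
        int index = text.IndexOf("Number:");
        if (index < 0)
            continue;

        // Move past the label and any spaces that follow it,
        // so that index points at the first character of the value.
        index += "Number:".Length;
        for (int i = index; i < text.Length; i++)
        {
            if (text[i] == ' ')
                index++;
            else
                break;
        }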

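        // Bring the element's bounds into page coordinates.
        var bounds = textElement.Bounds;
        enumerator.Transform.Transform(ref bounds);

        // The value itself, and the horizontal position where it starts.
        // Skip and First are LINQ extension methods (using System.Linq).
        string orderNumber = text.Substring(index);
        double position = bounds.Left + textElement.GetGlyphOffsets().Skip(index - 1).First();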

        // TODO ...
    }
}
Mario Z
  • Thanks, this was helpful. I know there is more to rendering the glyphs than just the offsets, so getting the actual bounds per glyph is not always 100% accurate, but for now, this should do for most cases. Looking forward to seeing more support for this in future versions. – marsze Mar 25 '22 at 16:56
  • I feel like `PdfTextContent.EncodedText` (`PdfEncodedContentString`) should contain the information I need, but it has no useful public members - not sure what's the purpose of that property? `PdfTextContent.ToString()` doesn't always represent the glyphs exactly (e.g. sometimes there are extra spaces). Also, calculating the bounds is usually trickier than in your example, because the offsets must be applied before any transforms, but I think I figured it out. – marsze Mar 27 '22 at 19:46
  • @marsze note that one of my colleagues will investigate this further when he's back (in a week or two) and if needed he'll make a different API for you. I hope that works for you. – Mario Z Mar 28 '22 at 08:52
  • Thanks. I appreciate the quick responses very much. For now, this will work in 90% of our use-cases. And I'm looking forward to more solid support in future versions. It's one of the primary tasks we use this library for. – marsze Mar 28 '22 at 18:54
  • @marsze we have investigated this requirement further and concluded that it will take quite some time to provide a nicer solution, so we will leave this for now. But note that we'll introduce a better API when we start working on this feature: https://support.gemboxsoftware.com/community/view/support-removal-redaction-of-information – Mario Z Apr 05 '22 at 09:03
  • Noted. I'll be watching the feature request. If there's any way I can help, feel free to reach out. – marsze Apr 05 '22 at 11:23
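
The point raised in the comments, that glyph offsets must be applied before any transform, can be sketched roughly as follows. This is plain C# with no GemBox.Pdf types; GlyphBaselines, Apply and Compute are hypothetical names. It assumes that each offset is the distance from the element's untransformed left edge to the end of the corresponding glyph, which is inferred from the Skip(index - 1) call in the answer and not confirmed by the library's documentation. The transform is taken as a PDF-style affine matrix [a b c d e f].

```csharp
using System;
using System.Collections.Generic;

public static class GlyphBaselines
{
    // Maps a point through a PDF-style affine matrix [a b c d e f]:
    // x' = a*x + c*y + e,  y' = b*x + d*y + f.
    public static (double X, double Y) Apply(double[] m, double x, double y) =>
        (m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]);

    // Assumption: offsets[i] is the distance from the element's untransformed
    // left edge to the END of glyph i, so glyph i starts where glyph i - 1
    // ends. Each glyph's start and end are computed in untransformed space
    // first and only then mapped through the matrix.
    public static List<((double X, double Y) Start, (double X, double Y) End)> Compute(
        double left, double baselineY, IEnumerable<double> offsets, double[] matrix)
    {
        var result = new List<((double X, double Y) Start, (double X, double Y) End)>();
        double start = 0;
        foreach (var end in offsets)
        {
            result.Add((Apply(matrix, left + start, baselineY),
                        Apply(matrix, left + end, baselineY)));
            start = end;
        }
        return result;
    }
}
```

Adding an offset to an already transformed Bounds.Left only gives the right position when the transform has no rotation or scaling, which is why the order of the two steps matters for rotated or scaled text.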