Goal: extract a value from a specific location inside a PDF page. In GemBox.Pdf, I can extract text elements including their bounds and content, but:
Problem: a text element can have a complex structure, with each glyph being positioned using individual settings.
Consider this common example of a page header:
Billing Info Date: 02/02/2022
Company Ltd. Order Number: 0123456789
123 Main Street Name: Smith, John
Let's say I want to get the order number (0123456789) from the document, knowing its precise position on the page. But in practice, often enough the entire line is one single text element with the content SO CompanyOrder Number:0123456789, and all positioning and spacing is done via offsets and indices only. I can get the bounds and text of the entire line, but I need the bounds (and value) of each character/glyph, so that I can combine them into "words" (= character sequences separated by whitespace or large offsets).
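To make the target concrete, here is a rough sketch of the grouping step I have in mind once per-glyph bounds are available. Glyph, Word, WordGrouper and maxGap are hypothetical names made up for this illustration; they are not GemBox or iTextSharp types. The sketch assumes all glyphs belong to one line and that their horizontal bounds are given in page units.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical glyph: one character plus its horizontal extent on the page.
public sealed record Glyph(string Text, double Left, double Right);

// Hypothetical word: merged text and horizontal bounds of adjacent glyphs.
public sealed record Word(string Text, double Left, double Right);

public static class WordGrouper
{
    // Groups the glyphs of a single line into words. A word ends at a
    // whitespace glyph, or when the gap to the next glyph exceeds maxGap.
    public static List<Word> Group(IEnumerable<Glyph> glyphs, double maxGap)
    {
        var words = new List<Word>();
        var text = new StringBuilder();
        double left = 0, right = 0;

        // Emits the word collected so far, if any, and resets the buffer.
        void Flush()
        {
            if (text.Length == 0) return;
            words.Add(new Word(text.ToString(), left, right));
            text.Clear();
        }

        foreach (var g in glyphs.OrderBy(g => g.Left))
        {
            // Whitespace glyphs only act as separators.
            if (string.IsNullOrWhiteSpace(g.Text)) { Flush(); continue; }

            // A large horizontal offset also starts a new word.
            if (text.Length > 0 && g.Left - right > maxGap) Flush();

            if (text.Length == 0) left = g.Left;
            text.Append(g.Text);
            right = g.Right;
        }

        Flush();
        return words;
    }
}
```

With such a list of words and their bounds, picking the value at a known position (for example the order number) becomes a simple bounds lookup. The threshold maxGap would probably need tuning per document or font size.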
I know this is definitely possible in other libraries, but this question is specific to GemBox. It seems to me that all the necessary implementations should already be there; just not much of it is exposed in the API.
In itextsharp, I can get the bounds for each single glyph like this:
// itextsharp 5.2.1.0
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class GlyphExtractionStrategy : LocationTextExtractionStrategy
{
    // Called by the parser for each text render operation on the page.
    public override void RenderText(TextRenderInfo renderInfo)
    {
        var segment = renderInfo.GetBaseline();
        var chunk = new TextChunk(
            renderInfo.GetText(),
            segment.GetStartPoint(),
            segment.GetEndPoint(),
            renderInfo.GetSingleSpaceWidth(),
            renderInfo.GetAscentLine(),
            renderInfo.GetDescentLine()
        );

        // glyph infos
        var glyph = chunk.Text;
        var left = chunk.StartLocation[0];
        var top = chunk.StartLocation[1];
        var right = chunk.EndLocation[0];
        var bottom = chunk.EndLocation[1];
    }
}

// usage: run the strategy over the first page
var reader = new PdfReader(bytes);
var strategy = new GlyphExtractionStrategy();
PdfTextExtractor.GetTextFromPage(reader, pageNumber: 1, strategy);
reader.Close();
Is this possible in GemBox? If so, that would be helpful, because we already have the code to combine the glyphs into "words".
Currently, I can somewhat work around this using regex, but this is not always possible and is also far too technical for end users to configure.