Extracting text from a PDF with exponents

Question

When extracting text from a PDF with iTextSharp, the exponents (superscripts) are not kept inline with the rest of the line. How can I resolve this?

string text = string.Empty;
using (PdfReader reader = new PdfReader(fileLocation))
{
    ITextExtractionStrategy strategy;
    RenderFilter[] filter = new RenderFilter[1];

    for (int page = 2; page < reader.NumberOfPages; page++)
    {
        RectangleJ mediaBox = reader.GetPageSize(page);
        filter[0] = new RegionTextRenderFilter(new RectangleJ(mediaBox.Left, mediaBox.Bottom+60, mediaBox.Right, mediaBox.Top-140));
        strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        text += PdfTextExtractor.GetTextFromPage(reader, page, strategy) + "\n\n";
    }
}

If the line of text in the PDF is:

Example

The result after the text has been extracted is:

-4 3

2.9 x 10 m

But it should be 2.9^-4 x 10^3.

Doesn't this depend on how the PDF writer created it? Perhaps the PDF printer that was used created a table to display the exponents, causing the text to be returned in the order you observe. — CodeCaster, Nov 20 '16 at 18:24
The text in the PDF that I am referring to is written as superscript so should be interpreted as one line? — , Nov 20 '16 at 18:26
Perhaps [How can I extract subscript / superscript properly from a PDF using iTextSharp?](http://stackoverflow.com/questions/33492792/how-can-i-extract-subscript-superscript-properly-from-a-pdf-using-itextsharp) is relevant then. — CodeCaster, Nov 20 '16 at 18:28
@CodeCaster I will take a look - that seems like it will be helpful. — , Nov 20 '16 at 18:31
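
A minimal sketch of the idea behind the linked question, not a confirmed fix: a superscript is drawn on a slightly raised baseline, and a strategy that groups text strictly by baseline will likely treat it as a separate line, which would match the output above. The sketch regroups chunks whose baselines are within a tolerance and marks the raised ones with `^`. `Chunk`, `SuperscriptMerger`, `MergeLines` and `RaisedThreshold` are hypothetical names and not part of iTextSharp; in iTextSharp 5 the coordinates could probably be collected in a custom render listener from `TextRenderInfo.GetBaseline().GetStartPoint()`. It also assumes the line contains superscripts only (no subscripts).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical stand-in for a positioned piece of text taken from the page.
public sealed class Chunk
{
    public string Text { get; }
    public float X { get; }        // x coordinate where the chunk starts
    public float Baseline { get; } // y coordinate of the chunk's baseline

    public Chunk(string text, float x, float baseline)
    {
        Text = text;
        X = x;
        Baseline = baseline;
    }
}

public static class SuperscriptMerger
{
    // Assumption: a chunk more than this far above the lowest baseline
    // of its line counts as a superscript.
    private const float RaisedThreshold = 0.5f;

    // Groups chunks into lines. Two chunks share a line when their baselines
    // differ by at most `tolerance` (in PDF user-space units).
    public static List<string> MergeLines(IEnumerable<Chunk> chunks, float tolerance)
    {
        var lines = new List<string>();
        var current = new List<Chunk>();

        // PDF y coordinates grow upwards, so descending order reads top to bottom.
        foreach (var chunk in chunks.OrderByDescending(c => c.Baseline))
        {
            bool startsNewLine = current.Count > 0 &&
                current[current.Count - 1].Baseline - chunk.Baseline > tolerance;
            if (startsNewLine)
            {
                lines.Add(Render(current));
                current = new List<Chunk>();
            }
            current.Add(chunk);
        }

        if (current.Count > 0)
        {
            lines.Add(Render(current));
        }
        return lines;
    }

    // Emits one line left to right, prefixing raised chunks with '^'.
    private static string Render(List<Chunk> line)
    {
        // Superscripts only: the lowest baseline is taken as the main one.
        float mainBaseline = line.Min(c => c.Baseline);
        var sb = new StringBuilder();
        foreach (var piece in line.OrderBy(c => c.X))
        {
            if (piece.Baseline - mainBaseline > RaisedThreshold)
            {
                sb.Append('^');
            }
            sb.Append(piece.Text);
        }
        return sb.ToString();
    }
}
```

With made-up coordinates, the chunks `("2.9 x 10", x=100, y=700)`, `("-4", x=140, y=705)`, `(" m", x=150, y=700)` and `("3", x=165, y=705)` and a tolerance of `6f` come back as the single line `2.9 x 10^-4 m^3`. The tolerance has to be smaller than the line spacing, otherwise neighbouring lines get merged as well.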
