When extracting text from a PDF with iTextSharp, superscript exponents are not kept inline with the rest of the line; they come out on a separate line. How would I go about resolving this?

string text = string.Empty;
using (PdfReader reader = new PdfReader(fileLocation))
{
    ITextExtractionStrategy strategy;
    RenderFilter[] filter = new RenderFilter[1];

    // Page numbers are 1-based, so this range skips the first and the last page.
    for (int page = 2; page < reader.NumberOfPages; page++)
    {
        RectangleJ mediaBox = reader.GetPageSize(page);
        filter[0] = new RegionTextRenderFilter(new RectangleJ(mediaBox.Left, mediaBox.Bottom+60, mediaBox.Right, mediaBox.Top-140));
        strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        text += PdfTextExtractor.GetTextFromPage(reader, page, strategy) + "\n\n";
    }
}

If the line of text in the PDF is:

Example

The result after the text has been extracted is:

-4 3

2.9 x 10 m

But it should be: 2.9^-4 x 10^3 m

  • Doesn't this depend on how the PDF writer created it? Perhaps the PDF printer that was used created a table to display the exponents, causing the text to be returned in the order you observe. – CodeCaster Nov 20 '16 at 18:24
  • The text in the PDF that I am referring to is written as superscript, so shouldn't it be interpreted as one line? –  Nov 20 '16 at 18:26
  • Perhaps [How can I extract subscript / superscript properly from a PDF using iTextSharp?](http://stackoverflow.com/questions/33492792/how-can-i-extract-subscript-superscript-properly-from-a-pdf-using-itextsharp) is relevant then. – CodeCaster Nov 20 '16 at 18:28
  • @CodeCaster I will take a look - that seems like it will be helpful. –  Nov 20 '16 at 18:31
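
As a rough illustration of the idea discussed in the comments, here is a minimal sketch of how superscripts could be folded back into their line once the position of each text chunk is known. It deliberately does not call iTextSharp: the Chunk type, the SuperscriptMerger class and the coordinates are all hypothetical, and the sketch assumes the chunks passed in already belong to one visual line and that a superscript is any chunk with a smaller font size and a higher baseline than the body text. Collecting those values from the PDF is outside the scope of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical container for one piece of extracted text (not an iTextSharp type).
public sealed class Chunk
{
    public string Text { get; }
    public float X { get; }        // left edge of the chunk
    public float Baseline { get; } // y coordinate of the chunk's baseline
    public float FontSize { get; }

    public Chunk(string text, float x, float baseline, float fontSize)
    {
        Text = text;
        X = x;
        Baseline = baseline;
        FontSize = fontSize;
    }
}

public static class SuperscriptMerger
{
    // Assumes all chunks belong to the same visual line.
    public static string MergeLine(IEnumerable<Chunk> chunks)
    {
        // Reading order within a line is left to right.
        var ordered = chunks.OrderBy(c => c.X).ToList();
        if (ordered.Count == 0) return string.Empty;

        // Assumption: the largest font size on the line is the body text,
        // and its baseline is the reference baseline.
        float bodySize = ordered.Max(c => c.FontSize);
        float bodyBaseline = ordered.First(c => c.FontSize == bodySize).Baseline;

        var sb = new StringBuilder();
        foreach (var c in ordered)
        {
            // Smaller and raised above the body baseline: treat as superscript.
            bool raised = c.FontSize < bodySize && c.Baseline > bodyBaseline;
            sb.Append(raised ? "^" + c.Text : c.Text);
        }
        return sb.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up coordinates that mimic the line from the question.
        var chunks = new List<Chunk>
        {
            new Chunk("-4", 24f, 104f, 6f),
            new Chunk("3", 56f, 104f, 6f),
            new Chunk("2.9", 10f, 100f, 10f),
            new Chunk(" x 10", 32f, 100f, 10f),
            new Chunk(" m", 62f, 100f, 10f),
        };

        Console.WriteLine(SuperscriptMerger.MergeLine(chunks)); // prints "2.9^-4 x 10^3 m"
    }
}
```

Sorting by horizontal position instead of by baseline is what keeps the exponent next to its base; a strategy that first splits chunks into lines purely by baseline will put the raised chunks on a line of their own, which matches the output shown in the question.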
