
I need to extract text together with its coordinates using C#.

I am using PDFBoxNet with C#; here is my code:

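// Assumption: PDFBox 1.8.x compiled for .NET via IKVM, where the Java
// packages appear as namespaces of the same name.
using System;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

class MyTextStripper : PDFTextStripper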
{

    protected override void processTextPosition(TextPosition text)
    {
        base.processTextPosition(text);
        Console.WriteLine("X: " + text.getX() +
            " y: " + text.getY() +
            " height: " + text.getHeight() +
            " width: " + text.getWidth() +
            " word: " + text.getCharacter());

    }
}
class Program
{
    static void Main(string[] args)
    {

        ExtractTextFromPdf(@"C:\Users\Desktop\mathml88.pdf");
    }

    private static string ExtractTextFromPdf(string path)
    {
        PDDocument doc = null;
        try
        {
            doc = PDDocument.load(path);

            MyTextStripper stripper = new MyTextStripper();

            return stripper.getText(doc);
        }
        finally
        {
            if (doc != null)
            {
                doc.close();
            }
        }
    }
}

Here is the output of the program:

http://pastebin.com/JwA2YaC7

I linked the output on Pastebin because it is long.

Here is the PDF I used:

https://drive.google.com/open?id=0B45rDxvaXzsmcFo1QXhNdDBXT28

I have two questions. First, how can I tell which characters belong to one word? By comparing their x and y coordinates? Is that the correct approach?

Second, why doesn't it extract all the text? Am I missing some code? I know the equation can't be extracted as it is, but how accurate is PDFBox when it comes to extracting PDF text?

I already tried ByteScout, which can extract words and their coordinates, but I don't have a license, so I am trying PDFBox.

pdf asker
  • Would an answer with Java samples help you, too? I have only been using PDFBox/Java, not PDFBoxNet... – mkl Oct 28 '16 at 07:00
  • @mkl That may help. I will try to convert it to C# code. – pdf asker Oct 28 '16 at 08:50
  • *"and another question is. why does it doesn't extract all the text? or im missing some code? i know the equation cant be extracted as it is, but how accurate is pdfbox when i comes with extracting pdf text?"* - Which text is missing? E.g. in your example file? – mkl Oct 28 '16 at 16:27
  • Did you find a way to extract text with its location? If so, please post the code, as I am facing the same issue. – V K Jan 10 '17 at 07:51

1 Answer


PDF and words

The Portable Document Format (PDF) does not know the concept of words, or at least it does not require textual content to be clearly arranged as words.

(There is one feature, word spacing, which only works if one uses a clearly identified space glyph to separate glyph groups which make up individual words, but this feature is not used that often.)

Thus, to recognize words in PDFs one indeed has to analyze the glyphs in them and their positions.
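To illustrate what such an analysis can look like, here is a minimal sketch in plain Java that does not use PDFBox at all. The Glyph class, the groupIntoWords method, and the two thresholds (gapFactor, lineTolerance) are hypothetical names chosen for this illustration; the heuristic is a simplification and not the one PDFBox applies internally. It assumes the glyphs are already in reading order and starts a new word at a space glyph, at a change of baseline, or at a horizontal gap that is wide relative to the previous glyph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordGrouper {
    /** Hypothetical glyph data: the character plus its position and width. */
    static final class Glyph {
        final String text;
        final float x, y, width;

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Groups glyphs (assumed to be in reading order) into words.
     * A word ends at a whitespace glyph, when the baseline moves by more than
     * lineTolerance, or when the gap to the previous glyph exceeds
     * gapFactor times the previous glyph's width.
     */
    static List<String> groupIntoWords(List<Glyph> glyphs, float gapFactor, float lineTolerance) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            boolean isSpace = g.text.trim().isEmpty();
            boolean breakBefore = false;
            if (previous != null) {
                boolean newLine = Math.abs(g.y - previous.y) > lineTolerance;
                float gap = g.x - (previous.x + previous.width);
                breakBefore = newLine || gap > gapFactor * previous.width;
            }
            if ((breakBefore || isSpace) && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            if (!isSpace) {
                current.append(g.text);
            }
            previous = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up coordinates, for demonstration only.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("T", 0f, 10f, 5f),
                new Glyph("h", 5f, 10f, 5f),
                new Glyph("e", 10f, 10f, 5f),
                new Glyph("m", 18f, 10f, 8f),   // gap of 3 after "e": new word
                new Glyph("s", 26f, 10f, 4f),
                new Glyph("a", 0f, 24f, 5f));   // different baseline: new word
        System.out.println(groupIntoWords(glyphs, 0.3f, 2f));
    }
}
```

Real documents need more care than this (kerning, rotated text, glyphs drawn out of order), which is why reusing the heuristics already built into a library is usually preferable.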

PDFBox and words

The PDFTextStripper base class parses the content and reports each rendered glyph separately via the processTextPosition method. The default implementation of that method then collects these individual glyph data, with some special treatment of glyphs at the same position.

Once a whole page has been parsed, the collected data are arranged into lines (after sorting, if SortByPosition is true). The lines are then broken into words according to a number of heuristics, and each word is forwarded to writeString, which writes it into a buffer whose content is eventually returned as the extracted text.

(This is somewhat simplified but should suffice for the question at hand.)

Thus, those two methods are the main places to override with your own code.

  • One overrides processTextPosition if one
    • wants the glyph characters in the raw order of their appearance in the stream instead of sorted and arranged, or if one
    • needs to access and react to the state of the parsed stream at the moment the glyph is rendered.
  • On the other hand one overrides writeString if one is interested in the sorted and arranged glyph characters.

For some tasks one actually needs to override both, e.g. like in this answer.

Example for PDFBox & Java

A simple implementation in PDFBox & Java (in a comment the OP mentioned that this could help, too) might look like this:

String extractWordLocations(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            super.writeString(text, textPositions);

            TextPosition firstPosition = textPositions.get(0);
            TextPosition lastPosition = textPositions.get(textPositions.size() - 1);
            writeString(String.format("[%s - %s / %s]", firstPosition.getXDirAdj(), lastPosition.getXDirAdj() + lastPosition.getWidthDirAdj(), firstPosition.getYDirAdj()));
        }
    };
    stripper.setSortByPosition(true);
    return stripper.getText(document);
}

(From ExtractText.java)

Applying it like this to your example file

try (   InputStream documentStream = getClass().getResourceAsStream("mathml88.pdf" );
        PDDocument document = PDDocument.load(documentStream))
{
    String wordLocations = extractWordLocations(document);

    System.out.println("\n'mathml88.pdf', extract with word locations:");
    System.out.println(wordLocations);
    System.out.println("***********************************");
}

(ExtractText test method testExtractWordLocationsFromMathml88)

results in

88[74.34 - 85.2491 / 61.241028] Chapter[378.835 - 413.37317 / 61.241028] 3.[416.10043 - 424.28226 / 61.241028] Presentation[429.73682 - 483.67136 / 61.241028] Markup[486.39862 - 520.93677 / 61.241028]
3.4.3.3[74.34 - 104.34002 / 97.10602] Examples[120.70367 - 163.72914 / 97.10602]
The[74.34 - 91.30365 / 117.565] msubsup[93.55299 - 133.57849 / 117.565] is[135.816 - 143.09236 / 117.565] most[145.33963 - 166.55782 / 117.565] commonly[168.80508 - 215.47418 / 117.565] used[217.72145 - 237.71782 / 117.565] for[239.976 - 252.69601 / 117.565] adding[254.94328 - 284.63788 / 117.565] sub/superscript[286.88516 - 352.93976 / 117.565] pairs[355.18704 - 376.39432 / 117.565] to[378.6416 - 387.12888 / 117.565] identifiers[389.37616 - 433.0125 / 117.565] as[435.2598 - 444.34708 / 117.565] illustrated[446.60526 - 490.23068 / 117.565] above.[492.48886 - 520.9398 / 117.565]
However,[74.34 - 115.90368 / 131.11401] another[118.88187 - 151.59825 / 131.11401] important[154.56552 - 196.991 / 131.11401] use[199.96918 - 214.511 / 131.11401] is[217.48918 - 224.76555 / 131.11401] placing[227.73282 - 259.8492 / 131.11401] limits[262.8274 - 287.68918 / 131.11401] on[290.66736 - 301.57648 / 131.11401] certain[304.54376 - 334.22736 / 131.11401] large[337.20554 - 358.81644 / 131.11401] operators[361.78372 - 402.37646 / 131.11401] whose[405.35464 - 433.22742 / 131.11401] limits[436.2056 - 461.06738 / 131.11401] are[464.03467 - 477.35464 / 131.11401] tradition-[480.33282 - 520.93646 / 131.11401]
ally[74.34 - 90.70365 / 144.664] displayed[93.812744 - 135.62732 / 144.664] in[138.74731 - 147.23459 / 144.664] the[150.34369 - 163.6746 / 144.664] script[166.7837 - 191.02373 / 144.664] positions[194.14372 - 233.54736 / 144.664] even[236.65646 - 256.8165 / 144.664] when[259.9256 - 283.55472 / 144.664] rendered[286.6747 - 324.82382 / 144.664] in[327.94382 - 336.4311 / 144.664] display[339.5402 - 371.05658 / 144.664] style.[374.16568 - 397.5002 / 144.664] The[400.6202 - 417.58386 / 144.664] most[420.69296 - 441.91116 / 144.664] common[445.02026 - 483.20212 / 144.664] of[486.3221 - 495.4094 / 144.664] these[498.5185 - 520.93665 / 144.664]
is[74.34 - 81.61636 / 158.21301] the[84.343636 - 97.67456 / 158.21301] integral.[100.40183 - 136.29279 / 158.21301] For[139.02007 - 154.00917 / 158.21301] example,[156.73645 - 196.26012 / 158.21301]
?[120.703995 - 126.847725 / 193.42804] 1[131.73799 - 136.22119 / 196.88904]
ex[138.04799 - 147.33707 / 208.36603] dx[149.18999 - 160.47609 / 208.36603]
0[126.83699 - 131.32019 / 217.77405]

As you can see, an expression "[xstart - xend / y]" is attached to each word.

Putting all the information into a String is for proof-of-concept purposes only. For production use you may instead want to create a WordWithPosition class, create an instance of that class for each word in writeString and store those objects in a List the content of which you eventually retrieve from your PDFTextStripper extension.
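A minimal sketch of such a class, kept free of PDFBox dependencies so that it compiles on its own, might look like the following. The field names and the WordCollector helper are assumptions made for this illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical value class: one extracted word and its coordinates. */
final class WordWithPosition {
    final String text;
    final float xStart;
    final float xEnd;
    final float y;

    WordWithPosition(String text, float xStart, float xEnd, float y) {
        this.text = text;
        this.xStart = xStart;
        this.xEnd = xEnd;
        this.y = y;
    }

    @Override
    public String toString() {
        return text + "[" + xStart + " - " + xEnd + " / " + y + "]";
    }
}

/** Hypothetical collector that a writeString override could fill word by word. */
final class WordCollector {
    private final List<WordWithPosition> words = new ArrayList<>();

    /** Stores one word together with its start x, end x, and y coordinate. */
    void add(String text, float xStart, float xEnd, float y) {
        words.add(new WordWithPosition(text, xStart, xEnd, y));
    }

    /** Returns a read-only view of the words collected so far. */
    List<WordWithPosition> getWords() {
        return Collections.unmodifiableList(words);
    }
}
```

Inside the writeString override, instead of appending the formatted string, one would then call something like collector.add(text, firstPosition.getXDirAdj(), lastPosition.getXDirAdj() + lastPosition.getWidthDirAdj(), firstPosition.getYDirAdj()) and read collector.getWords() after getText has returned.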

mkl