I am using iTextSharp to extract text from a PDF. The problem is that when a page contains tables or a form-like layout, the extracted text comes out unstructured and no longer makes sense. An example PDF page looks like this:

Sample Tax Form

The extracted text from iTextSharp is shown below

700061
04-01-17
Prepared for: Prepared by:
Filing Instructions

    JACK & JILL  ANDERSON                 WATSON ASSOC
    1234 MAIN STREET                      BENNINGTON STREET
    NEWPORT BEACH, CA  92660              STANFORD, NJ  700049

    2017 U.S. INDIVIDUAL INCOME TAX RETURN

      YOU HAVE A BALANCE DUE OF..........................$         8141

      THIS RETURN HAS BEEN PREPARED FOR ELECTRONIC FILING AND THE PRACTITIONER 
      PIN PROGRAM HAS BEEN ELECTED.  PLEASE SIGN AND RETURN FORM 8879 TO OUR 
      OFFICE.  WE WILL THEN TRANSMIT YOUR RETURN ELECTRONICALLY TO THE IRS.  DO
      NOT MAIL THE PAPER COPY OF THE RETURN TO THE IRS.  RETURN FEDERAL FORM 
      8879 TO US BY APRIL 17, 2018.
    2018 U.S. ESTIMATED INDIVIDUAL INCOME TAX

      ESTIMATED TAX VOUCHERS ARE DUE AS FOLLOWS:
      $      3000  DUE BY  APRIL 17, 2018
      $      2926  DUE BY  JUNE 15, 2018
      $      2852  DUE BY  SEPTEMBER 17, 2018
      $      2426  DUE BY  JANUARY 15, 2019

      INCLUDE YOUR SSN AND THE WORDS "2018 FORM 1040-ES" ON YOUR CHECK.

      MAIL ON OR BEFORE THE DUE DATE TO: INTERNAL REVENUE SERVICE CENTER
                                         P.O. BOX 510000
                                         SAN FRANCISCO, CA  94151-5100







    FORM 1040-V

      PAYMENT SHOULD BE SUBMITTED WITH FORM 1040-V.  INCLUDE YOUR SSN, PHONE 
      NUMBER AND THE WORDS "2017 FORM 1040" ON YOUR CHECK.  MAKE CHECK FOR 
      $8141 PAYABLE TO UNITED STATES TREASURY.

      MAIL BY APRIL 17, 2018 TO:     INTERNAL REVENUE SERVICE CENTER
                                     P.O. BOX 7704
                                     SAN FRANCISCO, CA  94120-7704

Notice that the first line of the extracted text is not 'Filing Instructions', although that is the first line of the PDF page. Also, a reader would read 'JACK & JILL ANDERSON' right after 'Prepared for:', but the extracted text has 'Prepared by:' there instead. Likewise, in the PDF we read '1234 MAIN STREET' after 'JACK & JILL ANDERSON', but in the extracted text 'WATSON ASSOC' comes next.

Is there a way to extract the text in the order in which we would read the PDF document?

The code I use to extract the text is:

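using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

PdfReader pdfReader = new PdfReader(fileName);
PdfDocument doc = new PdfDocument(pdfReader);
for (int pageNo = 1; pageNo <= doc.GetNumberOfPages(); pageNo++)
{
    PdfPage page = doc.GetPage(pageNo);
    // Emits text in the order of the drawing instructions in the content stream.
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(page, strategy);
}
doc.Close(); // release the file handle when done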
mdowes
  • The first line is "Filing Instructions" as per the screenshot of the PDF. I'm assuming you did not create the PDF, which means you are assuming the PDF will always stay the same. Your best bet would be to extract specific parts of the PDF text and use those parts to build up a text file... You will always run into formatting issues with your current approach (unless you created the PDF) – BossRoss Jan 31 '19 at 12:26
  • Probably at least this answer will explain to you why it's almost impossible to do what you want in a general way: https://stackoverflow.com/a/54459279/6663375 – Uladzimir Asipchuk Jan 31 '19 at 13:08
  • *"Is there a way to extract text the way we would read the PDF document."* - Are you sure we all would read the PDF document the same way? Maybe after seeing that the form is prepared for "JACK & JILL ANDERSON", I'm not interested in their address (like you appear to be as you read '1234 MAIN STREET' thereafter) but instead the entity the form is prepared by, i.e. 'WATSON ASSOC', so maybe I'd read like it's extracted... – mkl Jan 31 '19 at 14:29
  • What you could do is try different text extraction strategies. You currently use the `SimpleTextExtractionStrategy` (returning text in the order of the respective text drawing instructions in the content streams). You can try `LocationTextExtractionStrategy` which sorts top to bottom, left to right. Or `LayoutTextExtractionStrategy` from [this answer](https://stackoverflow.com/a/46585997/1729265) which additionally tries to add enough spaces to have the output resemble the horizontal layout of the PDF. Or yet other ones. – mkl Feb 01 '19 at 09:34

1 Answer

Short answer:

Yes, there (probably) is.

Long answer:

PDF is not like a Word document or an HTML page. PDF documents can contain structural information (indicating which glyphs make up a line of text, which lines make up a paragraph, etc.), but the spec does not oblige them to do so.

Most PDF documents you'll find in the wild actually don't contain structural information.

iText (and many other libraries as well) can use a simple heuristic: parse the rendering instructions, store them, and sort them into a 'logical reading order', which is to say top to bottom, left to right.

Of course, in documents like this one, with two blocks of text side by side, the effect is rather poor: text from both blocks ends up on the same output line.
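To make the heuristic concrete, here is a minimal, library-independent sketch of a top-to-bottom, left-to-right sort. `TextChunk`, `ReadingOrder`, and the coordinates in the usage note are hypothetical names and values chosen for illustration; they are not iText types, and real strategies handle far more cases (rotated text, varying font sizes, word spacing).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a positioned piece of text; not an iText type.
public sealed class TextChunk
{
    public TextChunk(string text, float x, float y) { Text = text; X = x; Y = y; }
    public string Text { get; }
    public float X { get; }   // left edge of the chunk
    public float Y { get; }   // baseline; in PDF coordinates y grows upward
}

public static class ReadingOrder
{
    // Groups chunks into lines by baseline (within a tolerance), orders the
    // lines top to bottom, and orders the chunks in each line left to right.
    public static List<string> ToLines(IEnumerable<TextChunk> chunks, float tolerance = 2f)
    {
        var lines = new List<List<TextChunk>>();
        foreach (var chunk in chunks.OrderByDescending(c => c.Y))
        {
            var last = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (last != null && Math.Abs(last[0].Y - chunk.Y) <= tolerance)
                last.Add(chunk);                            // same baseline: same line
            else
                lines.Add(new List<TextChunk> { chunk });   // start a new line
        }
        return lines
            .Select(line => string.Join(" ", line.OrderBy(c => c.X).Select(c => c.Text)))
            .ToList();
    }
}
```

Feeding it the two address blocks from the question, with the left block at x = 50 and the right block at x = 300, yields the lines "JACK & JILL ANDERSON WATSON ASSOC" and "1234 MAIN STREET BENNINGTON STREET". A purely positional sort has no notion of columns, so it cannot keep each block together.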

iText does allow you to select which heuristic you'd like to use. The code in the question selects SimpleTextExtractionStrategy, which does no sorting at all: it spits out the text in the order of appearance in the instruction stream (which may not be the same as reading order). LocationTextExtractionStrategy is the one that sorts by position.
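As a rough sketch of switching strategies, assuming iText 7 for .NET (which the classes in the question suggest): the `PdfTextReader` class and `ExtractPages` method are hypothetical names for illustration, while the iText calls are the ones already used in the question plus `LocationTextExtractionStrategy`.

```csharp
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class PdfTextReader
{
    // Hypothetical helper: returns the extracted text of every page,
    // sorted by position rather than by content stream order.
    public static List<string> ExtractPages(string fileName)
    {
        var pages = new List<string>();
        PdfDocument doc = new PdfDocument(new PdfReader(fileName));
        try
        {
            for (int pageNo = 1; pageNo <= doc.GetNumberOfPages(); pageNo++)
            {
                // Use a fresh strategy per page, since a strategy accumulates text.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                pages.Add(PdfTextExtractor.GetTextFromPage(doc.GetPage(pageNo), strategy));
            }
        }
        finally
        {
            doc.Close();
        }
        return pages;
    }
}
```

Whether this reads better than the original output depends on the page. For side-by-side blocks like the ones in the question, a positional sort will still put both blocks on the same line.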

As @mkl said, however, not everyone is bound to read a document the same way. It gets even more interesting (and complicated) if you think about scientific papers (footnotes, inline graphics, inline tables, etc.) or magazine articles (inline quotes or snippets).

I think you'd be better off trying a tool like pdf2Data, which is part of the iText family. It reads an input document, matches it against a template, and then spits out the information either in a JSON-like traversable data structure or simply as HTML.

That way, you could match this document against a template, and decide which information you'd like to extract first.

Joris Schellekens
  • *"iText (and many other libraries as well) use a simple heuristic. They parse the rendering instructions, store them, and sort them in 'logical reading order'. Which is to say top to bottom, left to right."* - The OP appears to use the `SimpleTextExtractionStrategy`, so no sorting, simply the order of the respective text drawing instructions in the content streams. – mkl Feb 01 '19 at 09:27
  • Duly noted, I'll change my answer accordingly. – Joris Schellekens Feb 01 '19 at 09:49