I am using iTextSharp to extract text from a PDF. The problem is that when a page contains tables or a form-like layout, the extracted text loses that structure and no longer makes sense. An example PDF page is shown below.
The text extracted by iTextSharp is shown below:
700061
04-01-17
Prepared for: Prepared by:
Filing Instructions
JACK & JILL ANDERSON WATSON ASSOC
1234 MAIN STREET BENNINGTON STREET
NEWPORT BEACH, CA 92660 STANFORD, NJ 700049
2017 U.S. INDIVIDUAL INCOME TAX RETURN
YOU HAVE A BALANCE DUE OF..........................$ 8141
THIS RETURN HAS BEEN PREPARED FOR ELECTRONIC FILING AND THE PRACTITIONER
PIN PROGRAM HAS BEEN ELECTED. PLEASE SIGN AND RETURN FORM 8879 TO OUR
OFFICE. WE WILL THEN TRANSMIT YOUR RETURN ELECTRONICALLY TO THE IRS. DO
NOT MAIL THE PAPER COPY OF THE RETURN TO THE IRS. RETURN FEDERAL FORM
8879 TO US BY APRIL 17, 2018.
2018 U.S. ESTIMATED INDIVIDUAL INCOME TAX
ESTIMATED TAX VOUCHERS ARE DUE AS FOLLOWS:
$ 3000 DUE BY APRIL 17, 2018
$ 2926 DUE BY JUNE 15, 2018
$ 2852 DUE BY SEPTEMBER 17, 2018
$ 2426 DUE BY JANUARY 15, 2019
INCLUDE YOUR SSN AND THE WORDS "2018 FORM 1040-ES" ON YOUR CHECK.
MAIL ON OR BEFORE THE DUE DATE TO: INTERNAL REVENUE SERVICE CENTER
P.O. BOX 510000
SAN FRANCISCO, CA 94151-5100
FORM 1040-V
PAYMENT SHOULD BE SUBMITTED WITH FORM 1040-V. INCLUDE YOUR SSN, PHONE
NUMBER AND THE WORDS "2017 FORM 1040" ON YOUR CHECK. MAKE CHECK FOR
$8141 PAYABLE TO UNITED STATES TREASURY.
MAIL BY APRIL 17, 2018 TO: INTERNAL REVENUE SERVICE CENTER
P.O. BOX 7704
SAN FRANCISCO, CA 94120-7704
Notice that the first extracted line is not 'Filing Instructions'. Also, when reading the PDF, the line after 'Prepared for:' is 'JACK & JILL ANDERSON', but in the extracted text 'Prepared for:' is followed on the same line by 'Prepared by:'. Likewise, in the PDF '1234 MAIN STREET' comes right after 'JACK & JILL ANDERSON', but in the extracted text 'JACK & JILL ANDERSON' is followed by 'WATSON ASSOC'.
Is there a way to extract the text in the order in which we would read the PDF document?
The code I use to extract the text is:
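using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Collect the extracted text of every page; the using block closes the document.
StringBuilder text = new StringBuilder();
using (PdfDocument doc = new PdfDocument(new PdfReader(fileName)))
{
    for (int pageNo = 1; pageNo <= doc.GetNumberOfPages(); pageNo++)
    {
        PdfPage page = doc.GetPage(pageNo);
        // Use a fresh strategy per page so earlier pages are not repeated.
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        text.AppendLine(PdfTextExtractor.GetTextFromPage(page, strategy));
    }
}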