iText Java not parsing text properly from PDF

Question

I am using iText Java API to extract text from a PDF.

String text =  PdfTextExtractor.getTextFromPage(reader,i);

Src PDF content:

1.2 SUBMITTALS

Generated Text:

SUBMITTALS
1.2

The extracted text is split into two separate lines, and the order of the two pieces is reversed.

Can someone please help me understand what I am doing wrong?

Src pdf file link - https://www.dropbox.com/s/vc9it3c7856ejli/testPDF.pdf?dl=0

Target text file generated from iText - https://www.dropbox.com/s/ps2l9yz5ufuup01/test.txt?dl=0

When I test with other PDF tools such as PDFClown or OCROnline, the text is extracted as expected.

Please help

Thanks

Also, where is "Src PDF content - 1.2 SUBMITTALS Generated Text - SUBMITTALS 1.2" from? — aleb2000, Oct 08 '16 at 12:38
If you open the PDF from the Src PDF link, you will see a section that starts with 1.2 SUBMITTALS. In the file from the Target Text File link, the same section has SUBMITTALS on one line and 1.2 on a separate line. In my Java code I am just using "String text = PdfTextExtractor.getTextFromPage(reader,i);" to extract the page content. — vdeveloper, Oct 08 '16 at 16:26
By the way, I am using 5.5.7 and tried with 5.5.10 as well. Same result. — vdeveloper, Oct 08 '16 at 19:58
I'm not really sure, but maybe iText's PDF converter formats the text differently from other PDF APIs. — aleb2000, Oct 08 '16 at 21:05
Does it make a difference if I use iText 7? Has anyone experienced this issue before? This seems to be a very common problem. Experts, please help here. — vdeveloper, Oct 08 '16 at 21:42

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

The cause

iText with its standard text extraction strategy extracts

1.2 SUBMITTALS

as

SUBMITTALS
1.2

because the "1.2" actually is located (minutely) below the "SUBMITTALS":

q .75000 0 0 .75000 0 792 cm 
1 1 1 rg 0 0 816 -1056 re f 
q .32000 0 0 .32000 0 0 cm 
q 
...
q .20823 0 0 .20807 0 0 cm 
BT /F2 220 Tf 0 g 2340 -6628 Td(SUBMITTALS) Tj ET Q
q .20823 0 0 .20807 0 0 cm 
BT /F2 220 Tf 0 g 1440 -6634 Td(1.2) Tj ET Q

As you can see in this excerpt of the content drawing instructions from the PDF, the "1.2" is drawn at the scaled y coordinate -6634 while "SUBMITTALS" is drawn at -6628, i.e. "1.2" is drawn 6 scaled units below "SUBMITTALS".

This makes iText put it onto a separate following line.
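
To get a feeling for how small that difference is, here is a rough calculation in Java. It is only a sketch: it assumes that the three cm scalings visible in the excerpt are the only transformations in effect (whatever is hidden behind the "..." is ignored), and the class and method names are made up for this illustration. Under that assumption the two baselines are only about 0.3 default user space units (1/72 inch) apart, with the text drawn at a font size of about 11.

```java
// Rough estimate of the baseline offset in default user space units.
// Assumption: only the three "cm" scalings visible in the excerpt apply;
// anything hidden behind the "..." in the excerpt is ignored.
final class OffsetEstimate {
    // Vertical scale factors of the visible "cm" operators, outermost first.
    static final double[] SCALE_Y = {0.75, 0.32, 0.20807};
    // Vertical translation of the outermost "cm" (the "0 792" part).
    static final double TRANSLATE_Y = 792.0;

    // Product of all visible vertical scale factors.
    static double effectiveScaleY() {
        double scale = 1.0;
        for (double factor : SCALE_Y) {
            scale *= factor;
        }
        return scale;
    }

    // Maps a y coordinate given to "Td" onto the page.
    static double toPageY(double scaledY) {
        return TRANSLATE_Y + scaledY * effectiveScaleY();
    }

    public static void main(String[] args) {
        double ySubmittals = toPageY(-6628);
        double yNumber = toPageY(-6634);
        System.out.printf("SUBMITTALS baseline y: %.3f%n", ySubmittals);
        System.out.printf("1.2 baseline y:        %.3f%n", yNumber);
        System.out.printf("difference:            %.3f%n", ySubmittals - yNumber);
        System.out.printf("effective font size:   %.2f%n", 220 * effectiveScaleY());
    }
}
```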

A solution

You can use the HorizontalTextExtractionStrategy2 from this answer instead of the default extraction strategy, cf. TextExtraction.java test testTestPDF, and get this output:

1.2 SUBMITTALS

(For details on the use of that strategy, confer the answer mentioned above. HorizontalTextExtractionStrategy2 is the updated strategy from the section "UPDATE: Changes in LocationTextExtractionStrategy" of that answer.)
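
To illustrate the general idea of why a small vertical offset matters, here is a minimal, self-contained Java sketch. It is neither iText's code nor the code of HorizontalTextExtractionStrategy2; it is a hypothetical, simplified model that groups text chunks into lines using a vertical tolerance. The Chunk class, the toLines method, and the tolerance values are invented for this example, and the coordinates are rounded estimates that assume only the scalings visible in the excerpt apply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified model of line grouping; NOT iText's actual implementation.
final class LineGroupingSketch {
    // Hypothetical stand-in for a text chunk: text plus baseline start point.
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Groups chunks into lines: a chunk starts a new line when it lies more
    // than 'tolerance' below the first chunk of the current line.
    static List<String> toLines(List<Chunk> chunks, double tolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Top of the page first (larger y means higher on the page).
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y));

        List<List<Chunk>> lines = new ArrayList<>();
        double lineY = 0.0;
        for (Chunk chunk : sorted) {
            if (lines.isEmpty() || lineY - chunk.y > tolerance) {
                lines.add(new ArrayList<>());
                lineY = chunk.y;
            }
            lines.get(lines.size() - 1).add(chunk);
        }

        List<String> result = new ArrayList<>();
        for (List<Chunk> line : lines) {
            // Within a line, order chunks from left to right.
            line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            StringBuilder sb = new StringBuilder();
            for (Chunk chunk : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(chunk.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Rounded, estimated page coordinates (an assumption, see above).
        List<Chunk> chunks = Arrays.asList(
                new Chunk("SUBMITTALS", 116.9, 461.02),
                new Chunk("1.2", 72.0, 460.72));

        System.out.println(toLines(chunks, 0.0)); // prints [SUBMITTALS, 1.2]
        System.out.println(toLines(chunks, 1.0)); // prints [1.2 SUBMITTALS]
    }
}
```

With no tolerance the model reproduces the split output from the question; with a tolerance of one unit both chunks end up on the same line in reading order. How the real strategies decide this differs in detail, so treat the sketch only as an intuition aid.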

Thanks a lot for the nice explanation!! — vdeveloper, Oct 09 '16 at 22:36
