
I am using the iText Java API to extract text from a PDF.

String text = PdfTextExtractor.getTextFromPage(reader, i);

Src PDF content:

1.2 SUBMITTALS

Generated Text:

SUBMITTALS
1.2

The extracted text is split into two separate lines, and the order of the text is also wrong.

Can someone please help me understand what I am doing wrong?

Source PDF file: https://www.dropbox.com/s/vc9it3c7856ejli/testPDF.pdf?dl=0

Text file generated by iText: https://www.dropbox.com/s/ps2l9yz5ufuup01/test.txt?dl=0

When I test the same file with other PDF tools such as PDFClown and OCROnline, the text is extracted as expected.

Please help

Thanks

Samuel Huylebroeck
vdeveloper
  • reformat the code please – aleb2000 Oct 08 '16 at 12:38
  • Also, where is "Src PDF content - 1.2 SUBMITTALS Generated Text - SUBMITTALS 1.2" from? – aleb2000 Oct 08 '16 at 12:38
  • If you open the PDF from the source PDF link, you will see a section that starts with 1.2 SUBMITTALS. In the target text file, the same section has SUBMITTALS on one line and 1.2 on a separate line. In my Java code I am just using "String text = PdfTextExtractor.getTextFromPage(reader,i);" to extract the page content. – vdeveloper Oct 08 '16 at 16:26
  • By the way, I am using 5.5.7 and tried with 5.5.10 as well. Same result. – vdeveloper Oct 08 '16 at 19:58
  • I'm not really sure, but maybe the PDF converter of iText formats the text differently from other PDF APIs. – aleb2000 Oct 08 '16 at 21:05
  • Does it make a difference if I use iText 7? Has anyone experienced this issue before? This seems to be a very common problem. Experts, please help here. – vdeveloper Oct 08 '16 at 21:42

1 Answer


The cause

With its standard text extraction strategy, iText extracts the heading

[Screenshot: the "1.2 SUBMITTALS" heading as rendered in the source PDF]

as

SUBMITTALS
1.2 

because the "1.2" is actually located slightly below the "SUBMITTALS":

q .75000 0 0 .75000 0 792 cm 
1 1 1 rg 0 0 816 -1056 re f 
q .32000 0 0 .32000 0 0 cm 
q 
...
q .20823 0 0 .20807 0 0 cm 
BT /F2 220 Tf 0 g 2340 -6628 Td(SUBMITTALS) Tj ET Q
q .20823 0 0 .20807 0 0 cm 
BT /F2 220 Tf 0 g 1440 -6634 Td(1.2) Tj ET Q

As you can see in this excerpt of the content drawing instructions from the PDF, the "1.2" is drawn at the scaled y coordinate -6634 while "SUBMITTALS" is drawn at -6628, i.e. "1.2" is drawn 6 scaled units below "SUBMITTALS".

This makes iText put it onto a separate following line.
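The effect can be reproduced with a deliberately simplified model of location-based extraction. This sketch is not iText's actual implementation: the `Chunk` class, the `extract` method and its `tolerance` parameter are hypothetical names introduced only for illustration, and the coordinates are the raw `Td` operands from the excerpt above (the surrounding `cm` scaling is ignored because it applies equally to both chunks). The model sorts chunks from top to bottom, treats chunks as one line only if their y values differ by at most `tolerance`, and orders each line from left to right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of location-based text extraction; NOT iText's actual code.
class LineGroupingSketch {

    // A text chunk with the start position taken from its Td operands.
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // The two chunks from the content stream excerpt, in scaled units.
    static List<Chunk> sample() {
        return Arrays.asList(
                new Chunk("SUBMITTALS", 2340, -6628),
                new Chunk("1.2", 1440, -6634));
    }

    // Groups chunks into lines (top of page first) and joins them as text.
    // A chunk joins the current line if its y value is within `tolerance`
    // of the first chunk of that line; otherwise it starts a new line.
    static String extract(List<Chunk> chunks, double tolerance) {
        List<Chunk> byY = new ArrayList<>(chunks);
        byY.sort((a, b) -> Double.compare(b.y, a.y)); // larger y is higher up

        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : byY) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - c.y) <= tolerance) {
                last.add(c);
            } else {
                List<Chunk> line = new ArrayList<>();
                line.add(c);
                lines.add(line);
            }
        }

        StringBuilder out = new StringBuilder();
        for (List<Chunk> line : lines) {
            line.sort((a, b) -> Double.compare(a.x, b.x)); // left to right
            if (out.length() > 0) {
                out.append('\n');
            }
            for (int i = 0; i < line.size(); i++) {
                if (i > 0) {
                    out.append(' ');
                }
                out.append(line.get(i).text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(sample(), 0));  // strict: any y offset splits the line
        System.out.println("---");
        System.out.println(extract(sample(), 10)); // tolerant: 6 units count as the same line
    }
}
```

With a tolerance of 0 the model yields "SUBMITTALS" followed by "1.2" on a second line, matching the output in the question. With a tolerance of 10 scaled units it yields "1.2 SUBMITTALS". How iText itself decides whether two chunks share a line differs in detail, so this only illustrates the principle.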

A solution

You can use the HorizontalTextExtractionStrategy2 from this answer instead of the default extraction strategy, cf. TextExtraction.java test testTestPDF, and get this output:

1.2 SUBMITTALS 

(For details on the use of that strategy, confer the answer mentioned above. HorizontalTextExtractionStrategy2 is the updated strategy from the section "UPDATE: Changes in LocationTextExtractionStrategy" of that answer.)

mkl