I have two PDF files that I have to parse with iTextSharp (because we don't have the source data used to build those PDFs).

I was able to build a parsing procedure in C# which parses the first PDF line by line. The same procedure was supposed to work on the second file, but unfortunately it doesn't, because the line order in the second PDF is completely broken and differs from the visual structure.

The strange thing is that not only iTextSharp interprets the second file like that; even Acrobat Reader DC's text selection tool fails to select the information in the correct line order (when I start selecting lines, some pieces of text remain unselected and get highlighted only AFTER the caret reaches the next few rows).

Text selection in the 1st PDF (in Acrobat Reader DC): [image: text selection in the first PDF]

Text selection in the 2nd PDF (in Acrobat Reader DC): [image]

Basically every piece of information (a word or a short phrase) is placed on its own line (!). Some words/phrases are actually read as if they were at the bottom of the page, while they are clearly near the top, etc.

How do I fix / read the second PDF properly? Any ideas what happened to that file?

UPDATE

Adding links to both PDF files

Salaros
  • Why do you find it strange that two PDF readers would process the same PDF in the same way? Intuitively, it is what one might expect; it could be considered strange if they read the file in different orders. (Particularly as the order probably corresponds to the order of text in the file itself.) – rici Mar 07 '17 at 02:23

2 Answers

PDF per se does not know about "lines" or "paragraphs" etc.

If you have a structured PDF, you can use the structure to determine what belongs together.

If not, you will have to read out the bounding box of the pieces of text and use some heuristics to determine what belongs together.
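
As a rough illustration of such a heuristic, here is a minimal, library-independent sketch in Python (not iTextSharp code). The `TextPiece` type, the `group_into_lines` helper, the tolerance value and all coordinates are assumptions made up for this example; with iTextSharp the positions would have to come from whatever your render listener collects.

```python
from dataclasses import dataclass


@dataclass
class TextPiece:
    """Hypothetical container for one piece of text and its position."""
    text: str
    x: float      # left edge of the bounding box
    y: float      # baseline in PDF coordinates (y grows upwards)
    width: float


def group_into_lines(pieces, y_tolerance=2.0):
    """Group pieces whose baselines lie within y_tolerance of each other."""
    lines = []
    # Sort top-to-bottom first; the drawing order is ignored entirely.
    for piece in sorted(pieces, key=lambda p: -p.y):
        if lines and abs(lines[-1][0].y - piece.y) <= y_tolerance:
            lines[-1].append(piece)
        else:
            lines.append([piece])
    # Within each line, order the pieces left-to-right.
    return [sorted(line, key=lambda p: p.x) for line in lines]


# Pieces listed in a scrambled "drawing order"; the coordinates are invented.
pieces = [
    TextPiece("$610.00", 100, 500, 40),
    TextPiece("Slab", 20, 480, 20),
    TextPiece("3'0\" x 6'8\"", 20, 500.5, 50),
    TextPiece("CANF3026L1L", 100, 480, 60),
]
lines = group_into_lines(pieces)
for line in lines:
    print(" ".join(p.text for p in line))
```

The tolerance is the heuristic part: too small and slightly offset baselines split one visual line, too large and neighbouring rows merge.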

Max Wyss
  • That's exactly what I did with the first PDF file. I created a custom iTextSharp text extraction strategy that uses the distance between blocks to read pages as blocks (door image + rows below the images + the dimensions on the right edge). However, the second PDF is just a bunch of lines made of single words, read in random order. Just look at the images in my question: how often do you see PDFs where text selection works like that? Just Google for a PDF, open it and try selecting some text from it; it will probably allow you to select lines of text word by word in the correct order – Salaros Mar 06 '17 at 22:03
  • Did you (manually) use Alt-Drag (I think that's it on Windows; on Mac, it would be Option-Drag)? Of course, that does not help much if you have to do it programmatically. – Max Wyss Mar 06 '17 at 23:35
  • You can use mouse, Alt+mouse drag, Shift+PgDown etc, it doesn't really matter. I'm trying to say that BOTH iTextSharp AND Acrobat Reader DC are reading the 2nd PDF in the same (disordered) way. Quote: `The strange thing that not only iTextSharp is interpreting the second file like that, even Acrobat Reader DC's text selection tool fails to select the information with a correct line order` – Salaros Mar 07 '17 at 00:42
  • @Salaros you say you created a custom text extraction strategy but you don't show it. Thus, we cannot analyse why it does what it does the way it does. Adobe Reader, though, to a certain degree takes the order in which strings are drawn on a page (which can completely differ from the order you as a human perceive in the document) into account during text extraction (copy & paste). Often PDF creating programs add the drawing instructions in an order a human understands, but sometimes (especially after editing PDFs) they scramble those drawing instructions, with funny results while copying & pasting. – mkl Mar 07 '17 at 08:25

@Max already answered,

PDF per se does not know about "lines" or "paragraphs" etc.

In particular the order in which text drawing instructions appear in the page content stream can be a line-by-line order making text extraction and analysis easy but it also can be a semi-random, completely non-intuitive order.

I'll flesh out his answer a bit.

The sample PDFs

Your PDFs show examples of both options. The text bits on page 7 of "HUTTIG - ThermaTru JAN2016.pdf" are drawn in this order:

Glass & Caming Options Door Only Pricing Classic-Craft®
LE - Low - E A-Brass For Prehung Units see:
FXG-Fixed Grille C-Brushed Nickel Frame Adder and Options Pages
RG-Removable Grille D-Black Nickel Pricing Valid only when Prehung
SDL-Simulated Divide Lite
GBGF-Flat(W,B,A) Grille In Glass
GBGC-Cntr(W,B,A) Grille In Glass
W-Wrought Iron For Additional Options See Adder Page
?=Stock s=Rapid Ship
American Collection™
~~CCA210 ~~CCA210XC ~~CCA210XJ ~~CCA210XN ~~CCA210XR ~~CCA211
1 CCA210-LE CCA210XC CCA210XJ CCA210XN CCA210XR CCA211
Low-E Chord Chinchilla Granite Rainglass Homeward C D
2'8" x 6'8"
2'10" x 6'8"
3'0" x 6'8" $582.67 ? $865.22 ? $898.90 $898.90 $898.90 $923.30 ??
3'6" x 6'8"
Slab CANF3026L1L CANF3026DXC CANF3026DXJ CANF3026DXN CANF3026DXR CANF3026D1HW1C
Insert
Grille
~~CCA212 ~~CCA220 ~~CCA220XC ~~CCA220XJ ~~CCA220XN ~~CCA220XR
1 CCA212 CCA220-SDLLE CCA220XC-SDL CCA220XJ-SDL CCA220XN-SDL CCA220XR-SDL
Villager C D SDL Low-E SDL Chord SDL Chinchilla SDL Granite SDL Rainglass
2'8" x 6'8"
2'10" x 6'8"
3'0" x 6'8" $1,176.57 $641.72 ? $924.27 ? $957.94 $957.94 $957.94
3'6" x 6'8"
Slab CANF3026D1VG1C CANF3026L1L CANF3026DXC CANF3026DXJ CANF3026DXN CANF3026DXR
Insert
Grille CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24
~~CCA221 ~~CCA222 ~~CCA230 ~~CCA230XC ~~CCA230XJ ~~CCA230XN
1 CCA221-SDL CCA222-SDL CCA230-SDLLE CCA230XC-SDL CCA230XJ-SDL CCA230XN-SDL
SDL Homeward C D SDL Villager C D SDL Low-E SDL Chord SDL Chinchilla SDL Granite
2'8" x 6'8"
2'10" x 6'8"
3'0" x 6'8" $1,042.37 ? $1,258.06 $701.26 ? $983.81 ? $1,017.48 $1,017.48
3'6" x 6'8"
Slab CANF3026D2HW1C CANF3026D2VG1C CANF3026L1L CANF3026DXC CANF3026DXJ CANF3026DXN
Insert
Grille CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24
March 2016 Confidential | Huttig Building Products | Prices Subject to Change Without Notice Page 7 of 814

As you can see, the order is approximately the order in which we would read the page.

The text bits on page 7 of "Huttig - 2017 Therma-Tru Catalog.pdf", on the other hand, are drawn in this order:

Confidential | Prices Subject to Change Without Notice | Terms & Conditions: www.huttig.com/salesterms January 28,2017 Page 7 of 820
Glass & Caming Options Classic-Craft® Standard Single Unit Includes:

  LE - LOW - E
  FXG - Fixed Grille
  RG - Removable Grille
  SDL - Simulated Divide Lite
  GBGF - Flat(W,B,A) Grille In Glass
  GBGC - Cntr(W,B,A) Grille In Glass
  W - Wrought Iron
  SDLF1 - 1-1/8" SDL
  SDLF2 - 3-1/2" SDL
A - Brass
C - Brushed Nickel
D - Black Nickel
XC - Chord
XJ - Chinchilla
XN - Granite
XR - Rainglass
XE - Satin Etch
For Prehung Units See:
Frame Adders and Options Pages
Pricing Valid only when Prehung
For Additional Options See Adder Page
= Rapid = Stock
American Collection™
Slab
Grille

Chinchilla
$940.38 $906.22
CANF3026DXC
$940.38
CANF3026DXN
$940.38
CANF3026DXR
$967.22
CANF3026D1HW1D
CCA212 CCA220-SDLLE CCA220XJ-SDL CCA220XC-SDL CCA220XN-SDL CCA220XR-SDL
Villager SDL Low-E SDL Chinchilla SDL Chord SDL Granite SDL Rainglass
$1,231.71 $672.46
CCALD2618V24
$1,002.35
CCALD2618V24
$968.19
CCALD2618V24
$1,002.35
CCALD2618V24
$1,002.35
CCALD2618V24
CCA221-SDL CCA222-SDL CCA230-SDLLE CCA230XJ-SDL CCA230XC-SDL CCA230XN-SDL
SDL Homeward SDL Villager SDL Low-E SDL Chinchilla SDL Chord
$1,090.68
CANF3026D2HW1D
$1,316.62 $734.44 $1,064.82 $1,030.66 $1,064.82
SDL Granite
CANF3026DXJ
CCA210-LE
Low-E
CCA210XJ CCA210XC CCA210XN CCA210XR CCA211
Chord Rainglass Homeward
CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24
CANF3026D2VG1D CANF3026L1L CANF3026DXJ CANF3026DXC CANF3026DXN
CANF3026DXR CANF3026DXN CANF3026DXC CANF3026DXJ CANF3026L1L CANF3026D1VG1D
C  D
C  D C  D
Granite
CANF3026L1L
$610.00 3' 0" x 6'8"
3' 0" x 6'8"
3' 0" x 6'8"
Slab
Insert
Grille
Slab
Grille
Insert
     
Insert
 
C  D
 
Available
Available
Available

As you can see, the prices from the first table (except the left-most $610) are drawn fairly early, followed by content from the other tables. Then come the top CCA and the bottom CANF identifiers of the first table, the C and D column headers of all tables, some other entries, the missing $610 price from the first table, the row headers of all tables, and finally three occurrences of "Available", which are present but invisible on the page.

Any ideas what happened to that file?

I can only speculate. Based on the producer properties of the PDFs, it appears that the 2016 catalog was exported directly from MS Excel 2010 while the 2017 catalog was created by Ghostscript. Furthermore, the catalogs are similar but different enough to assume that the 2017 one was not created from the same source Excel file, but probably with a completely different tool chain whose task was to produce something that looks similar to the former catalog, not something that looks identical to it, let alone something that is built identically internally.

How to deal with this

How do I fix / read the second PDF properly?

First of all, there is nothing to fix: For generic PDFs there is no requirement to draw text in any particular order. Thus, the second PDF is not broken (at least not in this regard) and, therefore, cannot be fixed.

To read and analyse it properly, you have to do as @Max answered,

you will have to read out the bounding box of the pieces of text and use some heuristics to determine what belongs together.

Unfortunately you have not posted (the pivotal parts of) your text extraction strategy. For more detailed help please post it.

Considering, though, that you were able to build a parsing procedure in C# which parses first PDF line by line but not the second file because the line order in the second PDF is completely broken and it differs from the visual structure, I assume your custom text extraction strategy is based on the SimpleTextExtractionStrategy in so far as it assumes the text bits to come in a line-by-line order. On top of that it probably uses the text bit coordinates (at least the x coordinates) to determine the column the text bit appears in.

The line-by-line assumption does not always hold, in particular not in the case of your 2017 catalog. Thus, you should re-implement your strategy based on the LocationTextExtractionStrategy code, i.e. first collect all bits of text from a page including their relevant coordinates (bounding box, base line, ...), then sort these bits (top-to-bottom, left-to-right), and then execute your additional logic to identify columns.
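
The collect / sort / identify-columns steps can be sketched independently of iText. The following Python snippet is only an illustration under assumptions, not the LocationTextExtractionStrategy code: the `to_table` helper, the `(text, x, y)` tuples, the column start positions and the tolerance are all hypothetical, and the cell texts are borrowed from the listing above with made-up positions and pairing.

```python
from bisect import bisect_right


def to_table(chunks, column_starts, y_tolerance=2.0):
    """Turn unordered (text, x, y) chunks into tab-separated rows.

    column_starts must be a sorted list of the left edges of the columns.
    """
    rows = []  # each entry: [row_y, list of cells, one per column]
    # Step 1 and 2: take all collected chunks and sort them top-to-bottom
    # (PDF y grows upwards), ignoring the order in which they were drawn.
    for text, x, y in sorted(chunks, key=lambda c: -c[2]):
        if not rows or abs(rows[-1][0] - y) > y_tolerance:
            rows.append([y, [[] for _ in column_starts]])
        # Step 3: the column is the last one whose left edge is <= x.
        col = max(bisect_right(column_starts, x) - 1, 0)
        rows[-1][1][col].append((x, text))
    # Inside a cell, order the chunks left-to-right and join them.
    return [
        "\t".join(" ".join(t for _, t in sorted(cell)) for cell in cells)
        for _, cells in rows
    ]


# Chunks in a scrambled order, similar in spirit to the 2017 catalog.
chunks = [
    ("$940.38", 210, 400),
    ("CCA210-LE", 110, 420),
    ("3'0\" x 6'8\"", 10, 400),
    ("$610.00", 110, 400),
    ("CCA210XJ", 210, 420),
]
table = to_table(chunks, column_starts=[0, 100, 200])
print("\n".join(table))
```

Fixed column starts are the simplest choice; since the columns are spaced differently in the two catalogs, they would probably have to be derived per file (or per page) rather than hard-coded.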

mkl
  • First of all I'm already using an extraction strategy based on **LocationTextExtractionStrategy**. It overrides `RenderText(TextRenderInfo renderInfo)` method and fetches line segments and transforms them to TextChunks etc. I already tried to iterate line segments / text chunks and sort them using coordinates of those elements, but I still have to figure out how to access them. Could you please share strategy's code you used to get the results above? – Salaros Mar 07 '17 at 17:23
  • Additionally, in your answer you said `The line-by-line assumption does not always hold`, but in my case it worked pretty well: I had only 8 doors not parsed out of 800+ pages, each containing at least 12-18 elements. I used the distance between text chunks in order to append spaces or tabs (1, 2, 3, 4 or even five tabs), so the resulting data structure could be easily persisted to a CSV file (tab separated) and opened as an Excel file. – Salaros Mar 07 '17 at 17:30
  • If you indeed use an extraction strategy based on `LocationTextExtractionStrategy`, people here can tell even less without seeing it why your heuristics fail. *I used distance between text chunks in order to append spaces or tabs* - as you surely have seen the table columns are spread farther apart in the new file. Thus, you will likely at least need to adapt the criteria for tab insertion if tabs are meant to separate these table columns in the output. – mkl Mar 08 '17 at 08:17
  • @Salaros *"Could you please share strategy's code you used to get the results above?"* - that was the plain `SimpleTextExtractionStrategy` (with some extraction artifacts removed manually in the 2017 output). – mkl Mar 08 '17 at 08:20
  • @Salaros I just applied another text extraction strategy (the one described in [this answer](http://stackoverflow.com/a/24911617/1729265) with `fixedCharWidth` decreased from 6 to 4 to match the small font size in your catalogs; it is in Java / iText but should be easy to port to C# / iTextSharp) to your PDFs, and for both files the output appears like it should be easy to post-process into CSV data. You might want to look at that strategy for inspiration. – mkl Mar 08 '17 at 08:50