As @Max already answered,
PDF per se does not know about "lines" or "paragraphs" etc.
In particular, the order in which text drawing instructions appear in the page content stream can be a line-by-line order, which makes text extraction and analysis easy, but it can also be a semi-random, completely non-intuitive order.
I'll flesh out his answer a bit.
The sample PDFs
Your PDFs contain examples of both options. The text bits on page 7 of "HUTTIG - ThermaTru JAN2016.pdf" are drawn in this order:
Glass & Caming Options Door Only Pricing Classic-Craft®
LE - Low - E A-Brass For Prehung Units see:
FXG-Fixed Grille C-Brushed Nickel Frame Adder and Options Pages
RG-Removable Grille D-Black Nickel Pricing Valid only when Prehung
SDL-Simulated Divide Lite
GBGF-Flat(W,B,A) Grille In Glass
GBGC-Cntr(W,B,A) Grille In Glass
W-Wrought Iron For Additional Options See Adder Page
?=Stock s=Rapid Ship
American Collection™
~~CCA210 ~~CCA210XC ~~CCA210XJ ~~CCA210XN ~~CCA210XR ~~CCA211
1 CCA210-LE CCA210XC CCA210XJ CCA210XN CCA210XR CCA211
Low-E Chord Chinchilla Granite Rainglass Homeward C D
2'8" x 6'8"
2'10" x 6'8"
3'0" x 6'8" $582.67 ? $865.22 ? $898.90 $898.90 $898.90 $923.30 ??
3'6" x 6'8"
Slab CANF3026L1L CANF3026DXC CANF3026DXJ CANF3026DXN CANF3026DXR CANF3026D1HW1C
Insert
Grille
~~CCA212 ~~CCA220 ~~CCA220XC ~~CCA220XJ ~~CCA220XN ~~CCA220XR
1 CCA212 CCA220-SDLLE CCA220XC-SDL CCA220XJ-SDL CCA220XN-SDL CCA220XR-SDL
Villager C D SDL Low-E SDL Chord SDL Chinchilla SDL Granite SDL Rainglass
2'8" x 6'8"
2'10" x 6'8"
3'0" x 6'8" $1,176.57 $641.72 ? $924.27 ? $957.94 $957.94 $957.94
3'6" x 6'8"
Slab CANF3026D1VG1C CANF3026L1L CANF3026DXC CANF3026DXJ CANF3026DXN CANF3026DXR
Insert
Grille CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24
~~CCA221 ~~CCA222 ~~CCA230 ~~CCA230XC ~~CCA230XJ ~~CCA230XN
1 CCA221-SDL CCA222-SDL CCA230-SDLLE CCA230XC-SDL CCA230XJ-SDL CCA230XN-SDL
SDL Homeward C D SDL Villager C D SDL Low-E SDL Chord SDL Chinchilla SDL Granite
2'8" x 6'8"
2'10" x 6'8"
3'0" x 6'8" $1,042.37 ? $1,258.06 $701.26 ? $983.81 ? $1,017.48 $1,017.48
3'6" x 6'8"
Slab CANF3026D2HW1C CANF3026D2VG1C CANF3026L1L CANF3026DXC CANF3026DXJ CANF3026DXN
Insert
Grille CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24
March 2016 Confidential | Huttig Building Products | Prices Subject to Change Without Notice Page 7 of 814
As you can see, the order is approximately the order in which we would read the page.
The text bits on page 7 of "Huttig - 2017 Therma-Tru Catalog.pdf", on the other hand, are drawn in this order:
Confidential | Prices Subject to Change Without Notice | Terms & Conditions: www.huttig.com/salesterms January 28,2017 Page 7 of 820
Glass & Caming Options Classic-Craft® Standard Single Unit Includes:
LE - LOW - E
FXG - Fixed Grille
RG - Removable Grille
SDL - Simulated Divide Lite
GBGF - Flat(W,B,A) Grille In Glass
GBGC - Cntr(W,B,A) Grille In Glass
W - Wrought Iron
SDLF1 - 1-1/8" SDL
SDLF2 - 3-1/2" SDL
A - Brass
C - Brushed Nickel
D - Black Nickel
XC - Chord
XJ - Chinchilla
XN - Granite
XR - Rainglass
XE - Satin Etch
For Prehung Units See:
Frame Adders and Options Pages
Pricing Valid only when Prehung
For Additional Options See Adder Page
= Rapid = Stock
American Collection™
Slab
Grille
Chinchilla
$940.38 $906.22
CANF3026DXC
$940.38
CANF3026DXN
$940.38
CANF3026DXR
$967.22
CANF3026D1HW1D
CCA212 CCA220-SDLLE CCA220XJ-SDL CCA220XC-SDL CCA220XN-SDL CCA220XR-SDL
Villager SDL Low-E SDL Chinchilla SDL Chord SDL Granite SDL Rainglass
$1,231.71 $672.46
CCALD2618V24
$1,002.35
CCALD2618V24
$968.19
CCALD2618V24
$1,002.35
CCALD2618V24
$1,002.35
CCALD2618V24
CCA221-SDL CCA222-SDL CCA230-SDLLE CCA230XJ-SDL CCA230XC-SDL CCA230XN-SDL
SDL Homeward SDL Villager SDL Low-E SDL Chinchilla SDL Chord
$1,090.68
CANF3026D2HW1D
$1,316.62 $734.44 $1,064.82 $1,030.66 $1,064.82
SDL Granite
CANF3026DXJ
CCA210-LE
Low-E
CCA210XJ CCA210XC CCA210XN CCA210XR CCA211
Chord Rainglass Homeward
CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24 CCALD2618V24
CANF3026D2VG1D CANF3026L1L CANF3026DXJ CANF3026DXC CANF3026DXN
CANF3026DXR CANF3026DXN CANF3026DXC CANF3026DXJ CANF3026L1L CANF3026D1VG1D
C D
C D C D
Granite
CANF3026L1L
$610.00 3' 0" x 6'8"
3' 0" x 6'8"
3' 0" x 6'8"
Slab
Insert
Grille
Slab
Grille
Insert
Insert
C D
Available
Available
Available
As you can see, the prices from the first table (except the left-most $610) are drawn pretty early, then stuff from the other tables, then the top CCA and the bottom CANF identifiers of the first table, then the C and D column headers of all tables, then some other entries, then the missing $610 price from the first table, then the row headers of all tables, and finally three occurrences of "Available", which are present in the content stream but invisible on the page.
"Any ideas what happened to that file?"
I can only speculate. Based on the producer properties of the PDFs, it appears that the 2016 catalog was exported directly from MS Excel 2010 while the 2017 catalog was created by Ghostscript. Furthermore, the catalogs look similar but differ enough to assume that the 2017 one was not created from the same source Excel file. It was probably produced with a completely different tool chain, with the task of producing something that looks similar to the former catalog, not something that looks identical to it, let alone something that is built identically internally.
How to deal with this
"How do I fix / read the second PDF properly?"
First of all, there is nothing to fix: For generic PDFs there is no requirement to draw text in any particular order. Thus, the second PDF is not broken (at least not in this regard) and, therefore, cannot be fixed.
To read and analyse it properly, you have to do as @Max answered: "you will have to read out the bounding box of the pieces of text and use some heuristics to determine what belongs together."
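To make "use some heuristics to determine what belongs together" a bit more concrete, here is a minimal, library-independent sketch in Python. It is not iTextSharp code, and all names in it (`Box`, `same_line`, `belongs_together`, `merge_boxes`) as well as the thresholds are made up for illustration. It assumes you have already collected a bounding box for every text bit, in PDF coordinates where y grows upwards.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical container for one text bit and its bounding box."""
    text: str
    left: float
    bottom: float
    right: float
    top: float


def same_line(a, b, min_overlap=0.5):
    """True if the boxes overlap vertically by at least min_overlap
    of the height of the shorter box."""
    overlap = min(a.top, b.top) - max(a.bottom, b.bottom)
    shorter = min(a.top - a.bottom, b.top - b.bottom)
    return shorter > 0 and overlap / shorter >= min_overlap


def belongs_together(a, b, max_gap=3.0):
    """True if b starts (almost) where a ends on the same line.
    a is expected to lie to the left of b."""
    return same_line(a, b) and abs(b.left - a.right) <= max_gap


def merge_boxes(boxes, max_gap=3.0):
    """Greedily merge text bits that belong together, regardless of
    the order in which they were drawn."""
    groups = []
    for box in sorted(boxes, key=lambda b: b.left):
        for i, group in enumerate(groups):
            if belongs_together(group, box, max_gap):
                # Extend the group: concatenate text, grow the box.
                groups[i] = Box(group.text + box.text,
                                group.left,
                                min(group.bottom, box.bottom),
                                max(group.right, box.right),
                                max(group.top, box.top))
                break
        else:
            groups.append(box)
    return groups
```

The thresholds (`min_overlap`, `max_gap`) are guesses that have to be tuned to the font sizes of the actual document. A real implementation would probably also insert a space when the gap exceeds some fraction of the font's space width.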
Unfortunately you have not posted (the pivotal parts of) your text extraction strategy. For more detailed help please post it.
Considering, though, that you were able to build a parsing procedure in C# which parses the first PDF line by line but not the second one, because the line order in the second PDF looks completely broken and differs from the visual structure, I assume your custom text extraction strategy is based on the SimpleTextExtractionStrategy insofar as it assumes the text bits to come in a line-by-line order. On top of that, it probably uses the text bit coordinates (at least the x coordinates) to determine the column the text bit appears in.
The line-by-line assumption does not always hold, in particular not in the case of your 2017 catalog. Thus, you should re-implement your strategy based on the LocationTextExtractionStrategy code, i.e. first collect all bits of text from a page including their relevant coordinates (bounding box, base line, ...), then sort these bits (top-to-bottom, left-to-right), and then execute your additional logic to identify columns.
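The collect, sort, identify-columns sequence could look roughly like the following Python sketch. It is library-independent and only illustrates the idea; it is not the LocationTextExtractionStrategy code itself. The chunk format `(x, y, text)`, the function names, the baseline tolerance, and the column boundaries are all assumptions made for this example. As before, y is assumed to grow upwards as in PDF coordinates.

```python
import bisect


def sort_into_rows(chunks, y_tol=2.0):
    """Group (x, y, text) chunks into rows by baseline (top of page
    first) and order each row left-to-right, whatever the input order."""
    rows = []
    for chunk in sorted(chunks, key=lambda c: -c[1]):
        # Same row if the baseline is close to that of the row's first chunk.
        if rows and abs(rows[-1][0][1] - chunk[1]) <= y_tol:
            rows[-1].append(chunk)
        else:
            rows.append([chunk])
    return [sorted(row, key=lambda c: c[0]) for row in rows]


def column_index(x, boundaries):
    """boundaries: ascending x positions at which columns 1, 2, ... start.
    Everything left of the first boundary is column 0."""
    return bisect.bisect_right(boundaries, x)


def to_table(chunks, boundaries, y_tol=2.0):
    """Turn unordered chunks into a list of rows with one cell per column."""
    table = []
    for row in sort_into_rows(chunks, y_tol):
        cells = [""] * (len(boundaries) + 1)
        for x, _y, text in row:
            i = column_index(x, boundaries)
            cells[i] = (cells[i] + " " + text).strip()
        table.append(cells)
    return table
```

Because the chunks are sorted by position before any table logic runs, the drawing order in the content stream no longer matters, so the same code would handle both the 2016 and the 2017 style of page. The column boundaries are hard-coded here; for real catalogs you would have to derive them, for example from the x positions of the header row.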