PDFs don't necessarily store text in a pattern that matches the visual representation of the text. The word "Hello" could be written as draw "Hello" at 10,10 or draw "H" at 10,10, "e" at 14,10, "l" at 18,10... It can also be draw "H" at 10,10, now draw a circle at 500,500, now show an image at 60,60, now draw "llo" at 18,10, now draw a square at 300,300, now draw "e" at 14,10.
This last one is probably close to what is actually happening in your case. The PdfTextExtractor pulls out blocks of text that are grouped together within a file. In the last case above it would return three strings in this order: "H", "llo", "e".
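To make the ordering problem concrete, here is a minimal sketch in Python. It is not iTextSharp code: the `ops` list and the `extract_in_stream_order` helper are hypothetical, and they only model the idea that an extractor walks the drawing operations in file order and ignores where each piece lands on the page.

```python
# Hypothetical model of one page's drawing operations, listed in the
# order they appear in the file (not the order you read them on screen).
ops = [
    ("text", "H", (10, 10)),
    ("circle", None, (500, 500)),
    ("image", None, (60, 60)),
    ("text", "llo", (18, 10)),
    ("square", None, (300, 300)),
    ("text", "e", (14, 10)),
]


def extract_in_stream_order(ops):
    """Return the text payloads in file order, ignoring coordinates."""
    return [payload for kind, payload, _pos in ops if kind == "text"]


chunks = extract_in_stream_order(ops)
print(chunks)           # ['H', 'llo', 'e']
print("".join(chunks))  # Hlloe
```

The page still renders as "Hello" because each chunk carries its own position; only the naive in-order read comes out scrambled.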
PDF producers that allow heavy formatting (Adobe InDesign and Illustrator are two good examples) are more likely to produce PDFs written in a non-linear fashion. Why? They honestly couldn't care less about the data within; they only care about the visual representation of the PDF. (Actually, in recent years both of those products have done a better job at producing PDFs, although they're still not perfect.)
If you want to see the internal structure of your PDF and you have Adobe Acrobat Pro, launch Preflight (it might be under Tools or Print Production). In the window that opens, click Options in the upper right corner and then Browse Internal PDF Structure. Click the puzzle icon labeled "BT" along the top. Open a given page and expand the "Contents" node. Each text entry starts with a BT and ends with an ET. Expand each one and you'll see something like (test) Tj. The parentheses mark the start/stop of the actual text to output. Compare this to what you actually expect.
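If you don't have Acrobat Pro handy, the same comparison can be roughed out in code. The Python sketch below runs on a made-up, already-decompressed content stream fragment. The `content` string, the `SHOW_TEXT` pattern, and the `shown_strings` helper are illustrative assumptions. The pattern is deliberately simplistic: it ignores escaped parentheses, TJ arrays, hex strings, and font encodings, all of which show up in real files.

```python
import re

# A made-up fragment of a page content stream, for illustration only.
# Real streams are usually compressed and must be decoded first.
content = """
BT
/F1 12 Tf
10 10 Td
(H) Tj
ET
BT
/F1 12 Tf
18 10 Td
(llo) Tj
ET
BT
/F1 12 Tf
14 10 Td
(e) Tj
ET
"""

# Matches "(text) Tj" where the text has no parentheses or backslashes.
SHOW_TEXT = re.compile(r"\(([^()\\]*)\)\s*Tj")


def shown_strings(stream):
    """Return the literal strings passed to Tj, in file order."""
    return SHOW_TEXT.findall(stream)


print(shown_strings(content))  # ['H', 'llo', 'e']
```

If the list you get back is out of reading order, the file itself is written that way and the extractor is just reporting what it finds.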
If you really, really must correct this at the iTextSharp level then you're in for some calculations. You'll need to either subclass TextExtractionStrategy or implement the ITextExtractionStrategy interface. See the documentation for those types for basic details. Basically, iTextSharp will do exactly what it was doing before, but along with the text you'll get some coordinates, and you'll have to figure out how to piece things together. You'll have to use letter proximity to decide whether a letter should be injected into an existing word or whether it starts a new word/sentence. Good luck!
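The piecing-together step might look roughly like the Python sketch below. It is independent of iTextSharp, and every name in it (`Chunk`, `rebuild_lines`, `line_tolerance`, `space_gap`) is hypothetical. It assumes you have already collected each chunk's text, left edge, baseline, and rendered width from your extraction strategy, that text runs left to right, and that y grows upward as it does in PDF user space. The thresholds are guesses you would tune per document.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    x: float      # left edge of the chunk
    y: float      # baseline of the chunk
    width: float  # rendered width of the chunk


def rebuild_lines(chunks, line_tolerance=2.0, space_gap=1.0):
    """Group chunks into lines by baseline, then join them left to right."""
    lines = []  # each entry is a list of chunks sharing a baseline
    for c in sorted(chunks, key=lambda c: -c.y):  # top of the page first
        if lines and abs(lines[-1][0].y - c.y) <= line_tolerance:
            lines[-1].append(c)
        else:
            lines.append([c])

    result = []
    for line in lines:
        line.sort(key=lambda c: c.x)
        text = line[0].text
        for prev, cur in zip(line, line[1:]):
            # A visible gap between chunks is treated as a word break.
            gap = cur.x - (prev.x + prev.width)
            text += (" " if gap > space_gap else "") + cur.text
        result.append(text)
    return result


chunks = [
    Chunk("H", 10, 10, 4),
    Chunk("llo", 18, 10, 12),
    Chunk("e", 14, 10, 4),
    Chunk("World", 34, 10, 20),
    Chunk("Top", 10, 30, 12),
]
print(rebuild_lines(chunks))  # ['Top', 'Hello World']
```

Rotated text, multiple columns, and right-to-left scripts would all need more than this, which is where the real calculations come in.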