We store a PDF in our database as binary data. I streamed it out, saved it as a PDF file, and tested both sources with the same result: PdfTextExtractor misspells some words.

For example, there is a word, "confirmed" in the PDF. After PdfTextExtractor converts it, it's spelled as "confrmed."

I stepped through the process in the debugger, and the word is misspelled immediately after PdfTextExtractor converts it, so I'm sure the error isn't caused by anything on my end.

Is there anything I can do to improve PdfTextExtractor's accuracy?

Here is the code I'm currently using:

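using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// myBinaryPdfData holds the PDF read from the database.
var reader = new PdfReader(myBinaryPdfData.ToArray());
var output = new StringWriter();

// iTextSharp page numbers are 1-based.
for (var i = 1; i <= reader.NumberOfPages; i++)
{
    // Use a fresh strategy for each page.
    output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy()));
}

// Keep the extracted text instead of discarding the result.
var text = output.ToString();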
StronglyTyped
  • A detail that may help: When the characters "fi" are next to each other, it creates a problem. In the PDF, I can do a ctrl+f and it will find the "fi," but when I try to highlight either the "f" or the "i" separately, it selects them as one character. In the PDF, the dot of the "i" overlaps the "f." I'm assuming this is what is causing the problem - any idea how to fix? – StronglyTyped Apr 23 '12 at 20:58
  • Chris Haas in his answer provided a perfect explanation of what could be going on. There are two more possibilities: (1) that the "fi" characters have been transformed into the "ﬁ" ligature by the PDF generating software; (2) that the PDF originates from a scanned page, was OCR-ed and the OCR didn't catch the word correctly. – Kurt Pfeifle Jul 29 '12 at 16:48

1 Answer

PDFs don't necessarily store text in a pattern that matches the visual representation of the text. The word "Hello" could be written as draw "Hello" at 10,10 or draw "H" at 10,10, "e" at 14,10, "l" at 18,10.... It can also be draw "H" at 10,10, now draw a circle at 500,500, now show an image at 60,60, now draw "llo" at 18,10, now draw a square at 300,300, now draw "e" at 14,10.

This last one is probably similar to what your case actually is. The PdfTextExtractor pulls out blocks of text that are grouped together within a file. In the last case above it would return three strings in this order: "H", "llo", "e".
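To make the ordering problem concrete, here is a minimal sketch of how chunks that arrive in file order can be put back into reading order by sorting on their x coordinate. The `TextChunk` and `ChunkOrdering` types are hypothetical and are not part of iTextSharp; the sketch also assumes all chunks sit on the same line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: one run of text plus the position where it is drawn.
public sealed class TextChunk
{
    public string Text { get; }
    public float X { get; }
    public float Y { get; }

    public TextChunk(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class ChunkOrdering
{
    // Rebuilds a single line by sorting its chunks left to right.
    // Assumes every chunk shares the same baseline (same Y).
    public static string JoinByPosition(IEnumerable<TextChunk> chunks)
    {
        return string.Concat(chunks.OrderBy(c => c.X).Select(c => c.Text));
    }
}
```

Given the chunks "H" at 10,10, "llo" at 18,10 and "e" at 14,10 in that file order, `ChunkOrdering.JoinByPosition` returns "Hello", whereas concatenating them in file order gives "Hlloe".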

PDF producers that allow heavy formatting (Adobe InDesign and Illustrator are two good examples) are more likely to produce PDFs written in a non-linear fashion. Why? They honestly couldn't care less about the data within; they only care about the visual representation of the PDF. (Actually, in recent years both of those products have done a better job of producing PDFs, although they're still not perfect.)

If you want to see the internal structure of your PDF and you have Adobe Acrobat Pro, launch Preflight (it might be under Tools or Print Production). In the window that opens, click Options in the upper right corner and then Browse Internal PDF Structure. Click the puzzle icon labeled "BT" along the top. Open a given page and expand the "Contents" node. Each text entry starts with a BT and ends with an ET. Expand each one and you'll see something like (test) Tj. The parentheses mark the start and end of the actual text to output. Compare this to what you expect.

If you really, really must correct this at the iTextSharp level then you're in for some calculations. You'll need to either subclass TextExtractionStrategy or implement the ITextExtractionStrategy interface. See those links for basic details. Basically iTextSharp will do exactly the same as it was doing before but along with the text you'll get some coordinates and you'll have to figure out how to piece things together. You'll have to figure out letter proximity to determine where a letter should be injected into a word or if the letter actually forms a new word/sentence. Good luck!
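As a rough idea of what "letter proximity" could look like, here is a minimal sketch that sorts pieces of text on one line by their left edge and inserts a space wherever the horizontal gap exceeds a threshold. `PlacedText`, `LineAssembler` and the `spaceThreshold` parameter are hypothetical names, not iTextSharp API. In a real strategy you would have to derive the positions and widths from the render information iTextSharp hands you, and the threshold would probably need tuning per font size.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical type: a piece of text with its left edge and width on one line.
public sealed class PlacedText
{
    public string Text { get; }
    public float Left { get; }
    public float Width { get; }
    public float Right => Left + Width;

    public PlacedText(string text, float left, float width)
    {
        Text = text;
        Left = left;
        Width = width;
    }
}

public static class LineAssembler
{
    // Sorts pieces left to right and inserts a space when the gap between
    // one piece's right edge and the next piece's left edge is larger
    // than spaceThreshold. Smaller gaps are treated as the same word.
    public static string Assemble(IEnumerable<PlacedText> pieces, float spaceThreshold)
    {
        var result = new StringBuilder();
        PlacedText previous = null;

        foreach (var piece in pieces.OrderBy(p => p.Left))
        {
            if (previous != null && piece.Left - previous.Right > spaceThreshold)
            {
                result.Append(' ');
            }
            result.Append(piece.Text);
            previous = piece;
        }

        return result.ToString();
    }
}
```

For example, with "llo" at left 18 and width 12, "world" at left 34 and width 20, "H" at left 10 and width 4, and "e" at left 14 and width 4, calling `LineAssembler.Assemble(pieces, 2f)` returns "Hello world": the only gap wider than 2 units is the one between "llo" and "world".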

Chris Haas