Using iTextSharp, trying to extract text from a PDF gives non-readable data

Question

Okay, I'm trying to extract text from a PDF file using iTextSharp... that's all I want. However, when I extract the text, it's giving me garbage instead of text.

Here's the code I'm using...

List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    String strPage = PdfTextExtractor.GetTextFromPage(reader, page, its);

    strPage = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default,
              Encoding.UTF8, Encoding.Default.GetBytes(strPage)));

    pdfText.Add(strPage);
}

I then save that data to a text file, but instead of readable text, I get text that looks like binary data... non-printable characters all over the place. I'd post an image of what I see, but it won't let me. Sorry about that.

I have tried without the encoding attempt, and it didn't work any better... still binary-looking data (viewed in Notepad), though I'm not certain it's identical to that produced with the encoding attempt.

Any idea what is happening and how to fix it?

See if this helps for starters http://stackoverflow.com/a/10191879/231316 — Chris Haas, Apr 24 '14 at 18:36
Please supply the PDF file in question. Some PDF files don't contain the information on how to translate the glyph identifiers to Unicode, some actually even try to mislead. — mkl, Apr 25 '14 at 07:19
Chris, I'm not sure that's the problem, as I've tried the code without the encoding line, and the problem persists. I'll try it again, though, just to cover all bases. — user3313540, Apr 25 '14 at 13:19
mkl, if the PDF employs a custom encoding method, how can Adobe display it properly unless the PDF file contains the information needed to handle it? Anyway, the PDF file I'm trying to parse can be found at the following URL: http://fileshare.homestead.com/files/share/9240a920-f0eb-479f-b186-88fe7bcf4337.pdf — user3313540, Apr 25 '14 at 13:20
And here is an example of the garbled text that this method extracts: ++ ' '( ())$$$$* ** ** *+ + +$ +$ $$ $$ + + ( (,- ,- ../ /% 012 % 012 / /3&2 3&2 ../ /#%2 — user3313540, Apr 25 '14 at 13:32
*if the PDF employs a custom encoding method, how can Adobe display it properly unless the PDF file contains the information needed to handle it?* - PDFs can contain embedded fonts. If they do, all information on the 'meaning' of the encoding can be left out; all the PDF needs to contain is a set of graphical rendering instructions for each byte of the encoding used. It does *not* need to record which Unicode character each drawing represents. — mkl, Apr 25 '14 at 13:35
Okay, as best I can tell, the PDF file doesn't contain text; it contains font images of each character. So I can't do a text extraction; I need to do an OCR extraction. — user3313540, Apr 25 '14 at 15:17

score 0 · Answer 1 · answered Apr 25 '14 at 13:27

Please open the document in Adobe Reader, then try to copy/paste part of the text.

If you do this with the first page, you'll get:

The following policy (L30304) has been archived by Alpha II. Many policies are part of a larger jurisdiction, than is indicated by the policy. This policy covers the following states:

• INDIANA

• MICHIGAN

However, if you do this with the second page, you'll get unreadable characters: [screenshot]

In other words: copy/pasting from Adobe Reader gives you garbage.

And if copy/pasting from Adobe Reader gives you garbage, any text extraction tool will give you garbage. You'll need to OCR the document to solve this problem.
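If you want to spot such pages in code before deciding on OCR, a rough heuristic is to measure how much of the extracted text consists of letters and digits. The sketch below is only an illustration: `LooksLikeGarbage` and the 0.5 threshold are hypothetical choices, not part of iTextSharp, and the check cannot catch pages whose glyphs map to the wrong letters.

```csharp
using System;
using System.Linq;

static class ExtractionCheck
{
    // Hypothetical helper: returns true when the share of letters and digits
    // among the non-whitespace characters is below minRatio.
    // The default threshold of 0.5 is an assumption; tune it for your documents.
    public static bool LooksLikeGarbage(string text, double minRatio = 0.5)
    {
        // Empty or whitespace-only output is treated as a failed extraction.
        if (string.IsNullOrWhiteSpace(text)) return true;

        int visible = text.Count(c => !char.IsWhiteSpace(c));
        int alphanumeric = text.Count(c => char.IsLetterOrDigit(c));

        return (double)alphanumeric / visible < minRatio;
    }
}
```

You could call it on each page's result, for example `ExtractionCheck.LooksLikeGarbage(strPage)`, and route flagged pages to OCR. Output like the sample posted in the comments, which is mostly punctuation, is the kind of text this is meant to flag.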

Regarding your additional question in the comments: "if the PDF employs a custom encoding method, how can Adobe display it properly unless the PDF file contains the information needed to handle it?"

This question is answered in a 14-minute movie: https://www.youtube.com/watch?v=wxGEEv7ibHE
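As mentioned in the comments, a PDF with embedded fonts can draw glyphs without recording which Unicode characters they stand for. One way to look for this is to list the fonts on a page and check whether each font dictionary has a /ToUnicode entry. This is a minimal sketch, assuming iTextSharp 5.x; the `FontMappingCheck` class and `Report` method are hypothetical names, and fonts used only inside form XObjects are not covered.

```csharp
using System;
using iTextSharp.text.pdf;

static class FontMappingCheck
{
    // Prints, for each font in the page's resources, whether the font
    // dictionary has /ToUnicode and /Encoding entries.
    public static void Report(string path, int pageNumber)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            PdfDictionary page = reader.GetPageN(pageNumber);
            PdfDictionary resources = page.GetAsDict(PdfName.RESOURCES);
            PdfDictionary fonts = resources == null ? null : resources.GetAsDict(PdfName.FONT);
            if (fonts == null)
            {
                Console.WriteLine("No font resources found on page {0}.", pageNumber);
                return;
            }

            foreach (PdfName name in fonts.Keys)
            {
                PdfDictionary font = fonts.GetAsDict(name);
                if (font == null) continue;

                bool hasToUnicode = font.Get(PdfName.TOUNICODE) != null;
                bool hasEncoding = font.Get(PdfName.ENCODING) != null;
                Console.WriteLine("{0}: ToUnicode={1}, Encoding={2}", name, hasToUnicode, hasEncoding);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Treat the result as a hint only. A font without /ToUnicode may still extract correctly if it uses a standard encoding, and a /ToUnicode map that is present can still be wrong or deliberately misleading.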

score 0 · Answer 2 · answered Sep 17 '14 at 10:33

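Try this code:

List<string> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    // Use a fresh strategy for each page so earlier pages' text is not carried over.
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    PdfTextExtractor.GetTextFromPage(reader, page, its);

    // Read the extracted text back from the strategy; no re-encoding step is applied.
    string strPage = its.GetResultantText();

    pdfText.Add(strPage);
}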

score 0 · Answer 3 · answered Jul 15 '16 at 09:58

Try this code; it worked for me:

using (PdfReader reader = new PdfReader(path))
{
    StringBuilder text = new StringBuilder();

    // Page numbers start at 1.
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
    }

    return text.ToString();
}
