1

Okay, I'm trying to extract text from a PDF file using iTextSharp... that's all I want. However, when I extract the text, it's giving me garbage instead of text.

Here's the code I'm using...

List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    String strPage = PdfTextExtractor.GetTextFromPage(reader, page, its);

    strPage = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default,
              Encoding.UTF8, Encoding.Default.GetBytes(strPage)));

    pdfText.Add(strPage);
}

I then save that data to a text file, but instead of readable text, I get text that looks like binary data... non-printable characters all over the place. I'd post an image of what I see, but it won't let me. Sorry about that.

I have tried without the encoding attempt, and it didn't work any better... still binary-looking data (viewed in Notepad), though I'm not certain it's identical to that produced with the encoding attempt.

Any idea what is happening and how to fix it?

user3313540
  • See if this helps for starters http://stackoverflow.com/a/10191879/231316 – Chris Haas Apr 24 '14 at 18:36
  • Please supply the PDF file in question. Some PDF files don't contain the information on how to translate the glyph identifiers to Unicode, some actually even try to mislead. – mkl Apr 25 '14 at 07:19
  • Chris, I'm not sure that's the problem, as I've tried the code without the encoding line, and the problem persists. I'll try it again, though, just to cover all bases. – user3313540 Apr 25 '14 at 13:19
  • mkl, if the PDF employs a custom encoding method, how can Adobe display it properly unless the PDF file contains the information needed to handle it? Anyway, the PDF file I'm trying to parse can be found at the following URL: http://fileshare.homestead.com/files/share/9240a920-f0eb-479f-b186-88fe7bcf4337.pdf – user3313540 Apr 25 '14 at 13:20
  • And here is an example of the garbled text that this method extracts: ++ ' '( ())$$$$* ** ** *+ + +$ +$ $$ $$ + + ( (,- ,- ../ /% 012 % 012 / /3&2 3&2 ../ /#%2 – user3313540 Apr 25 '14 at 13:32
  • *if the PDF employs a custom encoding method, how can Adobe display it properly unless the PDF file contains the information needed to handle it?* - PDFs can contain embedded fonts. If they do, all information on the 'meaning' of the encoding can be left out; all the PDF needs to contain is a set of graphical rendering instructions for each byte of the encoding used. It does *not* need to say which Unicode character each of those drawings represents. – mkl Apr 25 '14 at 13:35
  • Okay, as best I can tell, the PDF file doesn't contain text, it contains font images of each character. So I can't do a text extraction, I need to do an OCR extraction. – user3313540 Apr 25 '14 at 15:17

3 Answers

0

Please open the document in Adobe Reader, then try to copy/paste part of the text.

If you do this with the first page, you'll get:

The following policy (L30304) has been archived by Alpha II. Many policies are part of a larger jurisdiction, than is indicated by the policy. This policy covers the following states:

• INDIANA

• MICHIGAN

However, if you do this with the second page, you'll get: [screenshot of the pasted result]

In other words: copy/pasting from Adobe Reader gives you garbage.

And if copy/pasting from Adobe Reader gives you garbage, any text extraction tool will give you garbage. You'll need to OCR the document to solve this problem.
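
If you would rather detect this in code than by eyeballing each page, one rough option is to check how much of the extracted text is letters or digits. The following is only a sketch: the `ExtractionCheck` class, the `LooksLikeGarbage` name and the 0.5 threshold are all made up for illustration and are not part of iTextSharp. It will also miss pages where the glyphs happen to map to the wrong letters rather than to punctuation.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of iTextSharp.
public static class ExtractionCheck
{
    // Returns true when the share of letters/digits among the non-whitespace
    // characters is below minLetterRatio. The 0.5 default is an arbitrary
    // assumption; tune it against your own documents.
    public static bool LooksLikeGarbage(string text, double minLetterRatio = 0.5)
    {
        // An empty or blank page says nothing about the encoding.
        if (string.IsNullOrWhiteSpace(text))
            return false;

        int visible = text.Count(c => !char.IsWhiteSpace(c));
        int alphanumeric = text.Count(c => char.IsLetterOrDigit(c));

        return (double)alphanumeric / visible < minLetterRatio;
    }
}
```

You could call `ExtractionCheck.LooksLikeGarbage(strPage)` on each page's extracted string and send only the pages that fail the check to OCR.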

Regarding your additional question in the comments: "if the PDF employs a custom encoding method, how can Adobe display it properly unless the PDF file contains the information needed to handle it?"

This question is answered in a 14-minute movie: https://www.youtube.com/watch?v=wxGEEv7ibHE
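
To see which fonts in a file come without a Unicode mapping, you can list the font dictionaries of each page and check for a /ToUnicode entry. This is a minimal sketch assuming iTextSharp 5.x; the `FontMappingReport` class and its `Print` method are names made up for this example. It only looks at fonts in the page resources (not fonts inside form XObjects), and a missing /ToUnicode does not always mean the text cannot be extracted, because a standard /Encoding may be enough.

```csharp
using System;
using iTextSharp.text.pdf;

// Hypothetical helper for illustration; assumes iTextSharp 5.x.
public static class FontMappingReport
{
    public static void Print(string path)
    {
        using (PdfReader reader = new PdfReader(path))
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Page-level resources; fonts nested in form XObjects are not visited.
                PdfDictionary resources = reader.GetPageN(page).GetAsDict(PdfName.RESOURCES);
                PdfDictionary fonts = resources == null ? null : resources.GetAsDict(PdfName.FONT);
                if (fonts == null)
                    continue;

                foreach (PdfName key in fonts.Keys)
                {
                    PdfDictionary font = fonts.GetAsDict(key);
                    if (font == null)
                        continue;

                    // Without /ToUnicode the extractor has to fall back on /Encoding, if any.
                    Console.WriteLine("page {0}, font {1} ({2}): ToUnicode={3}, Encoding={4}",
                        page, key, font.GetAsName(PdfName.BASEFONT),
                        font.Contains(PdfName.TOUNICODE),
                        font.Contains(PdfName.ENCODING));
                }
            }
        }
    }
}
```

Pages whose fonts report `ToUnicode=False` and `Encoding=False` are the likely candidates for OCR.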

Bruno Lowagie
0

Try this code:

List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    // Run the page content through the strategy...
    PdfTextExtractor.GetTextFromPage(reader, page, its);

    // ...then read the collected text back from it, with no re-encoding step.
    string strPage = its.GetResultantText();

    pdfText.Add(strPage);
}
sbeci
0

Try this code; it worked for me:

using (PdfReader reader = new PdfReader(path))
{
    StringBuilder text = new StringBuilder();

    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
    }

    return text.ToString();
}
Munavvar