I am using iTextSharp version 5.5.3.0 and am trying to extract text from a PDF in C#. The PDF is a form, not an image. This is the code:

        var text = new StringBuilder();

        // The PdfReader object implements IDisposable.Dispose, so you can
        // wrap it in the using keyword to automatically dispose of it
        using (var pdfReader = new PdfReader(inFileName))
        {
            // Loop through each page of the document
            for (var page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

                var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));

                text.Append(currentText);
            }
        }

        return text.ToString();
    }

The returned text is unusable. The PDF was generated with Ghostscript.

Does anyone have a suggestion as to what the problem could be?

  • Can the text be extracted with Acrobat? If it can please post the pdf. – Paulo Soares Sep 23 '15 at 11:12
  • Apparently, the problem was missing fonts on the system I am using. Two of the fonts from the PDF appear as "T3_Font_0" and "T3_Font_1". I will try to find out what fonts were used, install them on the system, and get back with the findings. Also, not even Windows knows how to interpret the PDF text: I copied some text and it pasted as nonsense. – user1685101 Sep 23 '15 at 11:35
  • Partially related: completely remove the line `currentText = Encoding...`, because at best it doesn't do a single thing and at worst it actually destroys your text. See [this](http://stackoverflow.com/a/10191879/231316) for more. – Chris Haas Sep 23 '15 at 14:57
  • I was unable to test my theory, as we went in a different direction with the solution and iTextSharp was removed from the libraries. So I will not be able to check whether adding the required fonts fixes the problem. Thanks for your answers, guys. – user1685101 Sep 24 '15 at 10:17
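
To illustrate the comment about the `currentText = Encoding...` line, here is a minimal sketch of the same round trip in isolation. It does not use iTextSharp. The class name `EncodingRoundTripDemo` and the helper `RoundTrip` are made up for this example, and `Encoding.ASCII` is only an assumed stand-in for `Encoding.Default`, which on the .NET Framework is the system's legacy ANSI code page and therefore differs between machines.

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Hypothetical helper: same shape as the line in the question,
    // with the legacy encoding passed in instead of Encoding.Default.
    public static string RoundTrip(string s, Encoding legacy)
    {
        return Encoding.UTF8.GetString(
            Encoding.Convert(legacy, Encoding.UTF8, legacy.GetBytes(s)));
    }

    public static void Main()
    {
        // Assumption: ASCII stands in for a legacy single-byte code page.
        Encoding legacy = Encoding.ASCII;

        // Characters the legacy encoding can represent come back unchanged,
        // so the round trip does nothing useful.
        Console.WriteLine(RoundTrip("plain text", legacy));

        // Characters it cannot represent are replaced with '?',
        // so the round trip loses information.
        Console.WriteLine(RoundTrip("10 \u20AC", legacy)); // prints "10 ?"
    }
}
```

A .NET `string` is already a sequence of UTF-16 code units, so there is nothing to "convert to UTF-8" at that point. The round trip can only leave the text as it was or drop characters, and it cannot repair text that was already extracted incorrectly.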

0 Answers