I am using iTextSharp to extract text from PDF documents, but text from some PDFs encoded in ISO-8859-1 is not displayed correctly.

Below is my code; if anyone can help me, I will be grateful.

public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader = null;

    try
    {
        if (File.Exists(fileName))
        {
            pdfReader = new PdfReader(fileName);
            Encoding encoding = Encoding.GetEncoding("iso8859-2");

            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, new LocationTextExtractionStrategy());
                currentText = encoding.GetString(ASCIIEncoding.Convert(Encoding.UTF8, encoding, Encoding.Default.GetBytes(currentText)));
                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }
    catch
    {
        return string.Empty;
    }
    finally
    {
        if (pdfReader != null) pdfReader.Close();
    }
}
Camila Reis
  • Please see this post explaining why you should never use `encoding.GetString(ASCIIEncoding.Convert(Encoding.UTF8, encoding, Encoding.Default.GetBytes(currentText)))`: http://stackoverflow.com/a/10191879/231316 (a demonstration follows these comments). – Chris Haas Oct 14 '13 at 19:55
  • @ChrisHaas thanks for the reply. I understand, but even if I remove that line my problem will not be solved. Characters like ç, ^, ~, and so on must be recognized. – Camila Reis Oct 14 '13 at 20:12
  • You should examine the raw bytes that are coming out of `GetTextFromPage()`. For instance, dump `BitConverter.ToString(System.Text.Encoding.UTF8.GetBytes(currentText))` (a runnable sketch of this diagnostic also follows these comments). If the bytes are what you expect then you have a logic problem later on. If they're not what you expect then there's a really good chance (based on the number of times we've seen this question) that the source PDF is corrupt. Are you able to provide the source PDF? – Chris Haas Oct 14 '13 at 20:27
  • Apparently the file is not corrupted. I think the problem is that the encoding of the file is not an English one. – Camila Reis Oct 14 '13 at 20:50
  • So is `GetTextFromPage()` getting what you expect? If not, can you post what it is getting and what you expected? – Chris Haas Oct 14 '13 at 21:00
  • Also note that the version of iTextSharp matters. Plenty of improvements were added over the years. – Bruno Lowagie Oct 15 '13 at 07:01
  • `GetTextFromPage()` is not returning what I expect. Do you want me to send the bytes? I'm using version 5.4.4 of iTextSharp. – Camila Reis Oct 15 '13 at 11:20
  • Expected: `49-6E-74-72-6F-64-75-A7-C3-A3-C3-6F-20-61-20-49-6E-74-65-6C-69-67` Received: `49-6E-74-72-6F-64-75-63-20-C2-B8-61-6F-20-9C-CB-20-61-20-49-6E-74-65-6C-69-67` – Camila Reis Oct 15 '13 at 11:35
  • Is this `UTF8.GetBytes()` or `Default.GetBytes()`? It appears to be the latter since `A7` cannot exist alone in UTF-8. Can you post using UTF-8 for both? `Default` is really hard to work with and even MSDN recommends against using it. http://msdn.microsoft.com/en-us/library/system.text.encoding.default(v=vs.100).aspx – Chris Haas Oct 15 '13 at 13:27
  • I tried opening the file with Notepad and noticed that, because of the way it was generated, various elements of the PDF's internal structure (like GoTo, EndObject and others) became part of the content. It seems that this is what is generating the error. – Camila Reis Oct 15 '13 at 13:58
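
To illustrate the point raised in the first comment, here is a minimal sketch, using only standard System.Text APIs, of why the conversion line is destructive: `GetTextFromPage()` already returns a decoded .NET string, and round-tripping it through `Encoding.Default` and a UTF-8 reinterpretation replaces non-ASCII characters with fallbacks. The sample string is hypothetical, and the exact output assumes .NET Framework with `Encoding.Default` being Windows-1252.

using System;
using System.Text;

class EncodingRoundTripDemo
{
    static void Main()
    {
        // Hypothetical sample containing the kind of characters the question mentions.
        string currentText = "Introdução";

        Encoding encoding = Encoding.GetEncoding("iso8859-2");

        // The conversion from the question (ASCIIEncoding.Convert is the same
        // static Encoding.Convert): re-encode the already-decoded string with
        // the machine-dependent Encoding.Default, then reinterpret those
        // single-byte characters as UTF-8.
        string converted = encoding.GetString(
            Encoding.Convert(Encoding.UTF8, encoding,
                             Encoding.Default.GetBytes(currentText)));

        // Under Windows-1252, 'ç' becomes the lone byte 0xE7, which is invalid
        // UTF-8, so the decoder substitutes U+FFFD and the ISO-8859-2 encoder
        // then falls back to '?': the accented characters are lost.
        Console.WriteLine(currentText); // Introdução
        Console.WriteLine(converted);   // Introdu??o
    }
}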
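
And here is a minimal sketch of the byte-dump diagnostic suggested in the comments, assuming the same iTextSharp 5.x API the question already uses; the file name is a placeholder for the PDF under investigation.

using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class DumpPageBytes
{
    static void Main()
    {
        string fileName = "sample.pdf"; // placeholder path

        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            // Extract the first page exactly as the question's loop does.
            string currentText = PdfTextExtractor.GetTextFromPage(
                pdfReader, 1, new LocationTextExtractionStrategy());

            // Dump the UTF-8 bytes so the "Expected" and "Received" hex strings
            // from the comments can be compared directly.
            Console.WriteLine(BitConverter.ToString(
                Encoding.UTF8.GetBytes(currentText)));
        }
        finally
        {
            pdfReader.Close();
        }
    }
}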

0 Answers