I am using iTextSharp to extract text from PDF documents, but text from some PDFs encoded in ISO-8859-1 is not displayed correctly.

Below is my code; if anyone can help me, I will be grateful.

public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader = null;

    try
    {
        if (File.Exists(fileName))
        {
            pdfReader = new PdfReader(fileName);
            Encoding encoding = Encoding.GetEncoding("iso8859-2");

            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, new LocationTextExtractionStrategy());
                currentText = encoding.GetString(ASCIIEncoding.Convert(Encoding.UTF8, encoding, Encoding.Default.GetBytes(currentText)));
                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }
    catch
    {
        return string.Empty;
    }
    finally
    {
        if (pdfReader != null) pdfReader.Close();
    }
}
Camila Reis
  • Please see this post explaining why you should never use `encoding.GetString(ASCIIEncoding.Convert(Encoding.UTF8, encoding, Encoding.Default.GetBytes(currentText)))`: http://stackoverflow.com/a/10191879/231316 (a demonstration follows these comments). – Chris Haas Oct 14 '13 at 19:55
  • @ChrisHaas thanks for the reply. I understand, but even if I remove that line my problem will not be solved. Characters like ç, ^, ~, and so on must be recognized. – Camila Reis Oct 14 '13 at 20:12
  • You should examine the raw bytes that are coming out of `GetTextFromPage()`. For instance, dump `BitConverter.ToString(System.Text.Encoding.UTF8.GetBytes(currentText))` (a runnable sketch of this diagnostic also follows these comments). If the bytes are what you expect then you have a logic problem later on. If they're not what you expect then there's a really good chance (based on the number of times we've seen this question) that the source PDF is corrupt. Are you able to provide the source PDF? – Chris Haas Oct 14 '13 at 20:27
  • Apparently the file is not corrupted. I think the problem is that the encoding of the file is not an English one. – Camila Reis Oct 14 '13 at 20:50
  • So is `GetTextFromPage()` getting what you expect? If not, can you post what it is getting and what you expected? – Chris Haas Oct 14 '13 at 21:00
  • Also note that the version of iTextSharp matters. Plenty of improvements were added over the years. – Bruno Lowagie Oct 15 '13 at 07:01
  • `GetTextFromPage()` is not returning what I expect. Do you want me to send the bytes? I'm using version 5.4.4 of iTextSharp. – Camila Reis Oct 15 '13 at 11:20
  • Expected: `49-6E-74-72-6F-64-75-A7-C3-A3-C3-6F-20-61-20-49-6E-74-65-6C-69-67` Received: `49-6E-74-72-6F-64-75-63-20-C2-B8-61-6F-20-9C-CB-20-61-20-49-6E-74-65-6C-69-67` – Camila Reis Oct 15 '13 at 11:35
  • Is this `UTF8.GetBytes()` or `Default.GetBytes()`? It appears to be the latter since `A7` cannot exist alone in UTF-8. Can you post using UTF-8 for both? `Default` is really hard to work with and even MSDN recommends against using it. http://msdn.microsoft.com/en-us/library/system.text.encoding.default(v=vs.100).aspx – Chris Haas Oct 15 '13 at 13:27
  • I tried opening the file with Notepad and noticed that, because of the way it was generated, various elements of the PDF's internal structure (like GoTo, EndObject and others) became part of the content. It seems that this is what is generating the error. – Camila Reis Oct 15 '13 at 13:58
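
To illustrate the point raised in the first comment, here is a minimal sketch, using only standard System.Text APIs, of why the conversion line is destructive: `GetTextFromPage()` already returns a decoded .NET string, and round-tripping it through `Encoding.Default` and a UTF-8 reinterpretation replaces non-ASCII characters with fallbacks. The sample string is hypothetical, and the exact output assumes .NET Framework with `Encoding.Default` being Windows-1252.

using System;
using System.Text;

class EncodingRoundTripDemo
{
    static void Main()
    {
        // Hypothetical sample containing the kind of characters the question mentions.
        string currentText = "Introdução";

        Encoding encoding = Encoding.GetEncoding("iso8859-2");

        // The conversion from the question (ASCIIEncoding.Convert is the same
        // static Encoding.Convert): re-encode the already-decoded string with
        // the machine-dependent Encoding.Default, then reinterpret those
        // single-byte characters as UTF-8.
        string converted = encoding.GetString(
            Encoding.Convert(Encoding.UTF8, encoding,
                             Encoding.Default.GetBytes(currentText)));

        // Under Windows-1252, 'ç' becomes the lone byte 0xE7, which is invalid
        // UTF-8, so the decoder substitutes U+FFFD and the ISO-8859-2 encoder
        // then falls back to '?': the accented characters are lost.
        Console.WriteLine(currentText); // Introdução
        Console.WriteLine(converted);   // Introdu??o
    }
}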
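
And here is a minimal sketch of the byte-dump diagnostic suggested in the comments, assuming the same iTextSharp 5.x API the question already uses; the file name is a placeholder for the PDF under investigation.

using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class DumpPageBytes
{
    static void Main()
    {
        string fileName = "sample.pdf"; // placeholder path

        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            // Extract the first page exactly as the question's loop does.
            string currentText = PdfTextExtractor.GetTextFromPage(
                pdfReader, 1, new LocationTextExtractionStrategy());

            // Dump the UTF-8 bytes so the "Expected" and "Received" hex strings
            // from the comments can be compared directly.
            Console.WriteLine(BitConverter.ToString(
                Encoding.UTF8.GetBytes(currentText)));
        }
        finally
        {
            pdfReader.Close();
        }
    }
}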

0 Answers