
In my C# code I am extracting text from a PDF, but the text it gives back contains some unexpected characters. If I search for "CLE action" when I know the text "CLE action" is in the PDF document, the search returns false. After extracting the text I found that the "space" between the two words has an ASCII byte value of 63...

Is there a quick way to fix the encoding on the text?

Currently I am using this method, but I think it's slow and only works for that one character. Is there some fast method that works for all characters?

    public static string fix_encoding(string src)
    {
        StringWriter return_str = new StringWriter();
        byte[] byte_array = Encoding.ASCII.GetBytes(src.Substring(0, src.Length));
        int len = byte_array.Length;
        byte byt;
        for(var i=0; i<len; i+=1)
        {
            byt = byte_array[i];
            if (byt == 63)
            {
                return_str.Write(" ");
            }
            else
            {
                return_str.Write(Encoding.ASCII.GetString(byte_array, i, 1));
            }
        }
        return return_str.ToString();
    }

This is how I call this method:

    StringWriter output = new StringWriter();
    output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, page, new SimpleTextExtractionStrategy()));
    currentText = fix_encoding(output.ToString());
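One way to see what character is really sitting between the two words is to dump the Unicode code points instead of round-tripping through ASCII (which turns every non-ASCII character into `?`, i.e. byte 63). A minimal sketch; the `extracted` string here is a hypothetical stand-in for the `PdfTextExtractor` output, assuming the "space" is actually a no-break space (U+00A0):

```csharp
using System;

class Probe
{
    static void Main()
    {
        // Hypothetical stand-in for the string returned by PdfTextExtractor;
        // here the "space" is actually a no-break space (U+00A0).
        string extracted = "CLE\u00A0action";
        foreach (char c in extracted)
        {
            // Print each character with its Unicode code point.
            Console.WriteLine($"'{c}' = U+{(int)c:X4}");
        }
    }
}
```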
omega
    Where does `src` come from, exactly? If you're getting a `?` (which is what ASCII 63 is) unexpectedly, that's probably because you used the wrong encoding to start with. – Jon Skeet Dec 20 '12 at 18:52
  • 2
    You need to go way back to the point where you are actually decoding the PDF, and then decode it using the correct encoding. – Esailija Dec 20 '12 at 18:53
  • @Jon Skeet, I am getting it using the iTextSharp method to get the text from the page. (see above) – omega Dec 20 '12 at 19:08
  • @omega: Why are you using a `StringWriter` for that? Why not just `currentText = PdfTextExtractor.GetTextFromPage(reader, page, new SimpleTextExtractionStrategy())`? – Jon Skeet Dec 20 '12 at 19:27

1 Answer


It is possible that the spaces you extract from the PDF file are not real spaces (" "), but other kinds of spaces defined in Unicode, for example an "em space" or a "no-break space"; see this list or here for an overview.

If the extracted text contains such a space, and you search the text for a normal space, you won't find it, because it is not identical.

Your fix_encoding function converts the string to ASCII. None of these unusual spaces exist in ASCII, and by default non-ASCII characters are converted to a question mark. So inside your fix_encoding function you see a question mark even though the original text contains a different character.
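You can see this substitution happen directly; the example below assumes the extracted "space" is a no-break space (U+00A0), which is one plausible culprit:

```csharp
using System;
using System.Text;

class AsciiDemo
{
    static void Main()
    {
        // The no-break space (U+00A0) has no ASCII representation, so the
        // default encoder fallback substitutes '?' (byte value 63).
        byte[] bytes = Encoding.ASCII.GetBytes("CLE\u00A0action");
        Console.WriteLine(bytes[3]); // prints 63: the byte where the "space" was
    }
}
```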

This means that in your fix_encoding function you should not convert to ASCII, but instead replace the unusual spaces with a normal space. The following function replaces all non-ASCII characters with a space, but you could also use Char.IsWhiteSpace to decide which characters to replace.

public static string remove_non_ascii(string src)
{
    // Requires: using System.Text.RegularExpressions;
    return Regex.Replace(src, @"[^\u0000-\u007F]", " ");
}
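The Char.IsWhiteSpace route mentioned above could look like this. A sketch only; the NormalizeSpaces name is made up, and the point is that it keeps legitimate non-ASCII letters while still mapping every Unicode whitespace character to a plain space:

```csharp
using System.Linq;

static class SpaceNormalizer
{
    // Map every Unicode whitespace character (no-break space, em space,
    // etc.) to a plain ' ', leaving all other characters untouched.
    public static string NormalizeSpaces(string src)
    {
        return new string(src.Select(c => char.IsWhiteSpace(c) ? ' ' : c).ToArray());
    }
}
```

With this variant, searching the normalized text for "CLE action" succeeds even when the PDF used a no-break space between the words.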
wimh