
Below is the code with a description of my problem:

  1. I need to find the encoding of this file, but not at this stage. Here the file is only read and converted to Base64:

    string FilePath = @"C:\01 New.txt";
    // ReadAllBytes opens the file, reads every byte, and closes it again.
    // (A single Stream.Read call is not guaranteed to fill the whole buffer.)
    byte[] binaryData = System.IO.File.ReadAllBytes(FilePath);
    string base64String = System.Convert.ToBase64String(binaryData); // raw bytes -> Base64 text
    Console.WriteLine("base64String is " + base64String);
    

    Please assume that the above process is done by something else, and it only returns "base64String". Now I need to read it properly.

  2. For that, I need the "ENCODING" of the base64String:

    byte[] s = Convert.FromBase64String(base64String);
    switch (GET_ENCODING(base64String))
    {
      case "ASCII":
        Console.WriteLine("ASCII text is " + Encoding.ASCII.GetString(s).Trim()); break;
      case "Default":
        Console.WriteLine("Default text is " + Encoding.Default.GetString(s).Trim()); break;
      case "UTF7":
        Console.WriteLine("UTF7 text is " + Encoding.UTF7.GetString(s).Trim()); break;
      case "UTF8":
        Console.WriteLine("UTF8 text is " + Encoding.UTF8.GetString(s).Trim()); break;
      case "BigEndianUnicode":
        Console.WriteLine("BigEndianUnicode " + Encoding.BigEndianUnicode.GetString(s).Trim()); break;
      case "UTF32":
        Console.WriteLine("UTF32 text is " + Encoding.UTF32.GetString(s).Trim()); break;
      default:
        break;
    }
    
carols10cents
Sunny Fober
  • 1
    and why are you shouting? – MaVRoSCy Oct 24 '13 at 09:14
  • 2
    WHAT LANGUAGE IS THIS? CAN YOU ADJUST YOUR TAGS? – x29a Oct 24 '13 at 09:23
  • 3
    As has been said in response to many a question before: it's practically impossible to figure out what encoding a binary blob is in! You should always, **always** have meta data that tells you what encoding something is in. If you don't, you're mostly screwed. – deceze Oct 24 '13 at 09:34

1 Answer


Base64 is not the problem here: it is only a transport wrapper, and Convert.FromBase64String gives you back exactly the bytes that were in the file. What you really have is a stream of bytes to decode as text, without knowing which encoding or character set produced them. Nothing in the bytes themselves records that, so any answer is a guess; as @deceze commented, the best thing is to ensure the encoding is always known/available.
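
To illustrate the point, here is a minimal sketch. The class and method names (Base64RoundTrip, RoundTrips, DecodeTwoWays) are made up for this example. It checks that Base64 returns the original bytes unchanged, and shows that the same bytes give different text under two candidate encodings:

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class Base64RoundTrip
{
    // True when Base64 encoding followed by decoding yields the original bytes.
    public static bool RoundTrips(byte[] original)
    {
        string base64 = Convert.ToBase64String(original);
        byte[] decoded = Convert.FromBase64String(base64);
        return decoded.SequenceEqual(original);
    }

    // Decodes the same Base64 payload under two candidate encodings:
    // index 0 is the UTF-8 reading, index 1 the ISO-8859-1 reading.
    public static string[] DecodeTwoWays(string base64String)
    {
        byte[] data = Convert.FromBase64String(base64String);
        return new[]
        {
            Encoding.UTF8.GetString(data),
            Encoding.GetEncoding("ISO-8859-1").GetString(data),
        };
    }
}
```

For the two bytes C3 A9, the UTF-8 reading is the single character "é" and the ISO-8859-1 reading is the two characters "Ã©". Both are legal decodings, so the bytes alone cannot tell you which one was intended.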

If the text is XML, HTML or MIME then you can do this in two passes:

  1. Decode as ASCII/UTF-8, then parse/search for an encoding declaration (the XML "encoding" attribute, or an HTML/MIME "charset" parameter) whose value is a name such as "UTF-8", "ISO-8859-1", etc.
  2. Decode the bytes again using the character set identified in step 1.
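
The two passes could be sketched roughly as follows. This is an assumption-laden example rather than a complete parser: the class name DeclaredCharset, the 1024-byte look-ahead and the regular expression are all choices made for illustration, and a real XML/HTML/MIME parser would be more careful about where the declaration may appear.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class DeclaredCharset
{
    // Matches encoding="..." (XML declaration) or charset=... (HTML meta / MIME header).
    private static readonly Regex Declaration = new Regex(
        @"(?:encoding|charset)\s*=\s*[""']?([A-Za-z0-9_\-.:]+)",
        RegexOptions.IgnoreCase);

    // Pass 1: read the head of the data as ASCII and look for a declared charset.
    // Returns null when nothing is declared or .NET does not know the name.
    public static Encoding Find(byte[] data)
    {
        int headLength = Math.Min(data.Length, 1024); // assumed look-ahead size
        string head = Encoding.ASCII.GetString(data, 0, headLength);
        Match match = Declaration.Match(head);
        if (!match.Success) return null;
        try
        {
            return Encoding.GetEncoding(match.Groups[1].Value);
        }
        catch (ArgumentException)
        {
            return null; // declared name is not a supported encoding
        }
    }

    // Pass 2: decode with the declared charset, or with the caller's fallback.
    public static string Decode(string base64String, Encoding fallback)
    {
        byte[] data = Convert.FromBase64String(base64String);
        Encoding encoding = Find(data) ?? fallback;
        return encoding.GetString(data);
    }
}
```

A call such as DeclaredCharset.Decode(base64String, Encoding.UTF8) then uses the declared character set when one is found and falls back to UTF-8 otherwise. On .NET Core and later, only a handful of encodings are available to Encoding.GetEncoding unless additional code pages are registered, so some declared names may still end up on the fallback.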

Otherwise you'll need a heuristic approach to detect the encoding. This won't be 100% reliable. See the links below:

Edit: XML/HTML can itself be encoded as something other than ASCII/UTF-8 (UTF-16, for example), and the same may be true of MIME. The two-pass approach only works when the declaration is readable in the first pass, that is, when the encoding is ASCII-compatible: ASCII, UTF-8 and ISO-8859-1 share the same first 128 characters. For anything else, even these file types need the heuristic approach.
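
As one possible starting point for the heuristic, the sketch below checks for a byte order mark and then tests whether the data is strictly valid UTF-8. The class name EncodingGuesser and the caller-supplied fallback are assumptions of this example. It is not reliable in general: it cannot tell single-byte code pages apart, and it does not detect UTF-16 without a BOM.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class EncodingGuesser
{
    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
            if (data[i] != prefix[i]) return false;
        return true;
    }

    // Best-effort guess; returns the fallback when nothing conclusive is found.
    public static Encoding Guess(byte[] data, Encoding fallback)
    {
        // Byte order marks. UTF-32 LE is tested before UTF-16 LE because
        // its BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) return Encoding.UTF8;
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Encoding.UTF32;
        if (StartsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return new UTF32Encoding(true, true);
        if (StartsWith(data, 0xFF, 0xFE)) return Encoding.Unicode;
        if (StartsWith(data, 0xFE, 0xFF)) return Encoding.BigEndianUnicode;

        // No BOM: accept UTF-8 only if every byte sequence is valid UTF-8.
        try
        {
            new UTF8Encoding(false, true).GetString(data); // throws on invalid bytes
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException)
        {
            return fallback; // not valid UTF-8, so defer to the caller's choice
        }
    }
}
```

Pure ASCII data is also valid UTF-8, so it is reported as UTF-8 here, which decodes it identically. When a BOM is present, GetString keeps it as a leading U+FEFF character, so you may want to strip that after decoding.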

Community
groverboy