
I have an input string in an alien coding system, e.g.: "\\U+1043\\U+1072\\U+1073\\U+1072\\U+1088\\U+1080\\U+1090\\U+1085\\U+1086\\U+1089\\U+1090\\U+1100"

And I want to convert it to my default encoding (System.Text.Encoding.Default):

        System.Text.Encoding.Default    {System.Text.SBCSCodePageEncoding}  System.Text.Encoding
        BodyName    "koi8-r"    string
        CodePage    1251    int
        DecoderFallback {System.Text.InternalDecoderBestFitFallback}    System.Text.DecoderFallback
        EncoderFallback {System.Text.InternalEncoderBestFitFallback}    System.Text.EncoderFallback
        EncodingName    "Cyrillic (Windows)"    string
        HeaderName  "windows-1251"  string
        IsBrowserDisplay    true    bool
        IsBrowserSave   true    bool
        IsMailNewsDisplay   true    bool
        IsMailNewsSave  true    bool
        IsReadOnly  true    bool
        IsSingleByte    true    bool
        WebName "windows-1251"  string
        WindowsCodePage 1251    int

How can I determine the encoding, and how can I convert the string?

RomanKovalev

1 Answer


I'm not sure if I really understand your question.

In .NET, when you have a string object then you don't need to care about different encodings. All .NET strings use the same encoding: Unicode (or more precisely: UTF-16).
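
For illustration, here is a minimal sketch (the Cyrillic letter is just an example) showing that a .NET char is simply a UTF-16 code unit:

string s = "Г";                                   // CYRILLIC CAPITAL LETTER GHE
Console.WriteLine((int)s[0]);                     // 1043 (0x0413), the UTF-16 code unit
Console.WriteLine(char.ConvertFromUtf32(0x0413)); // "Г" again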

Different text encodings only come into play when you turn a string object into a byte sequence (e.g. to write it to a text file) or vice versa. I assume this is what you are talking about. To convert a byte sequence from one encoding to another, you could write:

byte[] input = ReadInput(); // raw bytes, e.g. read from a file
Encoding decoder = Encoding.GetEncoding("encoding of input");
string str = decoder.GetString(input);   // bytes -> .NET string (internally UTF-16)
Encoding encoder = Encoding.GetEncoding("encoding of output");
byte[] output = encoder.GetBytes(str);   // string -> bytes in the target encoding

Of course, you need to replace "encoding of input" and "encoding of output" with the proper encoding names. MSDN has a list of all supported encodings.
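
For example, here is a sketch assuming your input happens to be KOI8-R and you want Windows-1251 output (these names are only placeholders taken from your debugger dump; substitute your real ones):

// Note: on modern .NET (Core/5+), legacy code pages may first require
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
byte[] input = File.ReadAllBytes("input.txt");   // needs using System.IO;
string str = Encoding.GetEncoding("koi8-r").GetString(input);
byte[] output = Encoding.GetEncoding("windows-1251").GetBytes(str);
File.WriteAllBytes("output.txt", output);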

You need to know the encoding of the input, either by convention or based on metadata or something. You cannot reliably determine/guess an unknown encoding, but there are some tricks and heuristics you could apply. See How can I detect the encoding/codepage of a text file.
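
One such heuristic, as a sketch: if the file starts with a byte-order mark, StreamReader can identify the UTF flavors for you (this does not help with legacy code pages like KOI8-R or Windows-1251, which have no BOM):

// needs using System.IO;
using (var reader = new StreamReader("input.txt", Encoding.Default,
                                     detectEncodingFromByteOrderMarks: true))
{
    reader.Peek(); // CurrentEncoding is only updated after the first read
    Console.WriteLine(reader.CurrentEncoding.WebName); // e.g. "utf-8" if a BOM was found
}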

Edit:

"U+xxxx" is how you usually refer to a specific Unicode code point (the number assigned to a Unicode character), e.g. the code point of the letter "A" (Latin capital A) is U+0041.

Is your input string actually "\\U+1043..." (backslash, backslash, capital U, etc.), or is it only displayed like this, e.g. in a debugger window? If it's the former, then somebody made a mistake while encoding the text, maybe by trying to write a Unicode literal and accidentally escaping the backslash by writing a second one (Edit2: Or the characters were deliberately saved in an escaped form to write them into an ASCII-encoded file/stream/etc.). As far as I know, the .NET encoding classes do not help you here; you need to parse the string by hand.

By the way, the numbers in your example are strange. In the standard notation, the number after "U+" is a hexadecimal number, not a decimal one. But if you read your code points as hex numbers, they refer to characters from completely unrelated scripts (Burmese, Georgian Mkhedruli, Hangul Jamo); read as decimal numbers, they all refer to Cyrillic letters.

Edit3: To parse it, well: look for substrings of the form \\U+xxxx (with each x being a digit), convert xxxx to an int n, create the character with that code point (Char.ConvertFromUtf32(n)), and replace the whole substring with that character.
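
A sketch of such a parser, assuming (per the observation above) that the digits are decimal; the pattern also tolerates one or more leading backslashes, since it is unclear whether your data contains one or two:

// needs using System.Text.RegularExpressions;
string input = @"\U+1043\U+1072\U+1073\U+1072\U+1088\U+1080" +
               @"\U+1090\U+1085\U+1086\U+1089\U+1090\U+1100";

// Replace every \U+nnnn escape with the character it names.
string result = Regex.Replace(input, @"\\+U\+(\d+)",
    m => char.ConvertFromUtf32(int.Parse(m.Groups[1].Value)));

Console.WriteLine(result); // prints "Габаритность" when the digits are read as decimal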

Sebastian Negraszus
  • sorry, but I could not solve the problem. You write that UTF-16 is the default encoding in .NET, but then why is System.Text.Encoding.Default koi8-r? Or is it used only for non-.NET strings, which are then converted to UTF-16? So, I have problems determining the encoding of the "\\U+1043 ..." string, can you help me please (I tried cp1251, utf-8/16, koi8-r and several others in emacs but could not find a suitable one)? – RomanKovalev Nov 30 '12 at 08:51
    @psct: No, UTF-16 is not the default encoding, it's the internal encoding of .NET strings. The default encoding depends on your system's culture settings. – Sebastian Negraszus Nov 30 '12 at 09:57
  • ok, but what should I do with this strange string? Do I really have to view it in every encoding? Maybe you have seen this before - U+n, maybe it is some kind of Unicode notation? – RomanKovalev Nov 30 '12 at 22:43
  • thanks for the help, I am parsing Russian text, maybe that will help to find a solution. Emacs displays it correctly, but notepad does not (\\U+...). If emacs displays it correctly, perhaps that means the text is valid? – RomanKovalev Dec 01 '12 at 21:05
  • @psct - this probably means that the font you are using in notepad does not include those letters – John Palmer Dec 03 '12 at 21:11
  • Not taking the encoding of the string into account will, in some cases, keep you from understanding the problem. For example - try to zip several files in a directory, rename one of the files with Cyrillic chars, and have fun with the resulting archive :) – Ognyan Dimitrov Apr 20 '15 at 13:49