Getting string in right format from byte array

Question

I have a problem converting a byte array to a string in the right format. I'm reading a byte array from a TCP socket, and one of the bytes is 158. If I decode the bytes with:

Encoding.Latin1.GetString(data)

it gives me a string like "blahblah\u009eblahblah". \u009e is the code for the letter ž. The string I need should be "blahblahžblahblah". How can I get the string in the right format?

I already tried other encodings such as ASCII and UTF8; none of them gave me the right result.

EDIT: a code example showing how I'm getting the data and what I'm doing with it:

TcpClient client = new TcpClient(terminal.server_IP, terminal.port);
NetworkStream stream = client.GetStream();
stream.ReadTimeout = 2000;

string message = "some message for terminal";
byte[] msg = Encoding.Latin1.GetBytes(message);
stream.Write(msg, 0, msg.Length);

// Receive buffer; its declaration was not shown in the original snippet, so the size is an assumption.
byte[] data = new byte[1024];
int bytes = stream.Read(data, 0, data.Length);
string rsp = Encoding.Latin1.GetString(data, 0, bytes);

EDIT2: I don't know what the problem was. I created a new project targeting .NET Framework version 4.7.2, and in that project it works fine. Thanks to everyone for the suggestions; credit goes to @Jeppe Stig Nielsen.

https://stackoverflow.com/questions/14057434/how-can-i-transform-string-to-utf-8-in-c — Emanuele, Oct 19 '21 at 07:48
That looks a lot like unicode. I really wonder why UTF8 didn't work. Can you post a [mcve] for us to reproduce this? — Fildor, Oct 19 '21 at 07:49
Is it that the byte array actually contains the textual representation of Unicode characters? how are you viewing the results. where are you getting the data from? — TheGeneral, Oct 19 '21 at 07:50
Could you provide the *byte array*, please? You can do it as `string dump = string.Join(" ", msg); Console.WriteLine(dump);`. Then, please, provide the desired *string* — Dmitry Bychenko, Oct 19 '21 at 08:04
Looks like something may have already incorrectly decoded some data using the wrong encoding. — Matthew Watson, Oct 19 '21 at 08:11
By the way, `\u009e` is NOT the code for `ž` - it's the code for the unprintable character "PRIVACY MESSAGE". Any escape sequence beginning with `\u` in a string is supposed to be a Unicode value (which is why it starts with `u` for Unicode). Something has messed up somewhere before you receive that string, it seems. — Matthew Watson, Oct 19 '21 at 08:16
@Dmitry Bychenko I don't have problems with msg, it works fine. The problem is with rsp, with the decoding of data. I think I have the wrong Encoding type, but if I try UTF8 or others, I get strange symbols, for example: Encoding.UTF8.GetString(new byte[] {158}) = "�" — Taliga, Oct 19 '21 at 08:23
@Matthew Watson - so the problem may be on the terminal side, if I read the data out with stream.Read(data, 0, data.Length)? — Taliga, Oct 19 '21 at 08:27
@Taliga there are a lot of smart people trying to help you here. If someone asks you to supply something they feel is pertinent to the clarity of the question, you should oblige and not discount such requests — TheGeneral, Oct 19 '21 at 08:33
@Taliga Yes, it looks like some incorrect encoding is being done BEFORE the string is returned to you via TCP/IP. — Matthew Watson, Oct 19 '21 at 08:35
@Taliga Yes, `Encoding.GetEncoding("Windows-1252").GetString(data)` works immediately on old .NET Framework. I would be interested to know if your project targeting .NET 5.0 (under Windows) would work if you said `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);` first (may need new assembly reference, maybe there is a simple tip near the word `CodePagesEncodingProvider` to add the reference (modifies the project file)), and then after that do `Encoding.GetEncoding("Windows-1252").GetString(data)`. Also, find out if `"Windows-1250"` is better (they agree on `0x9E`, however). — Jeppe Stig Nielsen, Oct 19 '21 at 09:03
I'm guessing whatever is sending you the data is not encoding it correctly in the first place — Charlieface, Oct 19 '21 at 09:19

Jeppe Stig Nielsen · Accepted Answer · 2021-10-19T09:09:12.140


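Encoding.Latin1 is not usable in your case. True Latin 1 (ISO-8859-1) does not contain ž (LATIN SMALL LETTER Z WITH CARON); it decodes byte 0x9E to the control character U+009E, which is the \u009e you are seeing.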

If you want Windows-1252, use

Encoding.GetEncoding("Windows-1252").GetString(data)

This will turn bytes of decimal value 158 (hex 0x9E) into lowercase ž.

It may also be "Windows-1250" that you have. What other non-English letters do you expect in your text? Compare Windows-1252 and Windows-1250; they are different in general, but both agree that hex byte 0x9E (dec 158) is ž.
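To see where the two code pages agree and where they differ, a small probe like the following may help. It is only a sketch, assuming .NET 5 or later with C# top-level statements; the byte 0xE8 is an illustrative choice that does not come from the question (it is è in Windows-1252 and č in Windows-1250), and the variable names are made up for the example.

```csharp
using System;
using System.Text;

// On .NET Core / .NET 5+ the Windows code pages are not available until this provider is registered.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

Encoding cp1252 = Encoding.GetEncoding("Windows-1252");
Encoding cp1250 = Encoding.GetEncoding("Windows-1250");

// 0x9E is the byte from the question; 0xE8 is an assumed example of a byte where the code pages differ.
byte[] probe = { 0x9E, 0xE8 };

string western = cp1252.GetString(probe);
string central = cp1250.GetString(probe);

// Print the Unicode code point each code page assigns to each byte.
for (int i = 0; i < probe.Length; i++)
{
    Console.WriteLine($"0x{probe[i]:X2}: 1252 -> U+{(int)western[i]:X4}, 1250 -> U+{(int)central[i]:X4}");
}
```

If the text from your terminal contains letters such as č or ř, Windows-1250 is the more likely candidate; if it contains letters such as è or ñ, Windows-1252 is.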

When on a .NET Core system where the above does not work immediately, attempt to execute:

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var goodText = Encoding.GetEncoding("Windows-1252").GetString(data);

Finding the type CodePagesEncodingProvider may need a reference to the assembly System.Text.Encoding.CodePages.dll.
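Put together, a minimal sketch might look like this. It assumes .NET 5 or later (for Encoding.Latin1 and top-level statements), and the byte array is a hypothetical stand-in for what you read from the socket; the name viaLatin1 is made up for the comparison.

```csharp
using System;
using System.Text;

// Register once at startup, before the first GetEncoding("Windows-1252") call.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical stand-in for the bytes read from the socket: "blah", byte 158, "blah".
byte[] data = { 0x62, 0x6C, 0x61, 0x68, 0x9E, 0x62, 0x6C, 0x61, 0x68 };

// Latin1 keeps the byte value as the code point, so byte 158 becomes the control character U+009E.
string viaLatin1 = Encoding.Latin1.GetString(data);

// Windows-1252 maps byte 158 to U+017E, which is ž.
string goodText = Encoding.GetEncoding("Windows-1252").GetString(data);

Console.WriteLine($"Latin1:       U+{(int)viaLatin1[4]:X4}");
Console.WriteLine($"Windows-1252: U+{(int)goodText[4]:X4}");
```

Registering the provider more than once is harmless, but doing it a single time near program start keeps the decoding code itself free of setup.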

edited Oct 19 '21 at 09:09

answered Oct 19 '21 at 07:57


I tried Encoding.GetEncoding("Windows-1252") and got an error: 'Windows-1252' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method. – Taliga Oct 19 '21 at 08:07
@Taliga You are right, I was on the old .NET Framework (which also explains why I did not see [`Latin1`](https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding.latin1) property which is new in .NET 5). You need to figure out if you have [Windows-1252](https://en.wikipedia.org/wiki/Windows-1252) or [Windows-1250](https://en.wikipedia.org/wiki/Windows-1250) or similar. Edit: Are you under Windows, or another OS? – Jeppe Stig Nielsen Oct 19 '21 at 08:12
Windows-1250 throws the same error. I'm on Windows, in a WPF project targeting .NET 5.0 – Taliga Oct 19 '21 at 08:16
@Taliga I added more to my answer above. See if it works. – Jeppe Stig Nielsen Oct 19 '21 at 08:24
