C# - Comparing strings of different encodings

Question

Using C#, I fetch a TextBox.Text value from an .ascx page. When I compare that value for equality with a regular string object inside a LINQ query, the comparison always returns false.

I have come to the conclusion that they are differently encoded, but have so far had no luck in converting or comparing them.

docname = "Testdoc 1.docx"; //regular string created in C#
fetchedVal = ((TextBox)e.Item.FindControl("txtSelectedDocs")).Text; //UTF-8

The above two strings look identical when written as literals, but their byte[] representations are obviously different due to the encoding.

I've tried a lot of different things, such as:

System.Text.Encoding.Default.GetString(utf8.GetBytes(fetchedVal));

but that will return the value "TestdocÂ 1.docx".

If I instead try

System.Text.Encoding.Default.GetString(System.Text.Encoding.Default.GetBytes(fetchedVal));

it returns "Testdoc 1.docx" but an Equals()-check still returns false.

I have also tried the following, which seems to be the recommended approach, but with no luck:

byte[] utf8Bytes = Encoding.UTF8.GetBytes(fetchedVal);
byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
string fetchedValConverted = Encoding.Unicode.GetString(unicodeBytes);

The culprit appears to be the whitespace, because when examining the byte sequence it's always the seventh byte that differs.

How do you properly convert from UTF-8 to default string encoding in C#?

I am not sure exactly what the problem is here, but I want to point you to string's Normalize function. I don't know if this will fix your problem, but it could be useful to normalize the strings before comparing them. http://msdn.microsoft.com/en-us/library/system.string.normalize(v=vs.110).aspx – David S., Sep 29 '14 at 15:33
See @SLaks' answer, this has nothing to do with the encoding. Within .NET, all strings are encoded the same way, namely as Unicode in UTF-16. The culprit here is a non-breaking space, see [HTML encoding issues - “Â” character showing up instead of “`&nbsp;`”](http://stackoverflow.com/questions/1461907/html-encoding-issues-%C3%82-character-showing-up-instead-of-nbsp). Where was the text in your textbox pasted from, and how is it output? – CodeCaster, Sep 29 '14 at 15:36
Just as a response to @DavidS., I have explored the `Normalize` function as well, without success. @CodeCaster, the `TextBox.Text` is set from jQuery. I missed the fact that this could be the cause! – Daniel B, Sep 29 '14 at 16:28

score 7 · Accepted Answer · answered Sep 29 '14 at 15:33

Strings don't have encodings or byte arrays. Encodings only come into play when you convert a string into a byte array; you can only do that by specifying which encoding to use to pick bytes.
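As a rough illustration of that point (not part of the original answer), the sketch below turns one string into bytes with different encodings and decodes them again. The class and method names are hypothetical, and ISO-8859-1 is only an assumed stand-in for a legacy single-byte "default" code page; what `Encoding.Default` actually is depends on the runtime and system settings.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class EncodingDemo
{
    // Hex dump of the bytes produced when `text` is encoded with `encoding`.
    public static string ToHex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    // Encodes `text` with one encoding and decodes the bytes with another.
    // With two different encodings this is the mistake that produces a stray "Â".
    public static string Reinterpret(string text, Encoding encodeWith, Encoding decodeWith)
    {
        return decodeWith.GetString(encodeWith.GetBytes(text));
    }
}
```

For example, `EncodingDemo.ToHex("\u00A0", Encoding.UTF8)` yields `C2-A0`, while the same call with `Encoding.Unicode` yields `A0-00`: one string, two byte arrays. Decoding the UTF-8 bytes with `Encoding.GetEncoding("iso-8859-1")` turns `C2` into `Â`, which matches the symptom described in the question.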

It sounds like you actually simply have different characters in your strings. You might have an invisible character in one of them, or they might have different characters that look the same.

To find out, look at the Unicode code point value of each character in each string (e.g., `(int)str[0]`).
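A minimal sketch of that inspection could look like the following. The class and method names are hypothetical. Note that `(int)str[i]` gives a UTF-16 code unit, which equals the code point for characters in the Basic Multilingual Plane (including both kinds of space discussed here).

```csharp
using System;
using System.Linq;

// Hypothetical helper class, for illustration only.
public static class StringInspector
{
    // Formats every UTF-16 code unit as U+XXXX so invisible or
    // look-alike characters become visible.
    public static string DumpCodeUnits(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    // Returns the index of the first position where the two strings differ,
    // or -1 if they are ordinally equal.
    public static int FirstDifference(string a, string b)
    {
        int shared = Math.Min(a.Length, b.Length);
        for (int i = 0; i < shared; i++)
        {
            if (a[i] != b[i])
            {
                return i;
            }
        }
        return a.Length == b.Length ? -1 : shared;
    }
}
```

With the values from the question, and assuming the fetched value contains U+00A0, `StringInspector.FirstDifference("Testdoc 1.docx", "Testdoc\u00A01.docx")` returns 7, the index of the space, and `DumpCodeUnits` shows `U+0020` in one string where the other has `U+00A0`.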

SLaks


This seems very plausible, I will look into it first thing in the morning! – Daniel B Sep 29 '14 at 16:30

This was the problem. Somehow what should have been a regular space (` `, `U+0020`) was in fact a non-breaking space (`U+00A0`). – Daniel B Oct 01 '14 at 08:17
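Given that diagnosis, one possible way to make the comparison tolerant is to replace non-breaking spaces with regular spaces on both sides before comparing. This is only a sketch under that assumption, with hypothetical names; it is not taken from the thread, and it deliberately handles U+00A0 only.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class SpaceNormalizer
{
    private const char NonBreakingSpace = '\u00A0';

    // Replaces every non-breaking space (U+00A0) with a regular space (U+0020).
    public static string ReplaceNonBreakingSpaces(string s)
    {
        return s.Replace(NonBreakingSpace, ' ');
    }

    // Ordinal comparison that treats U+00A0 and U+0020 as the same character.
    public static bool EqualsIgnoringSpaceKind(string a, string b)
    {
        return string.Equals(
            ReplaceNonBreakingSpaces(a),
            ReplaceNonBreakingSpaces(b),
            StringComparison.Ordinal);
    }
}
```

Applied to the question's values, `SpaceNormalizer.EqualsIgnoringSpaceKind(docname, fetchedVal)` would return true where a plain `Equals()` returns false. Fixing the source of the non-breaking space (here, the script that sets the textbox) may be the cleaner option where that is possible.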
