I am comparing two strings: a 32-character string that I receive from a server, and another one that I calculate with the following code:

string getMd5(string fileName)
{
    using (var md5 = MD5.Create())
    {
        using (var stream = File.OpenRead(fileName))
        {
            return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "‌​").ToLower();
        }
    }
}

The problem is that even when the two strings seem identical, the comparison fails because the string returned by the function above contains more characters than the one I receive. Please see the attached picture:

[Image: debugger watch window showing both strings and their `Length` values]

So, how do I solve this?

Thank you.

Joe Almore
  • It does not look like it contains more characters. How do you know? – Khalil Khalaf Aug 25 '16 at 17:58
  • @FirstStep Can you see the `currentMd5.Length`? How is this possible? You see and count 32 characters, but the length says there are 62, hence comparison fails. – Joe Almore Aug 25 '16 at 17:59
  • @FirstStep, see the `Length` value in the attached watch pic – Rahul Aug 25 '16 at 17:59
  • Are you sure that the encoding of text read from both the files is the same? – Ani Aug 25 '16 at 17:59
  • Now I see it sorry. I would compare just the first 32 then (whatever currentMd5 size) or just check if currentMd5 exists in the second string. Not the best approach – Khalil Khalaf Aug 25 '16 at 18:00
  • Why are you converting the outputs to strings instead of comparing the `byte[]` directly? – Lee Aug 25 '16 at 18:01
  • @FirstStep That assumes that the different character is at the end, and not at the start or in the middle. – Servy Aug 25 '16 at 18:01
  • It is probably the encoding, the stream using Unicode. – Andy G Aug 25 '16 at 18:02
  • I don't think it's the encoding because C# `string` is all the same encoding. Any `int` is the same as another other, right? Shouldn't any `string` be the same as any other? I'm wondering if there is possibly exactly 30 dashes and this is somehow messing it up: `.Replace("-", "‌​")` – Quantic Aug 25 '16 at 18:04
  • @Lee Because the app receives the `MD5` as a `String`, hence I calculate the File `MD5` and convert it to `String` to make the comparison. – Joe Almore Aug 25 '16 at 18:04
  • I'll bet Convert.ToString is using Unicode to encode the string, which would double the byte count. Try using new UTF8Encoding().GetString instead. – Kevin Aug 25 '16 at 18:06
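
The suggestion above to compare the `byte[]` directly can be sketched as follows. This is only an illustration, assuming the server sends the hash as plain hexadecimal text; `HashBytes`, `FromHex` and `Matches` are hypothetical names that do not appear in the question.

```csharp
using System;
using System.Linq;

static class HashBytes
{
    // Parses a hex string such as "d41d8c..." into its bytes.
    // Convert.ToByte throws FormatException on anything that is not a hex digit,
    // so stray characters are reported instead of silently compared.
    public static byte[] FromHex(string hex)
    {
        if (hex.Length % 2 != 0)
            throw new FormatException("Hex string must have an even length.");

        var bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        return bytes;
    }

    // Compares the received hex text with a computed hash, byte by byte.
    public static bool Matches(string receivedHex, byte[] computedHash) =>
        FromHex(receivedHex).SequenceEqual(computedHash);

    static void Main()
    {
        // Hypothetical values, used only for the demo.
        byte[] computed = { 0xD4, 0x1D, 0x8C, 0xD9 };
        Console.WriteLine(Matches("d41d8cd9", computed)); // True
        Console.WriteLine(Matches("D41D8CD8", computed)); // False
    }
}
```

Comparing bytes avoids the string formatting step altogether, so neither letter case nor the replacement argument can affect the result.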

1 Answer

That's because the second argument to Replace in your code only looks like an empty string: it actually contains two invisible Unicode characters, a 'ZERO WIDTH NON-JOINER' (U+200C) and a 'ZERO WIDTH SPACE' (U+200B). My guess is that they got there because at some point the source code fragment went through a word processor such as Word. Use string.Empty, or have a free one: "".
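
A minimal sketch of the function with that fix applied, assuming everything else in the question's code stays the same. The class name `Md5Example`, the `Main` method and the temporary file are there only for illustration.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class Md5Example
{
    // Same logic as the question's getMd5; the only change is that the
    // replacement argument is string.Empty instead of a pasted literal.
    public static string GetMd5(string fileName)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(fileName))
        {
            return BitConverter.ToString(md5.ComputeHash(stream))
                .Replace("-", string.Empty)
                .ToLower();
        }
    }

    static void Main()
    {
        // Hypothetical input: an empty temp file, used only for the demo.
        string path = Path.GetTempFileName();
        try
        {
            string hash = GetMd5(path);
            Console.WriteLine(hash.Length); // 32 hex digits for a 16-byte hash
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

This also accounts for the reported length of 62: `BitConverter.ToString` puts 15 dashes between the 16 bytes of the hash, and replacing each dash with two invisible characters adds 30 characters to the 32 hex digits.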

cynic
  • Then why do all of these return `0` for length? `("" + "").Length`, `"ab".Replace("a", "").Replace("b", "").Length`, `"ab".Replace("a", "").Replace("b", "").ToLower().Length` – Quantic Aug 25 '16 at 18:14
  • @Quantic - that's because "your" empty strings are actually empty strings. Copy them to Notepad and save them as Unicode (which is actually UTF-16LE in MS lingo) - the file sizes will be the number of double quotes times two. Then copy the ostensibly empty string from the question and save the file - you'll see that the file is actually 8 bytes - 4 for the two quotes and 4 for the two invisible characters. Let me dig up what those characters actually are... – cynic Aug 25 '16 at 18:17
  • @Quantic use the "‌​" that **seems** like an empty string in question. (or try this code `var xx = "‌​".Length;)` – L.B Aug 25 '16 at 18:19
  • Ok you're right when I copy OP's `""` it has 2 invisible characters in it. To @EdPlunkett's point, how did this happen? Did OP copy and paste `""` from somewhere instead of just typing it from his keyboard into the editor? – Quantic Aug 25 '16 at 18:22
  • @EdPlunkett `var bytes = BitConverter.ToString(Encoding.UTF8.GetBytes("‌​"));` http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128 – L.B Aug 25 '16 at 18:26
  • for reference: the zero-width-characters are probably from copying the [top comment from this answer](https://stackoverflow.com/a/10520086/1761622). Somehow the commenter got invisible unicode characters into his "empty" string – Mikescher Jan 10 '18 at 21:52
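
The checks discussed in these comments can also be done in code rather than in Notepad. The following is a rough sketch that prints each UTF-16 code unit of a string; `HiddenCharacters` and `Describe` are hypothetical names, and the suspicious literal is written with `\u` escapes so that the two characters named in the answer are visible in the source.

```csharp
using System;
using System.Linq;

static class HiddenCharacters
{
    // Formats every UTF-16 code unit of a string as U+XXXX,
    // so that zero-width characters show up in the output.
    public static string Describe(string s) =>
        string.Join(" ", s.Select(c => $"U+{(int)c:X4}"));

    static void Main()
    {
        // The two characters named in the answer, written as escapes.
        string suspicious = "\u200C\u200B";

        Console.WriteLine(suspicious.Length);       // 2
        Console.WriteLine(Describe(suspicious));    // U+200C U+200B
        Console.WriteLine(string.Empty.Length);     // 0
    }
}
```

Running `Describe` on a string whose `Length` looks wrong in the debugger shows where the extra characters sit, whether at the start, the end or in between.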