How to compare Unicode characters that "look alike"?

Question

I ran into a surprising issue.

I load a text file in my application, and I have some logic that compares values containing µ.

I noticed that even when the texts look the same, the comparison returns false.

 Console.WriteLine("μ".Equals("µ")); // returns false
 Console.WriteLine("µ".Equals("µ")); // returns true

In the second line, the character µ is copy-pasted.

However, these might not be the only characters that are like this.

Is there any way in C# to compare the characters which look the same but are actually different?

They are different characters - even though they look the same, they have different character codes. — user2864740, Dec 19 '13 at 05:50
Have you tried using String.Compare("μ", "μ", StringComparison.Ordinal) (or OrdinalIgnoreCase)? I ask because if you do a straight comparison (non-ordinal), then the characters will always be expanded, since, the way the character is expanded can vary, you may see different results. — David Venegoni, Dec 19 '13 at 05:51
I can _visually_ tell the difference between the two characters; one is narrower. — Michael Hampton, Dec 19 '13 at 05:51
@user2864740 Thanks.. thats the only solution I can try for now. But still there might be some other characters also similar to this. Is there any way to change them to same code and then compare. — D J, Dec 19 '13 at 05:58
David Venegoni: Just tested Ordinal and Invariant; still register as different. Also, you used the same mu sign twice. — Arithmomaniac, Dec 19 '13 at 05:59
what do you want to achieve? that those two should be equal then even their character code is different but the same face? — Jade, Dec 19 '13 at 06:18
“Look alike” and “look the same” are vague concepts. Do they mean identity of glyphs, or just close similarity? How close? Note that two characters may have identical glyphs in some font, very similar in another, and quite dissimilar in yet another font. What matters is *why* you would do such a comparison and in which context (and the acceptability of false positives and false negatives). — Jukka K. Korpela, Dec 19 '13 at 07:59
I wonder if these (Unicode symbols that look same) could lead to some kind of attack ... — Tanmoy, Dec 19 '13 at 10:29
@Tanmoy - yep: http://en.wikipedia.org/wiki/IDN_homograph_attack — Jac, Dec 19 '13 at 10:59
@ everyone discussing whether they look the same or not: You realize it depends on the font type your browser or text editor is using, right? I seem to have one of the better fonts that differentiate between the two characters (there is a small serif on the mu). — Perseids, Dec 19 '13 at 11:16
@ta.speot.is It seems fashionable at the moment to blame Unicode for this crap but this is neither Unicode’s fault nor even specific to Unicode. Rather, it’s simply a consequence of letters with different meanings having similar (or, as here, identical) renderings. — Konrad Rudolph, Dec 19 '13 at 11:44
@Smileek: Actually, the issue is that C# *does* see sharp enough. The two mus are different. — Luaan, Dec 19 '13 at 17:12
@Luaan, I believed to dimension10 - [he compared it pixel by pixel](http://stackoverflow.com/questions/20674300/how-to-compare-and-in-c-sharp?noredirect=1#comment30963612_20674300). I... you know... believe in humans... — Smileek, Dec 20 '13 at 07:28
@Smileek That might be so, but the truth is that pixels are not representing the character, they're just a projection to a display device, that approximates the character. And even the visual representation is font dependent - for example, in the dreaded Comic Sans, the two are very much different (probably because CS is missing one of them :D). — Luaan, Dec 20 '13 at 08:47
There is no requirement that fonts render the two characters identically. It may so happen to be the case on your computer, but they can look different depending on the set of fonts on your system, which one your browser chooses to render each character in, etc. — ShreevatsaR, Dec 20 '13 at 12:40

Tony · Answer 1 · 2013-12-24T23:57:27.587

151

Because they really are different symbols, even though they look the same. The first is the actual Greek letter mu and has character code 956 (0x3BC); the second is the micro sign and has character code 181 (0xB5).

References:

So if you want to compare them and you need them to be equal, you need to handle it manually, or replace one char with another before comparison. Or use the following code:

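using System;
using System.Globalization;
using System.Text;

public class Program
{
    public static void Main()
    {
        var s1 = "\u03BC"; // μ, GREEK SMALL LETTER MU
        var s2 = "\u00B5"; // µ, MICRO SIGN

        Console.WriteLine(s1.Equals(s2));  // False
        Console.WriteLine(RemoveDiacritics(s1).Equals(RemoveDiacritics(s2))); // True
    }

    static string RemoveDiacritics(string text)
    {
        // FormKC applies compatibility mappings; this is the step that turns
        // the micro sign (U+00B5) into the Greek letter mu (U+03BC).
        var normalizedString = text.Normalize(NormalizationForm.FormKC);
        var stringBuilder = new StringBuilder();

        // Drop any non-spacing (combining) marks left after normalization.
        foreach (var c in normalizedString)
        {
            var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
            if (unicodeCategory != UnicodeCategory.NonSpacingMark)
            {
                stringBuilder.Append(c);
            }
        }

        return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
    }
}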

And the Demo

edited Dec 24 '13 at 23:57

answered Dec 19 '13 at 05:52

Tony


11

Out of curiosity, what is the reasoning for having two µ symbols? You don't see a dedicated K with the name "Kilo sign" (or do you?). – MartinHaTh Dec 19 '13 at 12:23
12

@MartinHaTh: According to Wikipedia, it's ["for historical reasons"](http://en.wikipedia.org/wiki/Micro-#Symbol_encoding_in_character_sets). – BoltClock Dec 19 '13 at 12:43
12

Unicode has a lot of compatibility characters brought over from older character sets (like [ISO 8859-1](http://en.wikipedia.org/wiki/ISO_8859-1)), to make conversion from those character sets easier. Back when character sets were constrained to 8 bits, they would include a few glyphs (like some Greek letters) for the most common math and scientific uses. Glyph reuse based on appearance was common, so no specialized 'K' was added. But it was always a workaround; the correct symbol for "micro" is the actual Greek lowercase mu, the correct symbol for Ohm is the actual capital omega, and so on. – VGR Dec 19 '13 at 12:49
1

Although there is a specialized K for Kelvin (temperature) – Oliver Hallam Dec 19 '13 at 19:04
8

Nothing better than when something is done for hysterical raisins – paulm Dec 19 '13 at 20:31
11

Is there a special K for cereal? – Dec 20 '13 at 05:10
3

A special case of *micro-optimization*. – Chris W. Rea Dec 20 '13 at 15:47
@MartinHaTh imagine in 60 years people decide that the greek alphabet is not international enough, or a particular country decides that they don't want to use foreign characters for scientific notation (say they replace µ with 小) you can now change your encoding without breaking everything and having to remap for greek characters. In short Micro is encoded as Micro, whether it's displayed as a Mu or not. – AncientSwordRage Jan 03 '14 at 09:36
1

@Pureferret: I wonder what would have happened if Unicode had defined characters for "decimal unity point" and "visual digit separator", and the visual appearance of those characters was controlled by the user's locale. Then a number which was formatted using those characters would display properly in many locales, and--more importantly--could be unambiguously converted back to a number in any locale, even if it was formatted in a different one. – supercat Aug 20 '14 at 18:18
1

I will always wonder why Unicode doesn't have a MATHEMATICAL SYMBOL PI for historical reasons. This would have been a numerical symbol, but instead we have to use a Greek letter as a workaround. – Mr Lister Jan 05 '18 at 16:15
I realise this is a very esoteric case, but there is a large issue with lookalike characters. Scammers can use them to beat spam filters, to impersonate companies in websites, email, certificates etc. So there is a real threat from this sort of thing. – locka Jul 21 '23 at 08:29

score 127 · Accepted Answer · edited May 23 '17 at 12:32

In many cases, you can normalize both of the Unicode characters to a certain normalization form before comparing them, and they should be able to match. Of course, which normalization form you need to use depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character. You also need to consider if it's appropriate for your use case — see Jukka K. Korpela's comment.

For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:

Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)

This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.

So you'll normalize the characters using full compatibility decomposition, with the normalization forms KC or KD. Here's a quick example I wrote up to demonstrate:

using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        char first = 'μ';
        char second = 'µ';

        // Technically you only need to normalize U+00B5 to obtain U+03BC, but
        // if you're unsure which character is which, you can safely normalize both
        string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
        string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);

        Console.WriteLine(first.Equals(second));                     // False
        Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
    }
}

For details on Unicode normalization and the different normalization forms refer to System.Text.NormalizationForm and the Unicode spec.
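For whole strings rather than single characters, the same idea can be wrapped in a small helper. This is only a sketch: `CompatCompare` and `EqualsCompat` are names made up for illustration, and FormKC is chosen here as one of the two compatibility forms mentioned above.

```csharp
using System;
using System.Text;

static class CompatCompare
{
    // Hypothetical helper (not part of .NET): compares two strings after
    // compatibility normalization (NFKC), using an ordinal comparison.
    public static bool EqualsCompat(string a, string b)
    {
        string na = a.Normalize(NormalizationForm.FormKC);
        string nb = b.Normalize(NormalizationForm.FormKC);
        return string.Equals(na, nb, StringComparison.Ordinal);
    }
}

class CompatCompareDemo
{
    static void Main()
    {
        // MICRO SIGN vs GREEK SMALL LETTER MU inside longer strings
        Console.WriteLine(CompatCompare.EqualsCompat("5 \u00B5m", "5 \u03BCm")); // True

        // GREEK CAPITAL LETTER ALPHA vs LATIN CAPITAL LETTER A:
        // lookalikes, but no normalization form maps one to the other
        Console.WriteLine(CompatCompare.EqualsCompat("\u0391", "A"));            // False
    }
}
```

Keep in mind that NFKC folds many other compatibility characters too (ligatures, full-width forms and so on), so this is only appropriate when that broader folding is acceptable.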

Thanks for the Unicode spec link. First time I ever read up on it. Small note from it: "Normalization Forms KC and KD must not be blindly applied to arbitrary text .. It is best to think of these Normalization Forms as being like uppercase or lowercase mappings: useful in certain contexts for identifying core meanings, but also performing modifications to the text that may not always be appropriate." — user2864740, Dec 19 '13 at 06:57

Vishal Suthar · Answer 3 · 2013-12-19T06:09:37.553

The two characters have different character codes. Refer to this for more details:

Console.WriteLine((int)'μ');  //956
Console.WriteLine((int)'µ');  //181

Where the two characters are:

Display     Friendly Code   Decimal Code    Hex Code    Description
====================================================================
μ           &mu;            &#956;          &#x3BC;     Lowercase Mu
µ           &micro;         &#181;          &#xB5;      micro sign Mu

score 40 · Answer 4 · answered Dec 19 '13 at 06:58

For the specific example of μ (mu) and µ (micro sign), the latter has a compatibility decomposition to the former, so you can normalize the string to FormKC or FormKD to convert the micro signs to mus.

However, there are lots of sets of characters that look alike but aren't equivalent under any Unicode normalization form. For example, A (Latin), Α (Greek), and А (Cyrillic). The Unicode website has a confusables.txt file with a list of these, intended to help developers guard against homograph attacks. If necessary, you could parse this file and build a table for “visual normalization” of strings.
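As a rough sketch of that "visual normalization" idea: the code below parses lines in the `source ; target ; type # comment` layout used by confusables.txt and maps each confusable character to its target. `Confusables`, `Parse`, and `Fold` are hypothetical names, and the two sample lines are illustrative entries written in the file's layout rather than loaded from the real file. The skeleton algorithm described in the Unicode security mechanisms report also involves normalization steps, which are left out here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class Confusables
{
    // Builds a source -> target map from lines shaped like
    // "0391 ;\t0041 ;\tMA\t# comment". Comment-only and blank lines are skipped.
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var map = new Dictionary<string, string>(StringComparer.Ordinal);
        foreach (string raw in lines)
        {
            int hash = raw.IndexOf('#');
            string line = (hash >= 0 ? raw.Substring(0, hash) : raw).Trim();
            string[] fields = line.Split(';');
            if (fields.Length < 2) continue;
            map[FromHex(fields[0])] = FromHex(fields[1]);
        }
        return map;
    }

    // Converts a field such as "0066 0066 0069" into the string it denotes.
    static string FromHex(string field)
    {
        var sb = new StringBuilder();
        string[] parts = field.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string hex in parts)
        {
            int codePoint = int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            sb.Append(char.ConvertFromUtf32(codePoint));
        }
        return sb.ToString();
    }

    // Replaces every mapped character (or surrogate pair) with its target.
    public static string Fold(string text, Dictionary<string, string> map)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < text.Length; i++)
        {
            bool pair = char.IsSurrogatePair(text, i);
            string element = pair ? text.Substring(i, 2) : text[i].ToString();
            if (pair) i++;
            string target;
            sb.Append(map.TryGetValue(element, out target) ? target : element);
        }
        return sb.ToString();
    }
}

class ConfusablesDemo
{
    static void Main()
    {
        // Illustrative sample lines in the confusables.txt layout (assumption).
        var sample = new[]
        {
            "# sample data",
            "0391 ;\t0041 ;\tMA\t# GREEK CAPITAL LETTER ALPHA -> LATIN CAPITAL LETTER A",
            "0410 ;\t0041 ;\tMA\t# CYRILLIC CAPITAL LETTER A -> LATIN CAPITAL LETTER A",
        };
        var map = Confusables.Parse(sample);
        Console.WriteLine(Confusables.Fold("\u0391\u0410A", map)); // AAA
    }
}
```

Two strings can then be treated as "visually equal" when their folded forms match, with the usual caveat that this produces false positives by design.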

answered Dec 19 '13 at 06:58

dan04


Definitely good to know when using Normalize. It seems surprising that they remain distinct. – user2864740 Dec 19 '13 at 07:09
4

@user2864740: If an uppercase Greek tau didn't remain distinct from a Roman letter T, it would be very difficult to have Greek and Roman text sort sensibly into alphabetic order. Further, if a typeface were to use a different visual style for Greek and Roman letters, it would be very distracting if the Greek letters whose shapes resembled Roman letters were rendered differently from those which didn't. – supercat Dec 19 '13 at 08:18
8

More importantly, unifying the European alphabets would make `ToUpper` / `ToLower` difficult to implement. You'd need to have `"B".ToLower()` be `b` in English but `β` in Greek and `в` in Russian. As it is, only Turkish (dotless `i`) and a couple of other languages need casing rules different from the default. – dan04 Dec 19 '13 at 08:49
@dan04: I wonder if anyone ever considered assigning unique code points to all four variations of the Turkish "i" and "I"? That would have eliminated any ambiguity in the behavior of toUpper/toLower. – supercat Aug 20 '14 at 18:20

score 36 · Answer 5 · edited Dec 31 '13 at 16:55

Search both characters in a Unicode database and see the difference.

One is the Greek small letter μ and the other is the micro sign µ.

Name            : MICRO SIGN
Block           : Latin-1 Supplement
Category        : Letter, Lowercase [Ll]
Combine         : 0
BIDI            : Left-to-Right [L]
Decomposition   : <compat> GREEK SMALL LETTER MU (U+03BC)
Mirror          : N
Index entries   : MICRO SIGN
Upper case      : U+039C
Title case      : U+039C
Version         : Unicode 1.1.0 (June, 1993)

Name            : GREEK SMALL LETTER MU
Block           : Greek and Coptic
Category        : Letter, Lowercase [Ll]
Combine         : 0
BIDI            : Left-to-Right [L]
Mirror          : N
Upper case      : U+039C
Title case      : U+039C
See Also        : micro sign U+00B5
Version         : Unicode 1.1.0 (June, 1993)

edited Dec 31 '13 at 16:55

TRiG


answered Dec 19 '13 at 05:58

Subin Jacob


4

How did this get 37 upvotes? It does not answer the question ("How to compare unicode characters"), it just comments on why this particular example is not equal. At best, it should be a comment on the question. I understand comment formatting options do not allow to post it as nicely as answer formatting options do, but that should not be a valid reason to post as an answer. – Konerak Dec 30 '13 at 13:38
6

Actually the question was a different one, asking why μ and µ equality check return false. This Answer answer it. Later OP asked another question (this question) how to compare two characters that look alike. Both questions had best answers and later one of the moderator merged both questions selecting best answer of the second one as best. Someone edited this question, so that it will summarize – Subin Jacob Dec 31 '13 at 04:22
Actually, I didn't add any content after the merge – Subin Jacob Dec 31 '13 at 04:24

score 24 · Answer 6 · edited May 23 '17 at 12:25

EDIT After the merge of this question with How to compare 'μ' and 'µ' in C#
Original answer posted:

 "μ".ToUpper().Equals("µ".ToUpper()); // This always returns true.

EDIT After reading the comments: yes, the method above is not a good idea, because it may give wrong results for other kinds of input. Instead, we should normalize using full compatibility decomposition, as mentioned in the wiki. (Thanks to the answer posted by BoltClock.)

    using System;
    using System.Globalization;
    using System.Text;

    class Program
    {
        static string GREEK_SMALL_LETTER_MU = new String(new char[] { '\u03BC' });
        static string MICRO_SIGN = new String(new char[] { '\u00B5' });

        public static void Main()
        {
            // MICRO SIGN (U+00B5) followed by GREEK SMALL LETTER MU (U+03BC)
            string Mus = "\u00B5\u03BC";
            string NormalizedString = null;
            int i = 0;
            do
            {
                string OriginalUnicodeString = Mus[i].ToString();
                if (OriginalUnicodeString.Equals(GREEK_SMALL_LETTER_MU))
                    Console.WriteLine(" INFORMATION ABOUT GREEK_SMALL_LETTER_MU");
                else if (OriginalUnicodeString.Equals(MICRO_SIGN))
                    Console.WriteLine(" INFORMATION ABOUT MICRO_SIGN");

                Console.WriteLine();
                ShowHexaDecimal(OriginalUnicodeString);
                Console.WriteLine("Unicode character category " + CharUnicodeInfo.GetUnicodeCategory(Mus[i]));

                // Print the character's code units under each of the four normalization forms
                NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormC);
                Console.Write("Form C Normalized: ");
                ShowHexaDecimal(NormalizedString);

                NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormD);
                Console.Write("Form D Normalized: ");
                ShowHexaDecimal(NormalizedString);

                NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormKC);
                Console.Write("Form KC Normalized: ");
                ShowHexaDecimal(NormalizedString);

                NormalizedString = OriginalUnicodeString.Normalize(NormalizationForm.FormKD);
                Console.Write("Form KD Normalized: ");
                ShowHexaDecimal(NormalizedString);
                Console.WriteLine("_______________________________________________________________");
                i++;
            } while (i < 2);
            Console.ReadLine();
        }

        // Writes each UTF-16 code unit of the string as four hex digits
        private static void ShowHexaDecimal(string UnicodeString)
        {
            Console.Write("Hexa-Decimal Characters of " + UnicodeString + "  are ");
            foreach (char c in UnicodeString)
            {
                Console.Write("{0:X4} ", (int)c);
            }
            Console.WriteLine();
        }
    }

Output

INFORMATION ABOUT MICRO_SIGN    
Hexa-Decimal Characters of µ  are 00B5
Unicode character category LowercaseLetter
Form C Normalized: Hexa-Decimal Characters of µ  are 00B5
Form D Normalized: Hexa-Decimal Characters of µ  are 00B5
Form KC Normalized: Hexa-Decimal Characters of µ  are 03BC
Form KD Normalized: Hexa-Decimal Characters of µ  are 03BC
 ________________________________________________________________
 INFORMATION ABOUT GREEK_SMALL_LETTER_MU    
Hexa-Decimal Characters of µ  are 03BC
Unicode character category LowercaseLetter
Form C Normalized: Hexa-Decimal Characters of µ  are 03BC
Form D Normalized: Hexa-Decimal Characters of µ  are 03BC
Form KC Normalized: Hexa-Decimal Characters of µ  are 03BC
Form KD Normalized: Hexa-Decimal Characters of µ  are 03BC
 ________________________________________________________________

While reading information in Unicode_equivalence I found

The choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03 (ﬃ), ..... so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03.
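The ligature case from that quote is easy to try out. A minimal sketch (the class name `LigatureDemo` is made up for illustration):

```csharp
using System;
using System.Text;

class LigatureDemo
{
    static void Main()
    {
        string ligature = "\uFB03"; // LATIN SMALL LIGATURE FFI

        string nfc = ligature.Normalize(NormalizationForm.FormC);
        string nfkc = ligature.Normalize(NormalizationForm.FormKC);

        // NFC leaves the ligature as a single character, so "f" is not found.
        Console.WriteLine(nfc.Contains("f"));   // False

        // NFKC expands it to the three letters f, f, i.
        Console.WriteLine(nfkc.Contains("f"));  // True
        Console.WriteLine(nfkc == "ffi");       // True
    }
}
```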

So to compare for equivalence we should normally use FormKC (NFKC normalization) or FormKD (NFKD normalization).
I was a little curious to learn more about all the Unicode characters, so I wrote a sample that iterates over all the Unicode characters in UTF-16, and I got some results I want to discuss:

Information about characters whose FormC and FormD normalized values were not equivalent
Total: 12,118
Character (int value): 192-197, 199-207, 209-214, 217-221, 224-253, ..... 44032-55203
Information about characters whose FormKC and FormKD normalized values were not equivalent
Total: 12,245
Character (int value): 192-197, 199-207, 209-214, 217-221, 224-228, ..... 44032-55203, 64420-64421, 64432-64433, 64490-64507, 64512-64516, 64612-64617, 64663-64667, 64735-64736, 65153-65164, 65269-65274
For every character whose FormC and FormD normalized values were not equivalent, the FormKC and FormKD normalized values were also not equivalent, except for these characters
Characters: 901 '΅', 8129 '῁', 8141 '῍', 8142 '῎', 8143 '῏', 8157 '῝', 8158 '῞'
, 8159 '῟', 8173 '῭', 8174 '΅'
Extra characters whose FormKC and FormKD normalized values were not equivalent, but whose FormC and FormD normalized values were equivalent
Total: 119
Characters: 452 'Ǆ' 453 'ǅ' 454 'ǆ' 12814 '㈎' 12815 '㈏' 12816 '㈐' 12817 '㈑' 12818 '㈒' 12819 '㈓' 12820 '㈔' 12821 '㈕', 12822 '㈖' 12823 '㈗' 12824 '㈘' 12825 '㈙' 12826 '㈚' 12827 '㈛' 12828 '㈜' 12829 '㈝' 12830 '㈞' 12910 '㉮' 12911 '㉯' 12912 '㉰' 12913 '㉱' 12914 '㉲' 12915 '㉳' 12916 '㉴' 12917 '㉵' 12918 '㉶' 12919 '㉷' 12920 '㉸' 12921 '㉹' 12922 '㉺' 12923 '㉻' 12924 '㉼' 12925 '㉽' 12926 '㉾' 13056 '㌀' 13058 '㌂' 13060 '㌄' 13063 '㌇' 13070 '㌎' 13071 '㌏' 13072 '㌐' 13073 '㌑' 13075 '㌓' 13077 '㌕' 13080 '㌘' 13081 '㌙' 13082 '㌚' 13086 '㌞' 13089 '㌡' 13092 '㌤' 13093 '㌥' 13094 '㌦' 13099 '㌫' 13100 '㌬' 13101 '㌭' 13102 '㌮' 13103 '㌯' 13104 '㌰' 13105 '㌱' 13106 '㌲' 13108 '㌴' 13111 '㌷' 13112 '㌸' 13114 '㌺' 13115 '㌻' 13116 '㌼' 13117 '㌽' 13118 '㌾' 13120 '㍀' 13130 '㍊' 13131 '㍋' 13132 '㍌' 13134 '㍎' 13139 '㍓' 13140 '㍔' 13142 '㍖' .......... ﺋ' 65164 'ﺌ' 65269 'ﻵ' 65270 'ﻶ' 65271 'ﻷ' 65272 'ﻸ' 65273 'ﻹ' 65274'
Some characters cannot be normalized; attempting to normalize them throws an ArgumentException
Total: 2,081 Characters (int value): 55296-57343, 64976-65007, 65534
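The sample itself was not posted, so the following is only a guess at how such a survey could be written. `NormalizationSurvey` and its members are hypothetical names, and the exact counts depend on the Unicode data shipped with the runtime, so none are claimed here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class NormalizationSurvey
{
    public sealed class Result
    {
        public List<int> CDiffersFromD = new List<int>();   // FormC != FormD
        public List<int> KCDiffersFromKD = new List<int>(); // FormKC != FormKD
        public List<int> Rejected = new List<int>();        // Normalize threw
    }

    // Walks every UTF-16 code unit (0x0000 to 0xFFFF) as a one-character string.
    public static Result Run()
    {
        var result = new Result();
        for (int code = 0; code <= 0xFFFF; code++)
        {
            string s = ((char)code).ToString();
            try
            {
                if (s.Normalize(NormalizationForm.FormC) != s.Normalize(NormalizationForm.FormD))
                    result.CDiffersFromD.Add(code);
                if (s.Normalize(NormalizationForm.FormKC) != s.Normalize(NormalizationForm.FormKD))
                    result.KCDiffersFromKD.Add(code);
            }
            catch (ArgumentException)
            {
                // e.g. lone surrogates, which are not valid Unicode text on their own
                result.Rejected.Add(code);
            }
        }
        return result;
    }
}

class NormalizationSurveyDemo
{
    static void Main()
    {
        NormalizationSurvey.Result r = NormalizationSurvey.Run();
        Console.WriteLine("FormC != FormD:   " + r.CDiffersFromD.Count);
        Console.WriteLine("FormKC != FormKD: " + r.KCDiffersFromKD.Count);
        Console.WriteLine("Rejected:         " + r.Rejected.Count);
    }
}
```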

These links can be really helpful for understanding the rules that govern Unicode equivalence

Strange but works... I mean they are two different chars with different meanings and convert them to upper makes them equal? I dont see the logic but nice solution +1 — BudBrot, Dec 19 '13 at 07:36
This solution masks the problem, and could cause issues in a general case. This sort of test would find that `"m".ToUpper().Equals("µ".ToUpper());` and `"M".ToUpper().Equals("µ".ToUpper());` are also true. This may not be desirable. — Andrew Leach, Dec 19 '13 at 08:34
-1 – this is a terrible idea. Do not work with Unicode like this. — Konrad Rudolph, Dec 19 '13 at 11:45
Instead of ToUpper()-based tricks, why not use String.Equals("μ", "μ", StringComparison.CurrentCultureIgnoreCase)? — svenv, Dec 19 '13 at 12:08
There is one good reason to distinguish between "MICRO SIGN" and "GREEK SMALL LETTER MU" - to say that "uppercase" of micro sign is still micro sign. But capitalization changes micro to mega, happy engineering. — Greg, Dec 20 '13 at 09:49
@Greg great one Capitalization of MICRO changes it to MEGA(924) — Deepak Bhatia, Dec 20 '13 at 10:29
@Pengu There is always a logic associated with all the thing that happen in computer nothing is unknown, the logic behind them is that they are converted to there defined Uppercase letter which points to 'M' (924 MEGHA) as symbols are known as mu and micro — Deepak Bhatia, Dec 20 '13 at 10:30

score 9 · Answer 7 · answered Dec 19 '13 at 05:52

Most likely, there are two different character codes that make (visibly) the same character. While technically not equal, they look equal. Have a look at the character table and see whether there are multiple instances of that character. Or print out the character code of the two chars in your code.
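A quick way to print the codes, as a sketch (`CodeUnitDump` and `Describe` are made-up names; note that this lists UTF-16 code units, so characters outside the BMP show up as two surrogate values):

```csharp
using System;
using System.Linq;

static class CodeUnitDump
{
    // Formats every UTF-16 code unit of the text as U+XXXX, separated by spaces.
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}

class CodeUnitDumpDemo
{
    static void Main()
    {
        Console.WriteLine(CodeUnitDump.Describe("\u03BC")); // U+03BC (Greek small letter mu)
        Console.WriteLine(CodeUnitDump.Describe("\u00B5")); // U+00B5 (micro sign)
        Console.WriteLine(CodeUnitDump.Describe("5 \u00B5m")); // U+0035 U+0020 U+00B5 U+006D
    }
}
```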

hippietrail · Answer 8 · 2013-12-20T03:37:02.157

6

You ask "how to compare them" but you don't tell us what you want to do.

There are at least two main ways to compare them:

Either you compare them directly as you are and they are different

Or you use Unicode Compatibility Normalization if your need is for a comparison that finds them to match.

There could be a problem though because Unicode compatibility normalization will make many other characters compare equal. If you want only these two characters to be treated as alike you should roll your own normalization or comparison functions.
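A "roll your own" comparison for just this pair could look like the sketch below. `MicroSignFold` and its methods are hypothetical names; the only mapping applied is MICRO SIGN (U+00B5) to GREEK SMALL LETTER MU (U+03BC), and everything else is compared as-is.

```csharp
using System;

static class MicroSignFold
{
    // Replaces the micro sign with the Greek letter mu and touches nothing else.
    public static string Fold(string text)
    {
        return text.Replace('\u00B5', '\u03BC');
    }

    public static bool EqualsIgnoringMicroVsMu(string a, string b)
    {
        return string.Equals(Fold(a), Fold(b), StringComparison.Ordinal);
    }
}

class MicroSignFoldDemo
{
    static void Main()
    {
        Console.WriteLine(MicroSignFold.EqualsIgnoringMicroVsMu("10 \u00B5F", "10 \u03BCF")); // True

        // Unlike compatibility normalization, other characters are left alone:
        // the ffi ligature (U+FB03) does not match the three letters "ffi".
        Console.WriteLine(MicroSignFold.EqualsIgnoringMicroVsMu("\uFB03", "ffi"));            // False
    }
}
```

If more pairs turn up later, the single `Replace` call could grow into a small lookup table, which keeps the set of characters treated as equal explicit.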

For a more specific solution we need to know your specific problem. What is the context under which you came across this problem?

edited Dec 20 '13 at 03:37

answered Dec 19 '13 at 12:24

hippietrail


1

Are the "micro sign" and the lowercase mu character canonically equivalent? Using canonical normalization would give you a more strict comparison. – Tanner Swett Dec 19 '13 at 18:52
@TannerL.Swett: Actually I'm not even sure how to check that off the top of my head ... – hippietrail Dec 19 '13 at 19:08
1

Actually, I was importing a file with physics formula. You are right about normalization. I have to go through it more deeply.. – D J Dec 19 '13 at 23:40
What kind of file? Something hand-made in plain Unicode text by a person? Or something output by an app in a specific format? – hippietrail Dec 20 '13 at 03:36

score 5 · Answer 9 · answered Dec 19 '13 at 06:35

If I wanted to be pedantic, I would say that your question doesn't make sense, but since we are approaching Christmas and the birds are singing, I'll proceed with this.

First off, the two entities that you are trying to compare are glyphs. A glyph is part of a set of glyphs provided by what is usually known as a "font", the thing that usually comes in a ttf, otf, or whatever file format you are using.

The glyphs are a representation of a given symbol, and since that representation depends on a specific set, you can't just expect to have two similar, or even "better", identical symbols. The phrase doesn't make sense in this context; you should at least specify which font or set of glyphs you are considering when you formulate a question like this.

What is usually used to solve a problem like the one you are encountering is OCR: essentially software that recognizes and compares glyphs. I don't know whether C# provides OCR by default, but it's generally a really bad idea unless you really need OCR and know what to do with it.

You can possibly end up interpreting a physics book as an ancient greek book without mentioning the fact that OCR are generally expensive in terms of resources.

There is a reason why those characters are localized the way they are localized, just don't do that.

score 2 · Answer 10 · answered Jan 24 '14 at 15:46

It's possible to draw both characters with the same font style and size using the DrawString method. Once two bitmaps with the symbols have been generated, they can be compared pixel by pixel.

The advantage of this method is that you can compare not only absolutely identical characters, but similar ones too (within a definite tolerance).

answered Jan 24 '14 at 15:46

Ivan Kochurkin


This answer is nonsense. If you have a list of hundreds of string this will be EXTREMELY slow. – Elmue Mar 15 '21 at 21:07
