Ignoring accented letters in string comparison

Question

I need to compare 2 strings in C# and treat accented letters the same as non-accented letters. For example:

string s1 = "hello";
string s2 = "héllo";

s1.Equals(s2, StringComparison.InvariantCultureIgnoreCase);
s1.Equals(s2, StringComparison.OrdinalIgnoreCase);

These 2 strings need to be the same (as far as my application is concerned), but both of these statements evaluate to false. Is there a way in C# to do this?

Serge Wautier · Accepted Answer · 2022-10-11T08:54:29.343

FWIW, knightpfhor's answer below (as of this writing) should be the accepted answer.

Here's a function that strips diacritics from a string:

static string RemoveDiacritics(string text)
{
  string formD = text.Normalize(NormalizationForm.FormD);
  StringBuilder sb = new StringBuilder();

  foreach (char ch in formD)
  {
    UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
    if (uc != UnicodeCategory.NonSpacingMark)
    {
      sb.Append(ch);
    }
  }

  return sb.ToString().Normalize(NormalizationForm.FormC);
}

More details on MichKap's blog (RIP...).

The principle is that it turns 'é' into 2 successive chars: 'e' followed by a combining acute accent. It then iterates through the chars and skips the diacritics.

"héllo" becomes "he<acute>llo", which in turn becomes "hello".

Debug.Assert("hello"==RemoveDiacritics("héllo"));
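To see the decomposition step for yourself, here is a small sketch that lists the chars produced by FormD together with their Unicode category. The class and method names (DecompositionDemo, Describe) are made up for illustration, and it assumes the runtime's normal globalization support is available.

```csharp
using System.Globalization;
using System.Linq;
using System.Text;

static class DecompositionDemo
{
    // Returns one "U+XXXX Category" entry per UTF-16 char of the FormD (decomposed) string.
    // Hypothetical helper, not part of the original answer.
    public static string[] Describe(string text)
    {
        return text.Normalize(NormalizationForm.FormD)
                   .Select(ch => string.Format("U+{0:X4} {1}",
                                               (int)ch,
                                               CharUnicodeInfo.GetUnicodeCategory(ch)))
                   .ToArray();
    }
}
```

DecompositionDemo.Describe("h\u00E9llo") returns six entries instead of five; the third one is "U+0301 NonSpacingMark", which is exactly the char that RemoveDiacritics filters out.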

Note: Here's a more compact .NET4+ friendly version of the same function:

static string RemoveDiacritics(string text)
{
  return string.Concat( 
      text.Normalize(NormalizationForm.FormD)
      .Where(ch => CharUnicodeInfo.GetUnicodeCategory(ch)!=
                                    UnicodeCategory.NonSpacingMark)
    ).Normalize(NormalizationForm.FormC);
}

How to do it in .net core since it does not have `string.Normalize`? — Andre Soares, Mar 31 '17 at 11:28
Thanks for this, I wish I could upvote more than once! However, it doesn't handle all accented letters, for example ð, ħ and ø are not converted to o, h and o respectively. Is there any way to handle these as well? — Avrohom Yisroel, Aug 02 '17 at 14:07
@AvrohomYisroel the "ð" is a "Latin Small Letter Eth", which is a separate letter, not a "o-with-accent" or "d-with-accent". The others are "Latin Small Letter H With Stroke" and "Latin Small Letter O With Stroke" that may also be considered separate letters — Hans Keﬆing, Mar 01 '19 at 11:42
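If your application does want those stroke letters folded as well, one possible workaround is an explicit substitution table applied on top of the diacritic stripping. This is only a sketch: the table contents and the names Folding, Extra and RemoveDiacriticsAndFold are my own assumptions, and mapping 'ð' to "d" (or to anything at all) is an application decision, not something Unicode defines.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class Folding
{
    // Hypothetical, application-specific substitutions for letters that have
    // no canonical decomposition, so FormD leaves them untouched.
    static readonly Dictionary<char, string> Extra = new Dictionary<char, string>
    {
        { '\u00F8', "o" }, { '\u00D8', "O" },   // ø, Ø
        { '\u0127', "h" }, { '\u0126', "H" },   // ħ, Ħ
        { '\u00F0', "d" }, { '\u00D0', "D" },   // ð, Ð (assumed mapping)
    };

    public static string RemoveDiacriticsAndFold(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char ch in text.Normalize(NormalizationForm.FormD))
        {
            // Drop combining marks, as RemoveDiacritics does.
            if (CharUnicodeInfo.GetUnicodeCategory(ch) == UnicodeCategory.NonSpacingMark)
                continue;

            // Substitute the extra letters; keep everything else as-is.
            string replacement;
            if (Extra.TryGetValue(ch, out replacement))
                sb.Append(replacement);
            else
                sb.Append(ch);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With this table, Folding.RemoveDiacriticsAndFold("S\u00F8ren h\u00E9llo") gives "Soren hello". Extend or trim the table to match whatever your users expect.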

knightpfhor · Answer 2 · 2015-01-13T20:24:42.763

score 163

If you don't need to convert the string and you just want to check for equality you can use

string s1 = "hello";
string s2 = "héllo";

if (String.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace) == 0)
{
    // both strings are equal
}

or if you want the comparison to be case insensitive as well

string s1 = "HEllO";
string s2 = "héLLo";

if (String.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase) == 0)
{
    // both strings are equal
}

edited Jan 13 '15 at 20:24

answered Oct 11 '11 at 02:48


If anyone else is curious about this IgnoreNonSpace option, you might want to read this discussion on it. http://www.pcreview.co.uk/forums/accent-insensitive-t3924592.html TLDR; it's ok :) – Jim W Mar 06 '14 at 04:25
on msdn : "The Unicode Standard defines combining characters as characters that are combined with base characters to produce a new character. Nonspacing combining characters do not occupy a spacing position by themselves when rendered." – Avlin Apr 24 '14 at 09:15
ok this method failed for these 2 strings : tarafli / TARAFLİ however SQL server says equal as supposed to be – Furkan Gözükara Jan 12 '15 at 15:38
That is because generally SQL Server is configured to be case insensitive but by default comparisons in .Net are case sensitive. I've updated the answer to show how to make this case insensitive. – knightpfhor Jan 13 '15 at 20:25
I'm trying to create a IEqualityComparer. It needs to provide GetHashCode... How do you get that (it needs to be the same if it is equal) – Yepeekai Sep 04 '19 at 21:08
In case someone is interested for the HashCode: CultureInfo.CurrentCulture.CompareInfo.GetHashCode(obj, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase) – Yepeekai Sep 04 '19 at 21:27
Even better, with .Net Core, we can get a StringComparer : `StringComparer.Create(CultureInfo.CurrentCulture, CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace)`. (Not available for .Net Framework, unless going through reflection.) – Frédéric Dec 21 '22 at 15:28
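Putting the comments about hash codes together, an IEqualityComparer<string> could look roughly like the sketch below. The class name AccentInsensitiveComparer is made up; it relies on CompareInfo.Compare and on the CompareInfo.GetHashCode(string, CompareOptions) overload mentioned above, which I believe needs .NET Framework 4.7.1 or later, or .NET Core 2.0 or later. It also assumes the app is not running in invariant globalization mode, where culture-aware options may not behave as described.

```csharp
using System.Collections.Generic;
using System.Globalization;

sealed class AccentInsensitiveComparer : IEqualityComparer<string>
{
    // The same options must be used for equality and hashing.
    const CompareOptions Options =
        CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase;

    readonly CompareInfo _compareInfo;

    public AccentInsensitiveComparer(CultureInfo culture)
    {
        _compareInfo = culture.CompareInfo;
    }

    public bool Equals(string x, string y)
    {
        return _compareInfo.Compare(x, y, Options) == 0;
    }

    public int GetHashCode(string obj)
    {
        // Strings that compare equal under Options must produce the same hash.
        return obj == null ? 0 : _compareInfo.GetHashCode(obj, Options);
    }
}
```

It can then be passed to a collection, e.g. new HashSet<string>(new AccentInsensitiveComparer(CultureInfo.InvariantCulture)), so that "hello" and "HÉLLO" end up as the same key.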

score 6 · Answer 3 · answered Dec 19 '13 at 16:15

I had to do something similar but with a StartsWith method. Here is a simple solution derived from @Serge - appTranslator.

Here is an extension method:

    public static bool StartsWith(this string str, string value, CultureInfo culture, CompareOptions options)
    {
        if (str.Length >= value.Length)
            return string.Compare(str.Substring(0, value.Length), value, culture, options) == 0;
        else
            return false;            
    }

And for one-liner freaks ;)

    public static bool StartsWith(this string str, string value, CultureInfo culture, CompareOptions options)
    {
        return str.Length >= value.Length && string.Compare(str.Substring(0, value.Length), value, culture, options) == 0;
    }

An accent-insensitive and case-insensitive StartsWith can be called like this:

value.ToString().StartsWith(str, CultureInfo.InvariantCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase)
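One caveat with the Substring approach: it assumes the matching prefix has the same number of chars as value, which is not guaranteed when one string is precomposed ("é" as one char) and the other is decomposed ("e" plus a combining accent). A possible alternative, sketched here with a made-up extension name, is to let CompareInfo.IsPrefix do the prefix matching:

```csharp
using System.Globalization;

static class PrefixExtensions
{
    // Hypothetical extension method; delegates the prefix test to the culture's
    // CompareInfo instead of slicing the string by length first.
    public static bool StartsWithIgnoringAccents(
        this string str, string value, CultureInfo culture)
    {
        return culture.CompareInfo.IsPrefix(
            str, value, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase);
    }
}
```

For example, "e\u0301llo".StartsWithIgnoringAccents("\u00E9l", CultureInfo.InvariantCulture) should match, whereas the Substring version compares "e\u0301" against "\u00E9l" and reports no match.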

Ryan Cook · Answer 4 · 2008-12-11T17:06:21.613

The following method CompareIgnoreAccents(...) works on your example data. Here is the article where I got my background information: http://www.codeproject.com/KB/cs/EncodingAccents.aspx

private static bool CompareIgnoreAccents(string s1, string s2)
{
    return string.Compare(
        RemoveAccents(s1), RemoveAccents(s2), StringComparison.InvariantCultureIgnoreCase) == 0;
}

private static string RemoveAccents(string s)
{
    Encoding destEncoding = Encoding.GetEncoding("iso-8859-8");

    return destEncoding.GetString(
        Encoding.Convert(Encoding.UTF8, destEncoding, Encoding.UTF8.GetBytes(s)));
}

I think an extension method would be better:

public static string RemoveAccents(this string s)
{
    Encoding destEncoding = Encoding.GetEncoding("iso-8859-8");

    return destEncoding.GetString(
        Encoding.Convert(Encoding.UTF8, destEncoding, Encoding.UTF8.GetBytes(s)));
}

Then the use would be this:

if (string.Compare(s1.RemoveAccents(), s2.RemoveAccents(), true) == 0)
{
   // ...
}

This is a destructive comparison, where for instance ā and ē will be treated as equal. You lose any characters above 0xFF and there's no guarantee that the strings are equal-ignoring-accents. — Abel, May 07 '13 at 15:18
You lose as well things like ñ. Not a solution if you ask me. — Ignacio Soler Garcia, Feb 02 '16 at 08:45

score 1 · Answer 5 · answered Sep 01 '14 at 13:05

A simpler way to remove accents:

    Dim source As String = "áéíóúç"
    Dim result As String

    Dim bytes As Byte() = Encoding.GetEncoding("Cyrillic").GetBytes(source)
    result = Encoding.ASCII.GetString(bytes)

Newton Carlos Dantas

score -4 · Answer 6 · 2008-12-11T16:15:40.280

Try this overload of the String.Compare method.

String.Compare Method (String, String, Boolean, CultureInfo)

It produces an int value based on a comparison that takes the CultureInfo into account. The example on the page compares "change" and "dollar" in en-US and cs-CZ. In cs-CZ, "ch" is a single "letter" that sorts after "d".

Example from the link:

using System;
using System.Globalization;

class Sample {
    public static void Main() {
        String str1 = "change";
        String str2 = "dollar";
        String relation = null;

        relation = symbol( String.Compare(str1, str2, false, new CultureInfo("en-US")) );
        Console.WriteLine("For en-US: {0} {1} {2}", str1, relation, str2);

        relation = symbol( String.Compare(str1, str2, false, new CultureInfo("cs-CZ")) );
        Console.WriteLine("For cs-CZ: {0} {1} {2}", str1, relation, str2);
    }

    private static String symbol(int r) {
        String s = "=";
        if      (r < 0) s = "<";
        else if (r > 0) s = ">";
        return s;
    }
}
/*
This example produces the following results.
For en-US: change < dollar
For cs-CZ: change > dollar
*/

Therefore, for accented languages you will need to get the culture and then compare the strings based on that.

http://msdn.microsoft.com/en-us/library/hyxc48dt.aspx

This is a better approach than directly comparing the strings, but it still considers the base letter and its accented version *different*. Therefore it doesn't answer the original question, which wanted accents to be ignored. — C.B., May 15 '13 at 14:43
