I often use Char.IsDigit to check whether a char is a digit. It is especially handy in LINQ queries as a pre-check before int.Parse, for example: "123".All(Char.IsDigit).

But some chars count as digits for Char.IsDigit and still can't be parsed to an int, like ۵.

// true
bool isDigit = Char.IsDigit('۵'); 

var cultures = CultureInfo.GetCultures(CultureTypes.SpecificCultures);
int num;
// false
bool isIntForAnyCulture = cultures
    .Any(c => int.TryParse('۵'.ToString(), NumberStyles.Any, c, out num)); 

Why is that? Is my int.Parse-precheck via Char.IsDigit thus incorrect?

There are 310 chars which are digits:

List<char> digitList = Enumerable.Range(0, char.MaxValue + 1) // include U+FFFF, the last char value
   .Select(i => Convert.ToChar(i))
   .Where(c => Char.IsDigit(c))
   .ToList(); 

Here's the implementation of Char.IsDigit in .NET 4 (ILSpy):

public static bool IsDigit(char c)
{
    if (char.IsLatin1(c))
    {
        return c >= '0' && c <= '9';
    }
    return CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.DecimalDigitNumber;
}

So why are there chars in the DecimalDigitNumber category ("Decimal digit character, that is, a character in the range 0 through 9...") which can't be parsed to an int in any culture?

Tim Schmelter

2 Answers

That's because Char.IsDigit accepts every character in the Unicode "Number, Decimal Digit" (Nd) category, as listed here:

http://www.fileformat.info/info/unicode/category/Nd/list.htm

That doesn't mean the character is a valid numeric character in the current locale. In fact, int.Parse() can ONLY parse the ASCII digits 0 through 9, regardless of the locale setting.

For example, this doesn't work:

int test = int.Parse("٣", CultureInfo.GetCultureInfo("ar"));

Even though ٣ is a valid Arabic digit character, and "ar" is the Arabic locale identifier.

The Microsoft article "How to: Parse Unicode Digits" states that:

The only Unicode digits that the .NET Framework parses as decimals are the ASCII digits 0 through 9, specified by the code values U+0030 through U+0039. The .NET Framework parses all other Unicode digits as characters.

However, note that you can use char.GetNumericValue() to convert a Unicode numeric character to its numeric equivalent as a double.

The return value is a double rather than an int because of things like this:

Console.WriteLine(char.GetNumericValue('¼')); // Prints 0.25

You could use something like this to convert all decimal digit characters in a string into their ASCII equivalents:

public string ConvertNumericChars(string input)
{
    StringBuilder output = new StringBuilder();

    foreach (char ch in input)
    {
        if (char.IsDigit(ch))
        {
            double value = char.GetNumericValue(ch);

            if ((value >= 0) && (value <= 9) && (value == (int)value))
            {
                output.Append((char)('0'+(int)value));
                continue;
            }
        }

        output.Append(ch);
    }

    return output.ToString();
}
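
To connect this back to parsing, here is a minimal sketch of the same idea wrapped in a TryParse-style helper. The names DigitNormalizer, ToAsciiDigits and TryParseUnicodeInt are made up for illustration. It uses CharUnicodeInfo.GetDecimalDigitValue instead of char.GetNumericValue, which avoids the round trip through double, and it assumes all digits are in the Basic Multilingual Plane (digits encoded as surrogate pairs are left untouched).

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of the .NET Framework.
public static class DigitNormalizer
{
    // Replaces every Unicode decimal digit (category Nd) with its ASCII counterpart.
    public static string ToAsciiDigits(string input)
    {
        var output = new StringBuilder(input.Length);

        foreach (char ch in input)
        {
            if (char.IsDigit(ch))
            {
                // Returns 0 through 9 for decimal digits, -1 otherwise.
                int value = CharUnicodeInfo.GetDecimalDigitValue(ch);

                if (value >= 0)
                {
                    output.Append((char)('0' + value));
                    continue;
                }
            }

            // Anything that is not a decimal digit is copied unchanged.
            output.Append(ch);
        }

        return output.ToString();
    }

    // Normalizes the digits first, then lets int.TryParse do the real parsing.
    public static bool TryParseUnicodeInt(string input, out int result)
    {
        return int.TryParse(
            ToAsciiDigits(input),
            NumberStyles.Integer,
            CultureInfo.InvariantCulture,
            out result);
    }
}
```

With this in place, DigitNormalizer.TryParseUnicodeInt("١٢٣", out n) should succeed with n == 123, while characters such as '¼' are passed through as-is and make the parse fail as before.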
Matthew Watson
  • Does that mean that `int.Parse` (or the other numeric `Parse` methods) doesn't care about the culture regarding the characters? Is the culture used only to identify e.g. the decimal separator? That sounds as if `int.Parse` is less correct than `Char.IsDigit`. – Tim Schmelter Feb 27 '14 at 09:20
  • @TimSchmelter That is a very good question! It's answered [here](http://msdn.microsoft.com/en-us/library/w1c0s6bb.aspx): `The only Unicode digits that the .NET Framework parses as decimals are the ASCII digits 0 through 9` – Matthew Watson Feb 27 '14 at 09:39
  • `Decimal.Parse` also fails on every character `> 9` even if "How to: Parse Unicode Digits" suggests that it would work. So there is no way to parse these unicode digits in the .NET framework? – Tim Schmelter Feb 27 '14 at 09:56
  • That might work with a single char. But I have no idea whether it also works with a string. – Tim Schmelter Feb 27 '14 at 10:11
  • @TimSchmelter It doesn't - you have to do a horrible clunky pass through the entire string replacing characters as necessary (as per my example in the answer above) – Matthew Watson Feb 27 '14 at 10:23

Decimal digits are 0 to 9, but they have many representations in Unicode. From Wikipedia:

The decimal digits are repeated in 23 separate blocks

MSDN specifies that .NET only parses Latin numerals:

However, the only numeric digits recognized by parsing methods are the basic Latin digits 0-9 with code points from U+0030 to U+0039
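
Given that, a pre-check for int.Parse can test for the ASCII range directly instead of relying on Char.IsDigit. A minimal sketch, with AsciiDigitCheck and its method names invented for illustration:

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of the .NET Framework.
public static class AsciiDigitCheck
{
    // True only for U+0030 through U+0039, the digits the parsing methods accept.
    public static bool IsAsciiDigit(char c)
    {
        return c >= '0' && c <= '9';
    }

    // True if the string is non-empty and consists solely of ASCII digits.
    public static bool IsAllAsciiDigits(string s)
    {
        return !string.IsNullOrEmpty(s) && s.All(IsAsciiDigit);
    }
}
```

Note that a string which passes this check can still overflow an int, so int.TryParse remains the safer call for input of unknown length.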

Eli Arbel