Why does Apache Commons consider '१२३' numeric?

Question

According to Apache Commons Lang's documentation for StringUtils.isNumeric(), the String '१२३' is numeric.

Since I believed this might be a mistake in the documentation, I ran tests to verify the statement. I found that according to Apache Commons it is numeric.

Why is this String numeric? What do those characters represent?

Perhaps they represent digits in some language. Not all languages use the symbols 0 to 9 to represent digits. — Eran, Oct 20 '16 at 08:00
You can get the integer value by `Integer.parseInt("१२३")`. — , Oct 20 '16 at 08:27
`"ⅯⅭⅯⅬⅩⅩⅩⅤ".isnumeric()` is also True (in Python, but presumably in Java too), as is `"⅕".isnumeric()` — gerrit, Oct 20 '16 at 13:48
step 1: realize that those characters are not in your alphabet. step 2: realize that it is probably a different language. step 3: let google translate figure it out: https://translate.google.com/#auto/en/%E0%A5%A7%E0%A5%A8%E0%A5%A9 — njzk2, Oct 20 '16 at 16:56
@gerrit: But what about `"π".isnumeric()` or `"e".isnumeric()`? — dan04, Oct 20 '16 at 23:20
@dan04 Those are not numbers, those are letters that are popular to represent particular constants. Note the difference between `ⅯⅭ` and `MC`. — gerrit, Oct 21 '16 at 01:17
That's why [\d is less efficient than \[0-9\]](http://stackoverflow.com/q/16621738/995714). [Should I use \d or \[0-9\] to match digits in a Perl regex?](http://stackoverflow.com/q/890686/995714) — phuclv, Oct 21 '16 at 04:46
@LưuVĩnhPhúc Not in Java. In Java, `\d` is a synonym for `[0-9]`. It won't match the Devanagari digits. — Dawood ibn Kareem, Oct 21 '16 at 09:44
If you're using Firefox, get the Identify Characters extension! — Anton Sherwood, Oct 22 '16 at 03:57
@AntonSherwood Yes, and so are Marathi, Bhojpuri, Awadhi, Magahi, Maithili, Nepali, Pali, Konkani, Bodo, Sindhi and Sanskrit and many more. Devanagari is a script, like Latin; Hindi and Marathi are languages, like English. — Ashish Patil, Oct 22 '16 at 06:18
@AshishPatil So how can Sujan say it's not Hindi (rather than “it's not necessarily Hindi”)? — Anton Sherwood, Oct 22 '16 at 06:33
It's basically Sanskrit; `0` was invented in this language. A simple Google search for Sanskrit numbers turns up references such as http://www.2indya.com/2011/06/22/sanskrit-counting-1-to-100/ — bananas, Oct 22 '16 at 09:20
GNU Calculator (a Linux graphical app) also recognizes it as numeric (although the result is shown in Arabic numerals): १२३+0=123; १२३+100=223; १२३+0=123; १२३+123=246 — Luciano, Oct 31 '16 at 13:25
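
Two of the claims in the comments above can be checked with the JDK alone. The sketch below uses illustrative class and method names that are not from the thread. It parses the Devanagari string with `Integer.parseInt` and tests it against `\d` with and without `Pattern.UNICODE_CHARACTER_CLASS`. The string is written with Unicode escapes so that the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

// Illustrative demo class; the names here are not part of any library.
final class DevanagariParseDemo {

    // The string from the question: Devanagari digits one, two, three.
    static final String DEVANAGARI_123 = "\u0967\u0968\u0969";

    // Integer.parseInt resolves each char through Character.digit, which
    // understands every Unicode decimal digit, not only ASCII '0'..'9'.
    static int parse(String s) {
        return Integer.parseInt(s);
    }

    // By default, \d in java.util.regex means [0-9].
    static boolean matchesAsciiDigits(String s) {
        return Pattern.compile("\\d+").matcher(s).matches();
    }

    // With UNICODE_CHARACTER_CLASS, \d matches any Unicode decimal digit.
    static boolean matchesUnicodeDigits(String s) {
        return Pattern.compile("\\d+", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .matches();
    }

    public static void main(String[] args) {
        System.out.println(parse(DEVANAGARI_123));                // 123
        System.out.println(matchesAsciiDigits(DEVANAGARI_123));   // false
        System.out.println(matchesUnicodeDigits(DEVANAGARI_123)); // true
    }
}
```

So both comments hold: `parseInt` accepts the string, while a default `\d` does not match it.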

score 200 · Accepted Answer · edited Jun 20 '20 at 09:12

Because that "CharSequence contains only Unicode digits" (quoting your linked documentation).

All of the characters return true for Character.isDigit:

Some Unicode character ranges that contain digits:

'\u0030' through '\u0039', ISO-LATIN-1 digits ('0' through '9')

'\u0660' through '\u0669', Arabic-Indic digits

'\u06F0' through '\u06F9', Extended Arabic-Indic digits

'\u0966' through '\u096F', Devanagari digits

'\uFF10' through '\uFF19', Fullwidth digits

Many other character ranges contain digits as well.

१२३ are Devanagari digits: १ is '\u0967' (one), २ is '\u0968' (two), and ३ is '\u0969' (three).
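
As a minimal sketch of what "contains only Unicode digits" means in practice, the helper below applies `Character.isDigit` to every character. It is an illustration with a hypothetical name (`containsOnlyDigits`), not the Commons Lang source, and treating null or empty input as not numeric is an assumption made for this sketch.

```java
// Hypothetical helper illustrating an "only Unicode digits" check.
final class UnicodeDigitCheck {

    // Returns true when every char of cs satisfies Character.isDigit.
    // Assumption for this sketch: null and empty input count as not numeric.
    static boolean containsOnlyDigits(CharSequence cs) {
        if (cs == null || cs.length() == 0) {
            return false;
        }
        for (int i = 0; i < cs.length(); i++) {
            if (!Character.isDigit(cs.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(containsOnlyDigits("123"));                // true
        System.out.println(containsOnlyDigits("\u0967\u0968\u0969")); // true (Devanagari)
        System.out.println(containsOnlyDigits("\u0661\u0662\u0663")); // true (Arabic-Indic)
        System.out.println(containsOnlyDigits("\uFF11\uFF12\uFF13")); // true (fullwidth)
        System.out.println(containsOnlyDigits("12.3"));               // false
    }
}
```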

answered Oct 20 '16 at 08:03

Andy Turner

@Joker_vD well, you've not specified which overload, so yes, sure: [`Integer.parseInt("222", 2)`](http://ideone.com/xDtkYY). – Andy Turner Oct 20 '16 at 10:47
@Joker_vD It's not even hard; there are many unsupported languages. Even so, there's the Chinese `亿`, which represents 10^8; this to the power of 3 would cause an overflow. *[List of numeral systems](https://en.wikipedia.org/wiki/Numerical_digit#Numerals_in_most_popular_systems)* – Cedric Reichenbach Oct 20 '16 at 13:48
`Integer.parseInt()` will probably fail if the digits are not meant to be consecutive (like the Japanese numbers 1, 2, 3, ...) – Jean-François Fabre Oct 20 '16 at 17:40
@CedricReichenbach: The key distinction there is that while 亿 is *numeric* (by the standards of having one of the non-None values of Numeric_Type, in this case Numeric_Type=Numeric), it's not any sort of *digit*. (Even if it were, you wouldn't take it to the power of 3; you would raise the *radix* to various powers, not the *digits*.) `parseInt` requires digits, and perhaps confusingly, the `isNumeric` method in this question tests for decimal digit characters (General_Category=Decimal_Number) instead of any broader category of numeric characters. – user2357112 Oct 20 '16 at 19:48
The complete set of Devanagari digits is `०१२३४५६७८९`. – dan04 Oct 20 '16 at 23:28
What did Joker_vD say? – v7d8dpo4 Oct 21 '16 at 08:56
@v7d8dpo4 (s)he asked if there was a way to get `Integer.parseInt()` to throw an exception for a 3-character numeric input string. – Andy Turner Oct 21 '16 at 08:57
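
The digit-versus-numeric distinction drawn in these comments can be observed through `Character.getType`. The sketch below uses an illustrative class name and checks a Devanagari digit, the Roman numeral one thousand (U+216F) and the fraction one fifth (U+2155). Only the first is a decimal digit. It also covers the `Integer.parseInt("222", 2)` case mentioned above, which throws because 2 is not a valid binary digit.

```java
// Illustrative demo class; names are not taken from any library.
final class NumericCategoryDemo {

    // True only for General_Category=Decimal_Number (Nd) code points.
    static boolean isDecimalDigit(int codePoint) {
        return Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER;
    }

    // True when parseInt rejects the input in the given radix.
    static boolean parseFails(String s, int radix) {
        try {
            Integer.parseInt(s, radix);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isDecimalDigit(0x0967)); // Devanagari digit one: true
        System.out.println(isDecimalDigit(0x216F)); // Roman numeral one thousand: false
        System.out.println(isDecimalDigit(0x2155)); // vulgar fraction one fifth: false
        System.out.println(Character.getType(0x216F) == Character.LETTER_NUMBER); // true
        System.out.println(Character.getType(0x2155) == Character.OTHER_NUMBER);  // true
        System.out.println(parseFails("222", 2));   // true
    }
}
```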

score 59 · Answer 2 · edited Sep 25 '17 at 05:22

The string १२३ is the same as 123 in Nepali or any other language written in the Devanagari script, such as Hindi or Marathi, and is therefore a number as far as Apache Commons is concerned.

answered Oct 20 '16 at 08:01

ΦXocę 웃 Пepeúpa ツ

That thing almost looks like "123" in Arabic numerals. – Panzercrisis Oct 21 '16 at 18:50
Arabs got their numerals from Indians. – Oct 21 '16 at 20:31
@rahul Arabic numbers are 1-9, not ١-٩ as commonly thought. – Maroun Oct 22 '16 at 06:50

Maroun · Answer 3 · 2016-10-21T11:13:50.310

You can use Character#getType to check the character's general category:

System.out.println(Character.DECIMAL_DIGIT_NUMBER == Character.getType('१'));

This prints true, which is evidence that '१' is a decimal digit.

Now let's examine the Unicode value of the '१' character:

System.out.println(Integer.toHexString('१'));
// 967

This value is in the range of Devanagari digits, which is \u0966 through \u096F.

Also try:

Character.UnicodeBlock block = Character.UnicodeBlock.of('१');
System.out.println(block.toString());
// DEVANAGARI

Devanagari is an abugida (alphasyllabary) alphabet of India and Nepal.

"१२३" is the Devanagari equivalent of "123" (Basic Latin in Unicode).

It's more significant that they're of type `DECIMAL_DIGIT_NUMBER` than that they're in the `DEVANAGARI` block. There are non-digit letters in that block too. — Andy Turner, Oct 20 '16 at 08:10
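
To illustrate the point made in the comment above, the sketch below (illustrative names only) compares the digit '१' (U+0967) with the letter 'क' (U+0915, DEVANAGARI LETTER KA). Both are in the `DEVANAGARI` block, but only the first has type `DECIMAL_DIGIT_NUMBER`.

```java
// Illustrative demo class; names are not taken from any library.
final class BlockVersusTypeDemo {

    static boolean inDevanagariBlock(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.DEVANAGARI;
    }

    static boolean isDecimalDigit(char c) {
        return Character.getType(c) == Character.DECIMAL_DIGIT_NUMBER;
    }

    public static void main(String[] args) {
        char digitOne = '\u0967'; // DEVANAGARI DIGIT ONE
        char letterKa = '\u0915'; // DEVANAGARI LETTER KA

        System.out.println(inDevanagariBlock(digitOne)); // true
        System.out.println(inDevanagariBlock(letterKa)); // true
        System.out.println(isDecimalDigit(digitOne));    // true
        System.out.println(isDecimalDigit(letterKa));    // false
    }
}
```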

Solomon Rutzky · Answer 4 · 2016-10-26T05:02:53.823

If you ever want to know what properties a particular "character" has (and there are quite a few), go directly to the source: Unicode.org. They have research tools that can show you most anything you would care to know.

If you want to see all of the properties of a specific character, try the following:

http://unicode.org/cldr/utility/character.jsp?a=१

or:

http://unicode.org/cldr/utility/character.jsp?a=%E0%A5%A7

If you want to see all characters classified as "decimal digits" (i.e. with number values of 0 through 9), try the following:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Type=Decimal:]
(550 Code Points, currently / as of Unicode 9.0)

If you want to see all characters classified as "non-decimal digit numbers" (i.e. fractions, circled, etc.), try the following:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Type=Numeric:]
(836 Code Points, currently / as of Unicode 9.0)

If you want to see all characters classified as "decimal digits" (i.e. with number values of 0 through 9), but only up through Unicode 6.0 (which .NET uses), try the following:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Type=Decimal:]%26[:Age=6.0:]
(420 Code Points, and shouldn't change)

If you want to see all characters classified as "decimal digits" (i.e. with number values of 0 through 9), but only up through Unicode 6.0 (which .NET uses), and only in the Basic Multilingual Plane / no Supplementary Characters (i.e. nothing above Code Point 65535 / U+FFFF), try the following:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Type=Decimal:]%26[:Age=6.0:]%26[:bmp=Yes:]
(350 Code Points, and shouldn't change)

KEEP IN MIND: The Unicode Consortium produces a specification, not software. This means that it is up to each software vendor to implement the specification as accurately as they can. So just like HTML, JavaScript, CSS, SQL, etc, there is variation between different platforms, languages, and so on. For example, I found a bug in Microsoft's .NET Framework whereby circled Latin letters A-Z and a-z -- Code Points 0x24B6 through 0x24E9 -- do not properly register as being char.IsLetter = true (bug report here). And that leads to unexpected behavior in related functionality, such as when calling the TextInfo.ToTitleCase() method (bug report here).
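
In the same spirit, a JVM can be asked directly which code points it classifies as decimal digits. The sketch below uses illustrative names and walks the whole code point range with `Character.getType`. The counts it prints depend on the Unicode version that the running JDK implements, so they may differ from the figures quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative demo class; names are not taken from any library.
final class DecimalDigitLister {

    // Collects every code point the running JVM classifies as a decimal digit (Nd).
    static List<Integer> decimalDigitCodePoints() {
        List<Integer> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> digits = decimalDigitCodePoints();
        System.out.println(digits.size() + " decimal digit code points");

        // Restrict to the Basic Multilingual Plane (nothing above U+FFFF).
        long bmp = digits.stream().filter(cp -> cp <= 0xFFFF).count();
        System.out.println(bmp + " of them are in the Basic Multilingual Plane");
    }
}
```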

Great references! (Though they do make me wonder if Unicode has gone over the top!) — PJTraill, Oct 20 '16 at 21:19
If you want to have this sort of reference available locally, you could install [uniprops](http://search.cpan.org/~bdfoy/Unicode-Tussle-1.111/script/uniprops). — TRiG, Oct 21 '16 at 14:44
@TRiG Thanks for mentioning that. Interesting utility. It does cover some of the functionality shown in the first 3 links (the original set), but I just updated my answer to include some additional links that show more advanced queries that can be done on Unicode.org that I don't see possible via `uniprops`. Also, it appears that `uniprops` is one version behind as Unicode released version 9.0 this past June. — Solomon Rutzky, Oct 21 '16 at 15:27

Nayan Katkani · Answer 5 · 2016-10-21T05:06:43.063

19

The symbols '१२३' are actually derived from the Hindi language (ultimately from Sanskrit, i.e. the Devanagari script) and represent numeric values, just like:

१ represents 1

२ represents 2

and likewise for the other digits.
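
In Java, this symbol-to-value mapping is available through `Character.digit`. A small sketch, with a hypothetical helper name `toAsciiDigits`, that rewrites any Unicode decimal digits as ASCII digits and leaves every other character unchanged:

```java
// Hypothetical helper; the class and method names are illustrative.
final class DigitNormalizer {

    static String toAsciiDigits(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            // Character.digit returns 0..9 for decimal digits, -1 otherwise.
            int value = Character.digit(cp, 10);
            if (value >= 0) {
                out.append((char) ('0' + value));
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Devanagari one, two, three
        System.out.println(toAsciiDigits("\u0967\u0968\u0969")); // 123
    }
}
```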

answered Oct 20 '16 at 08:06

CORRECTION: _Symbols '१२३' are actually derived from_ **Sanskrit** _language_ (i.e., Devanagari script as other posters have noted) – Happy Green Kid Naps Oct 20 '16 at 16:37
I was surprised to learn how recently Devanāgarī took its present form – many centuries after Sanskrit was codified! So I'm skeptical of the claim that the digits belong more to Sanskrit than to Indian culture in general. – Anton Sherwood Oct 22 '16 at 03:55
