Should I use \d or [0-9] to match digits in a Perl regex?

Question

Having read a number of questions and answers over the past few weeks, I have seen the use of \d in Perl regular expressions commented on as incorrect. In later versions of Perl, \d is not the same as [0-9]: \d matches any Unicode character that has the digit attribute, whereas [0-9] matches only the characters '0', '1', '2', ..., '9'.

I appreciate that in some contexts [0-9] will be the correct thing to use, and in others \d will be. I was wondering which people feel is the correct default to use?

Personally I find the \d notation very succinct and expressive, whereas in comparison [0-9] is somewhat cumbersome. But I have little experience of doing multi-language code, or rather code for languages that do not fit into the ASCII character range, and therefore may be being naive.

I notice

$ find /System/Library/Perl/5.8.8/ -name \*pm | xargs grep '\\d' | wc -l
  298
$ find /System/Library/Perl/5.8.8/ -name \*pm | xargs grep '\[0-9\]' | wc -l
  26

score 85 · Answer 1 · edited Jul 11 '14 at 21:43

Using \d seems very dangerous to me. It is a poor design decision in the language, because in most cases you want [0-9]; Huffman coding would dictate that the short \d be used for ASCII digits.

Most of the previous posters have already highlighted why you should use [0-9], so let me give you a bit more data:

If I read the unicode charts correctly '۷۰' is a number (70 in indic, don't take my word for it).

Try this:

$ perl -le '$one = chr 0xFF11; print "$one + 1 = ", $one+1;'
１ + 1 = 1

Here is a partial list of valid digits (which may or may not show up properly in your browser, depending on the fonts you use). In each row, only the first character is interpreted as a number when doing arithmetic in Perl, as shown above:

 ZERO:  0٠۰߀०০੦૦୦௦౦೦൦๐໐０
 ONE:   1١۱߁१১੧૧୧௧౧೧൧๑໑１
 TWO:   2٢۲߂२২੨૨୨௨౨೨൨๒໒２
 THREE: 3٣۳߃३৩੩૩୩௩౩೩൩๓໓３
 FOUR:  4٤۴߄४৪੪૪୪௪౪೪൪๔໔４
 FIVE:  5٥۵߅५৫੫૫୫௫౫೫൫๕໕５
 SIX:   6٦۶߆६৬੬૬୬௬౬೬൬๖໖６
 SEVEN: 7٧۷߇७৭੭૭୭௭౭೭൭๗໗７
 EIGHT: 8٨۸߈८৮੮૮୮௮౮೮൮๘໘８
 NINE:  9٩۹߉९৯੯૯୯௯౯೯൯๙໙９

Are you still not convinced?
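
To make the difference concrete, here is a minimal sketch that tests both patterns against "70" written with two Extended Arabic-Indic digits. The sample string and variable names are made up for illustration, and the code points U+06F7 and U+06F0 are my own choice of test data.

```perl
use strict;
use warnings;

# Hypothetical sample: "70" written with EXTENDED ARABIC-INDIC DIGIT SEVEN
# (U+06F7) and EXTENDED ARABIC-INDIC DIGIT ZERO (U+06F0).
my $seventy = "\x{6F7}\x{6F0}";

# \d is Unicode-aware, so it accepts these characters.
my $matches_d = $seventy =~ /\A\d+\z/ ? 1 : 0;

# The explicit range only accepts the ten ASCII digits.
my $matches_ascii = $seventy =~ /\A[0-9]+\z/ ? 1 : 0;

print "\\d+ matches:    $matches_d\n";       # 1
print "[0-9]+ matches: $matches_ascii\n";    # 0
```

A string that passes the \d check here would still be treated as 0 in arithmetic, which is the trap described above.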

answered May 21 '09 at 07:18 by mirod · edited Jul 11 '14 at 21:43 by Miller

+1 for that list! I was beginning to wonder which other number characters there were. – nickf May 21 '09 at 08:04
If Perl has embraced UNICODE this far, then it seems like it should go the rest of the way and handle all the digits. Of course, that way lies madness, but isn't madness the fate of all Perl programmers ;-) ? – RBerteig May 21 '09 at 08:06
there are still more characters, but I only included the ones that I could display on my system. I used the unicode data from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt, and extracted the character info from there. – mirod May 21 '09 at 09:17
I understand and appreciate that if you are dealing with UNICODE you need to handle numbers that perl cannot do arithmetic with. If I use \d I may end up with numbers that I cannot do arithmetic on, but if I use [0-9] I may miss out on numbers that I wanted to capture... so which is right? It's all down to the context of the input. I suppose I find it non-intuitive that perl decided to have the shorthand \d mean any number character and not any number character that I can do arithmetic on, or at least did not provide another suitable shorthand. – Beano May 21 '09 at 10:15
@nickf At my current count there are 61 sets of digits, see the module link in my answer for the list. – Chas. Owens May 21 '09 at 13:36
@Beano I am not saying don't use \d; I am saying don't use \d when you mean [0-9]. It is similar to not using \s when you mean [ ]. The question comes down to: do you mind matching ５ as well as 5? – Chas. Owens May 21 '09 at 13:46

score 50 · Accepted Answer · edited Jul 11 '14 at 21:32

For maximum safety, I'd suggest using [0-9] any time you don't specifically intend to match all unicode-defined digits.

Per perldoc perluniintro, Perl does not support using digits other than [0-9] as numbers, so I would definitely use [0-9] if the following are both true:

1. You want to use the result as a number, such as performing mathematical operations on it or storing it somewhere that only accepts proper numbers (e.g. an INT column in a database).
2. It is possible that non-digits [^0-9] would be present in the data in such a way that the regular expression could match them. (Note that this one should always be considered true for untrusted/hostile input.)

If either of these is false, there will only rarely be a reason to specifically avoid \d (and you'll probably be able to tell when that is the case). And if you're trying to match all Unicode-defined digits, you'll definitely want to use \d.
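
As a rough sketch of the first case (the result will be used as a number and the input is untrusted), the check might look like this. The helper name parse_ascii_int and the sample inputs are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: accept only strings made entirely of ASCII digits,
# so the result is always something Perl can do arithmetic on.
sub parse_ascii_int {
    my ($input) = @_;
    return unless defined $input && $input =~ /\A[0-9]+\z/;
    return $input + 0;
}

# "42" in ASCII, "42" in fullwidth digits (U+FF14 U+FF12), and a string with a space.
for my $candidate ("42", "\x{FF14}\x{FF12}", "4 2") {
    my $value = parse_ascii_int($candidate);
    print defined $value ? "accepted: $value\n" : "rejected\n";
}
```

The anchors \A and \z matter as much as the character class: without them a stray non-digit elsewhere in the string would slip through.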

\d can indeed match more than 10 different characters, if applied to Unicode strings. — pts, May 20 '09 at 23:31
`\d` matches anything with a numeric property. If you want only 0,1,2,3,4,5,6,7,8, and 9, match that with [0-9] or add `/a` to get ASCII semantics to the character class shortcuts https://www.effectiveperlprogramming.com/2011/01/know-your-character-classes/ — brian d foy, Dec 17 '22 at 07:38

score 11 · Answer 3 · edited Dec 17 '22 at 07:41

According to perlreref, \d is locale-aware and Unicode aware.

However, if the codeset you are using is not Unicode, then you don't need to worry about the Unicode digits, and if the codeset you are using is something like Latin-1 (ISO 8859-1, or 8859-15), then the locale-awareness won't hurt you either because the codeset does not include any other digit characters.

So, for many people, much of the time, you can use \d without concern. However, if Unicode data is part of your work, then you need to consider what you are after more carefully.

Chas. Owens · Answer 4 · 2009-05-21T13:33:39.373

Just like nuking the site from orbit, [0-9] is the only way to be sure. Yeah, it is ugly. Yeah, the choice to make \d be UNICODE and locale aware was stupid. But this is our bed and we have to lie in it.

As for the people burying their heads in the sand saying it doesn't affect the character set they are using today: you may be using that character set today, but the rest of the world is using UTF-8 now, and you will be using it soon as well. Remember to code as if the guy who maintains your code is a homicidal maniac who knows where you live.

Oh, and as for Perl modules using \d vs [0-9], even the core still has UNICODE problems.

If you do in fact mean any digit, but want to be able to do math with the results, you can use Text::Unidecode:

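#!/usr/bin/perl

use strict;
use warnings;

use Text::Unidecode;

# Emit UTF-8 so printing the Mongolian digits does not warn about wide characters.
binmode STDOUT, ':encoding(UTF-8)';

# MONGOLIAN DIGIT ONE through MONGOLIAN DIGIT FIVE
my $number = "\x{1811}\x{1812}\x{1813}\x{1814}\x{1815}";
print "$number is ", unidecode($number), "\n";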

After some more testing it looks like Text::Unidecode doesn't handle all digit characters correctly. I am writing a module that will work.
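
One alternative worth trying, which is not the module mentioned above, is the num() function from the core Unicode::UCD module. This is a sketch that assumes a Perl recent enough to ship num() (I believe 5.14 or later); as I understand its documentation, it returns undef unless the whole string is a run of decimal digits from a single script.

```perl
use strict;
use warnings;
use Unicode::UCD qw(num);

# MONGOLIAN DIGIT ONE through MONGOLIAN DIGIT FIVE, as in the example above.
my $mongolian = "\x{1811}\x{1812}\x{1813}\x{1814}\x{1815}";

# num() returns the numeric value, or undef when it does not consider
# the string a safe, valid number.
my $value = num($mongolian);

print defined $value ? "numeric value: $value\n" : "not a safe number\n";
```

Checking for undef before doing arithmetic keeps mixed-script or non-numeric input from being silently treated as zero.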

score 4 · Answer 5 · edited Oct 21 '16 at 14:58

I feel both must have their place. However, 99.999% of the time (especially in my closed big American corporation world) they are interchangeable. I use perl to manipulate data every day and in none of the data sets I deal with are there numbers that don’t fit in [0-9]. However, I do appreciate that there is an important distinction between \d and [0-9] and it’s good to be aware of that difference. I use \d because it seems more succinct (as you said) and would never be “wrong” in my small world of data manipulation.

answered May 20 '09 at 23:21 by Copas · edited Oct 21 '16 at 14:58 by TRiG

You want \d not /d - if you want it at all. – Telemachus May 21 '09 at 00:26
`\d` matches anything with a numeric property. If you want only 0,1,2,3,4,5,6,7,8, and 9, match that with [0-9] or add `/a` to get ASCII semantics to the character class shortcuts https://www.effectiveperlprogramming.com/2011/01/know-your-character-classes/ – brian d foy Dec 17 '22 at 07:40

rimiha · Answer 6 · 2022-12-21T19:22:50.653

2

The main objection above to using \d seems to be the non-ASCII digits.

This can be obviated with the /a option. e.g.:

m/\d/a

This restricts the digit matching to ASCII only.

https://perldoc.perl.org/perlre#/a-(and-/aa):

Under /a, \d always means precisely the digits "0" to "9"
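
A small sketch of the effect, assuming Perl 5.14 or later (where the /a modifier is available). The test character and variable names are just illustrative choices.

```perl
use strict;
use warnings;

my $arabic_five = "\x{665}";    # ARABIC-INDIC DIGIT FIVE

# Without /a, \d follows Unicode rules for this string.
my $plain = $arabic_five =~ /\d/ ? "match" : "no match";

# With /a, \d is restricted to the ASCII digits 0-9.
my $ascii = $arabic_five =~ /\d/a ? "match" : "no match";

# ASCII digits still match under /a.
my $ascii_5 = "5" =~ /\d/a ? "match" : "no match";

print "without /a: $plain\n";      # match
print "with /a:    $ascii\n";      # no match
print "'5' with /a: $ascii_5\n";   # match
```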

answered Dec 16 '22 at 23:39 by rimiha · edited Dec 21 '22 at 19:22

score 2 · Answer 7 · answered May 20 '09 at 23:29

If you apply \d to a Unicode string (such as in "\x{660}" =~ /\d/), it will match a Unicode digit. If you apply \d to a binary string (such as the UTF-8 equivalent of the above: "\xd9\xa0" =~ /\d/), it will match only the 10 ASCII digits. Perl 5.8 doesn't create Unicode strings by default (unless you specifically ask for it, such as with "\x{...}" or use utf8; etc.).

So my advice is: only pay attention to the difference between \d and [0-9] if your application uses Unicode strings.
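
The byte-string versus character-string distinction can be sketched as follows. It uses the core Encode module to turn the raw bytes into characters; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes that encode U+0660 (ARABIC-INDIC DIGIT ZERO) in UTF-8,
# still undecoded: Perl sees two separate characters, 0xD9 and 0xA0.
my $bytes = "\xd9\xa0";
my $bytes_match = $bytes =~ /\d/ ? 1 : 0;

# After decoding, the string holds a single character, U+0660.
my $chars = decode('UTF-8', $bytes);
my $chars_match = $chars =~ /\d/ ? 1 : 0;

print "undecoded bytes match \\d: $bytes_match\n";   # 0
print "decoded string matches \\d: $chars_match\n";  # 1
```

So whether \d can surprise you depends on whether your program decodes its input, which most programs handling UTF-8 text eventually do.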

answered May 20 '09 at 23:29 by pts

Why allow for the distinction if there's a way that you can get exactly what you mean every time? `\d` matches anything with a numeric property. If you want only 0,1,2,3,4,5,6,7,8, and 9, match that with [0-9] or add `/a` to get ASCII semantics to the character class shortcuts https://www.effectiveperlprogramming.com/2011/01/know-your-character-classes/ – brian d foy Dec 17 '22 at 07:41

score 1 · Answer 8 · edited Jun 21 '13 at 10:50

If [0-9] feels clunky perhaps you could define: $d=qr/[0-9]/; and use that instead of \d.
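
A quick sketch of how that might look in use. The sample line is invented; it mixes ASCII digits with one fullwidth digit (U+FF11) to show that the precompiled class skips it.

```perl
use strict;
use warnings;

my $d = qr/[0-9]/;    # reusable ASCII-only digit class

# Hypothetical input: ASCII "2024", a hyphen, FULLWIDTH DIGIT ONE, ASCII "7".
my $line = "order 2024-\x{FF11}7";

# Interpolate $d wherever \d would otherwise have been written.
my @runs = $line =~ /($d+)/g;

print join(",", @runs), "\n";    # 2024,7
```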

answered May 21 '09 at 15:04 · edited Jun 21 '13 at 10:50 by Nakilon

score -2 · Answer 9 · answered Jun 24 '16 at 13:43

As data format controls go up, the need for pattern specificity goes down...

For example, if you are matching a piece of data that has been machine-generated and always follows the same output formatting rules, you don't need to be so precise. Take IPv4 addresses: if you are trying to extract the IP address from a router interface configuration line, all you really need is something like:

 'ip\haddress\h(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\D'

IF, on the other hand, you are trying to find an IP address embedded deep somewhere in, say, an email X-Header, or if you are trying to VALIDATE an IP address, well..that is a whole 'nother story!
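
To illustrate the contrast, here is a sketch that applies the loose pattern to a made-up configuration line and then runs a stricter, still simplified, check. The sample line and the helper name looks_like_ipv4 are hypothetical, \h needs Perl 5.10 or later, and the stricter check is only one possible approach to validation.

```perl
use strict;
use warnings;

# Hypothetical router configuration line, invented for illustration.
my $line = "ip address 192.168.10.1 255.255.255.0";

# The loose pattern: good enough when the format is tightly controlled.
my ($ip) = $line =~ /ip\haddress\h(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\D/;

# A stricter check: ASCII digits only, anchored, each octet in 0..255.
sub looks_like_ipv4 {
    my ($candidate) = @_;
    my @octets = $candidate =~ /\A([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\z/
        or return 0;
    my $out_of_range = grep { $_ > 255 } @octets;
    return $out_of_range ? 0 : 1;
}

print "extracted: $ip\n";                                    # 192.168.10.1
print "valid: ", looks_like_ipv4($ip), "\n";                 # 1
print "999.1.1.1 valid: ", looks_like_ipv4("999.1.1.1"), "\n";   # 0
```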
