How do I get a list of all Unicode characters that have a given property?

Question

Without looping over the entire range of Unicode characters, how can I get a list of characters that have a given property? In particular, I want a list of all characters that are digits (i.e. those that match /\d/). I have looked at Unicode::UCD, and it is useful for determining the properties of a given character, but there doesn't seem to be a way to get a list of the characters that have a given property out of it.

score 6 · Accepted Answer · answered Jul 25 '09 at 16:51

The list of Unicode characters for each class is generated from the Unicode spec when you compile Perl, and is typically stored in /usr/lib/perl-YOURPERLVERSION/unicore/lib/gc_sc/

For example, the list of Unicode character ranges that match IsDigit (a.k.a. \d) is stored in the file /usr/lib/perl-YOURPERLVERSION/unicore/lib/gc_sc/Digit.pl
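As a rough sketch of what reading such a range table involves: each line holds a start code point in hex and, for a multi-character range, an end code point. The sample data below is a hand-written excerpt for illustration only (the real file's location and format vary between Perl versions, so the sketch does not load it), and `expand_ranges` is a hypothetical helper name, not part of any module.

```perl
use strict;
use warnings;

# Hypothetical excerpt of a range table: "START<tab>END" per line, in hex.
# The three ranges are the ASCII, Arabic-Indic, and Devanagari digits.
my $table = "0030\t0039\n0660\t0669\n0966\t096F\n";

# expand_ranges is an illustrative helper, not part of any module.
sub expand_ranges {
    my ($text) = @_;
    my @chars;
    for my $line (split /\n/, $text) {
        next unless $line =~ /\S/;              # skip blank lines
        my ($start, $end) = map { hex $_ } split ' ', $line;
        $end = $start unless defined $end;      # line with a single code point
        push @chars, map { chr $_ } $start .. $end;
    }
    return @chars;
}

my @digit_chars = expand_ranges($table);
printf "%d characters, first is U+%04X\n",
    scalar @digit_chars, ord $digit_chars[0];
```

Only the ranges are scanned, so the loop touches a few dozen code points rather than the whole Unicode range.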

tetromino

Thank you, this is almost exactly what I was looking for. I will still have to loop over them to build a list, but at least that won't take forever and a day. – Chas. Owens Jul 25 '09 at 16:57

Chas. Owens · Answer 2 · 2009-07-25T20:18:11.720

Even better than unicore/lib/gc_sc/Digit.pl is unicore/To/Digit.pl. It is a direct mapping of Unicode digit characters (well, really their offsets) to their numeric values. This means instead of:

use Unicode::Digits qw/digits_to_int/;

my @digits;
# each line of the table is read as a hex range: START END
for (split "\n", require "unicore/lib/gc_sc/Digit.pl") {
    my ($s, $e) = map hex, split;
    for (my $ord = $s; $ord <= $e; $ord++) {
        my $chr = chr $ord;
        # group the characters by their numeric value (0 .. 9)
        push @{$digits[digits_to_int $chr]}, $chr;
    }
}

for my $i (0 .. 9) {
    my $re = join '', "[", @{$digits[$i]}, "]";
    $digits[$i] = qr/$re/;
}

I can say:

my @digits;
for (split "\n", require "unicore/To/Digit.pl") {
    my ($ord, $val) = split;
    my $chr = chr hex $ord;
    push @{$digits[$val]}, $chr;
}

for my $i (0 .. 9) {
    my $re = join '', "[", @{$digits[$i]}, "]";
    $digits[$i] = qr/$re/;
}

Or even better:

my @digits;
for (split "\n", require "unicore/To/Digit.pl") {
    my ($ord, $val) = split;
    $digits[$val] .= "\\x{$ord}";
}
@digits = map { qr/[$_]/ } @digits;
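To show how the per-value regexes might then be used, here is a minimal sketch. It assumes the same "hex code point, numeric value" line format as above, but builds a small sample table by hand (ASCII and Arabic-Indic digits only) instead of loading the unicore file; `value_of` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hand-built sample in "HEX_CODE_POINT<tab>VALUE" form; the real mapping
# file is deliberately not loaded here.
my $sample = '';
for my $val (0 .. 9) {
    for my $base (0x0030, 0x0660) {      # ASCII and Arabic-Indic zero
        $sample .= sprintf "%04X\t%d\n", $base + $val, $val;
    }
}

# Build one character class per numeric value, as in the answer above.
my @classes;
for my $line (split /\n/, $sample) {
    my ($ord, $val) = split ' ', $line;
    $classes[$val] .= "\\x{$ord}";
}
my @digits = map { qr/[$_]/ } @classes;

# value_of is an illustrative helper: numeric value of one digit character,
# or undef if the character is not in the sample table.
sub value_of {
    my ($chr) = @_;
    for my $val (0 .. $#digits) {
        return $val if $chr =~ /\A$digits[$val]\z/;
    }
    return;
}

printf "U+0663 has value %d\n", value_of("\x{0663}");
```

With the full table in place of the sample, the same lookup would cover every digit the table lists.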

score 0 · Answer 3 · answered Jul 25 '09 at 16:28

Which characters /\d/ matches depends entirely on your regexp implementation (although the standard 0-9 are guaranteed). In Perl's case, the locale in use defines which characters are considered alphabetic and which are considered digits.

ewanm89

Perl transforms strings into utf8 before running them through the regex engine. The only thing that perl locale affects is how a raw byte string is transformed into utf8. Once a string is in utf8, perl will always use the same definition of IsDigit, independent of locale. – tetromino Jul 25 '09 at 16:56
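A small check of this behaviour, as a sketch: it assumes a Perl new enough to support the /a regex modifier (5.14 or later), which restricts \d to ASCII 0-9. Without /a, \d also matches a non-ASCII digit such as ARABIC-INDIC DIGIT THREE, with no locale involved.

```perl
use strict;
use warnings;

my $ascii_seven  = "7";
my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE

# Plain \d uses the Unicode definition of a decimal digit.
my $plain_ascii  = $ascii_seven  =~ /\A\d\z/ ? 1 : 0;
my $plain_arabic = $arabic_three =~ /\A\d\z/ ? 1 : 0;

# With /a, \d is limited to the ASCII digits 0-9.
my $ascii_only_arabic = $arabic_three =~ /\A\d\z/a ? 1 : 0;

print "plain \\d, ASCII 7:   $plain_ascii\n";        # 1
print "plain \\d, U+0663:    $plain_arabic\n";       # 1
print "\\d with /a, U+0663:  $ascii_only_arabic\n";  # 0
```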

score 0 · Answer 4 · answered Jul 25 '09 at 18:02

There is no way to do that without iterating through all the characters. (If you create a huge string containing all of them and use a regexp, you still have to loop at least once to create the string.)

Mihai Nita

Happily, part of the Perl build process creates a set of files under `unicore` in one of the lib directories that already have a lot of the work done for you. I don't know if they are official or not; I have a question in to the Perl 5 Porters list to find out if it is safe to use them. – Chas. Owens Jul 25 '09 at 20:45
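For readers on a newer Perl, a documented route that avoids the `unicore` files may be available. The sketch below assumes a Unicode::UCD recent enough to export `prop_invlist` (to my knowledge it ships with Perl 5.16 and later, well after this thread was written). It returns an inversion list, which is turned into start/end ranges here; the property name is spelled out in full as an assumption about the most portable form.

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invlist);

# Inversion list: even-indexed entries start a range of code points that
# have the property, odd-indexed entries start the next range that does not.
my @invlist = prop_invlist("General_Category=Decimal_Number");

my @ranges;
for (my $i = 0; $i < @invlist; $i += 2) {
    my $start = $invlist[$i];
    # If the list has no closing entry, the range runs to the last code point.
    my $end = $i + 1 < @invlist ? $invlist[$i + 1] - 1 : 0x10FFFF;
    push @ranges, [$start, $end];
}

printf "U+%04X..U+%04X\n", @$_ for @ranges;
```

The number of ranges printed depends on the Unicode version bundled with the Perl in use.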
