5

I need the list of ranges of Unicode characters with the property Alphabetic as defined in http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Alphabetic. However, I cannot find them in the Unicode Character Database no matter how I search for them. Can somebody provide a list of them or just a search facility for characters with specified Unicode properties?

tchrist
thSoft
  • If you look at my answer, I have per your request provided you with a search facility for characters with specified Unicode properties by way of [my unichars script](http://training.perl.com/scripts/unichars). Enjoy! – tchrist Jan 30 '11 at 15:46

4 Answers

14

The Unicode Character Database comprises all the text files in the distribution. It is not just a single file as it once was long ago.

The Alphabetic property is a derived property.
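For anyone who still needs the raw ranges, derived properties such as Alphabetic are listed in the UCD file DerivedCoreProperties.txt, one code point or range per line. The following is only a minimal sketch of reading that format, not part of any official tooling. The function name `property_ranges` and the inline `sample` lines are illustrative assumptions, and the merging step assumes the entries for a property appear in ascending order.

```python
import re

# Matches UCD data lines such as:
#   0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..
#   00AA          ; Alphabetic # Lo       FEMININE ORDINAL INDICATOR
_LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)"
)


def property_ranges(lines, prop="Alphabetic"):
    """Return [(first, last), ...] for every entry tagged with `prop`.

    Hypothetical helper: assumes entries for `prop` are sorted by code
    point, so adjacent entries can be merged into a single range.
    """
    ranges = []
    for line in lines:
        m = _LINE.match(line)
        if not m or m.group(3) != prop:
            continue  # comment, blank line, or a different property
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)
        if ranges and ranges[-1][1] + 1 == first:
            # Contiguous with the previous entry: extend it.
            ranges[-1] = (ranges[-1][0], last)
        else:
            ranges.append((first, last))
    return ranges


# A few lines in the file's format, used here instead of the real file.
sample = [
    "# Derived Property: Alphabetic",
    "0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "0061..007A    ; Alphabetic # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z",
    "00AA          ; Alphabetic # Lo       FEMININE ORDINAL INDICATOR",
    "0041..005A    ; Uppercase # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
]

for first, last in property_ranges(sample):
    print(f"U+{first:04X}..U+{last:04X}")
```

To run it against real data, pass an open file handle for a locally downloaded copy of DerivedCoreProperties.txt in place of `sample`.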

You really do not want to use code point ranges for this. You want to use the property itself, because there are far too many code points to enumerate. Using the unichars script, we learn that there are more than ten thousand in the Basic Multilingual Plane alone, not counting Han or Hangul:

$ unichars '\p{Alphabetic}' | wc -l
   10052

If we include the 16 astral planes, we’re at more than fourteen thousand:

$ unichars -a '\p{Alphabetic}' | wc -l
   14736

And if we include Han and Hangul, which the Alphabetic property in fact does, we blow the roof off at more than a hundred thousand code points:

$ unichars -ua '\p{Alphabetic}' | wc -l
  101539

I hope you can see that you do not want to specifically enumerate these using code point ranges. Down that road lies madness.

By the way, if you find the unichars script useful, you might also like the uniprops script and perhaps the uninames script.

thSoft
tchrist
  • 1
    I really like your scripts! They will be very useful for resolving a [SO question](http://stackoverflow.com/questions/6246651/generate-uri-friendly-unicode-code-points-from-integer-counter) I had. Thanks so much for making them. Question: when I just ran the last command above (`unichars -ua '\p{Alphabetic}' | wc -l`), I got 94332 lines instead of 101539. Any reason why that might be? – Abe Voelker Jun 13 '11 at 00:13
  • 1
    @Abe: Prolly cause you are not running Unicode 6.0.0 yet. What version of Perl are you running? `corelist -a Unicode` will show you the pairings of Perl versions with Unicode versions. BTW, I now have        in my [Unicode toolchest](http://training.perl.com/scripts/), with more on the way. – tchrist Jun 13 '11 at 00:42
  • Ah yes, I am running Perl 5 still. I'll definitely upgrade Perl and check out your new tools. Thanks! – Abe Voelker Jun 13 '11 at 01:22
  • @Abe Perl v5.8.8 had Unicode v4.1; Perl v5.8.9 and Perl v5.10.1 had Unicode v5.1; Perl v5.12 had Unicode v5.2; and Perl v5.14 has Unicode v6.0.0. I would install Perl v5.14 if you can, and v5.12 if you cannot. Just make sure to do the CPAN `autobundle` trick to upgrade all your post-facto installed CPAN modules. – tchrist Jun 13 '11 at 01:41
  • Link no longer works and is not archived by the Internet Archive. Searching turns up https://metacpan.org/pod/distribution/Unicode-Tussle/script/unichars and https://github.com/turian/common-scripts/blob/master/unichars – Jacob C. Mar 19 '21 at 20:47
3

Derived Core Properties can be calculated from the other properties.

The Alphabetic property is defined as: Generated from: Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic

So, if you take all the characters in Lu, Ll, Lt, Lm, Lo, Nl, and all the characters with the Other_Alphabetic property, you will have the Alphabetic characters.
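The general-category part of that formula can be sketched with Python's standard `unicodedata` module. This is only an approximation under stated assumptions: `unicodedata` does not expose Other_Alphabetic, so the result is a lower bound on the Alphabetic set, and it reflects whichever Unicode version the running Python ships with (`unicodedata.unidata_version`). The names `ALPHABETIC_CATEGORIES` and `category_ranges` are hypothetical.

```python
import unicodedata

# General categories named in the derivation. Other_Alphabetic is
# deliberately missing: unicodedata has no lookup for it.
ALPHABETIC_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def category_ranges(code_points):
    """Coalesce matching code points into (first, last) ranges.

    `code_points` must be an ascending iterable of integers.
    """
    ranges = []
    for cp in code_points:
        if unicodedata.category(chr(cp)) not in ALPHABETIC_CATEGORIES:
            continue
        if ranges and ranges[-1][1] + 1 == cp:
            # Directly follows the previous match: extend that range.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


# ASCII only, to keep the demonstration small.
for first, last in category_ranges(range(0x80)):
    print(f"U+{first:04X}..U+{last:04X}")
```

Passing `range(0x110000)` covers every code point. The characters with Other_Alphabetic would still have to be added from the UCD data files to obtain the full property.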

Avi
2

Citation from your source: Generated from: Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic

These abbreviations seem to be explained here.

flying sheep
1

I found the UniView web application which provides a nice search interface. Searching for the Letter property (with Local unchecked) gives 14723 results...

thSoft
  • 1
    The Letter property is not the same as the Alphabetic property!!!! In Unicode 6.0.0, there are 101539 code points with the Alphabetic property but only 100520 with the Letter property, a difference of over a thousand characters. BTW, your 14k answer is off by an order of magnitude. – tchrist Jan 30 '11 at 20:39
  • You're right. BTW, I think the UniView tool doesn't take Han and Hangul into account. – thSoft Jan 30 '11 at 22:46
  • Link is dead :/ – Artemis Dec 11 '21 at 12:27