Questions tagged [character-properties]

character-properties refers to the set of attributes that the Unicode Standard assigns to each character it encodes. Algorithms and processes that interpret text consult these properties to implement correct character behavior.

The Unicode Standard, on top of defining the encoding of characters, also associates a rich set of semantics with each encoded character: properties that are required for interoperability and correct behavior in implementations, as well as for Unicode conformance. These semantics are cataloged in the Unicode Character Database (UCD), a collection of data files that list each code point together with its character name and property values.
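
As a rough illustration of what such properties look like in practice, the sketch below reads a few of them through Python's standard `unicodedata` module, which exposes part of the UCD. The helper name `describe` and the sample characters are only assumptions chosen for the example; which UCD version you get depends on the Python build.

```python
import unicodedata


def describe(ch):
    """Return a few Unicode character properties for a single character.

    `describe` is a hypothetical helper for illustration; it only wraps
    lookups that the standard `unicodedata` module already provides.
    """
    return {
        # Official character name, or a placeholder if the character has none.
        "name": unicodedata.name(ch, "<unnamed>"),
        # General category, e.g. "Lu" (uppercase letter), "Mn" (nonspacing mark).
        "category": unicodedata.category(ch),
        # Canonical combining class; 0 for characters that do not combine.
        "combining": unicodedata.combining(ch),
        # East Asian width, e.g. "W" for wide characters.
        "east_asian_width": unicodedata.east_asian_width(ch),
    }


if __name__ == "__main__":
    # A Latin letter, an Arabic combining mark (FATHA), and a CJK ideograph.
    for ch in ("A", "\u064E", "\u4E2D"):
        print(f"U+{ord(ch):04X}", describe(ch))
```

Properties such as the general category are what regex constructs like `\p{L}` or `\p{Zs}` match against in engines that support them.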

More information can be found on Wikipedia, in the official Unicode Standard as well as in this Unicode Technical Report.

92 questions
258
votes
11 answers

How can I use Unicode-aware regular expressions in JavaScript?

There should be something akin to \w that can match any code-point in Letters or Marks category (not just the ASCII ones), and hopefully have filters like [[P*]] for punctuation, etc.
Amit
134
votes
3 answers

Unicode equivalents for \w and \b in Java regular expressions?

Many modern regex implementations interpret the \w character class shorthand as "any letter, digit, or connecting punctuation" (usually: underscore). That way, a regex like \w+ matches words like hello, élève, GOÄ_432 or gefräßig. Unfortunately,…
Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
91
votes
2 answers

Python and regular expression with Unicode

I need to delete some Unicode symbols from the string 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ' I know they exist here for sure. I tried: re.sub('([\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+)', '', 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ') but…
bsn
  • 1,024
  • 1
  • 8
  • 7
85
votes
11 answers

How to match Cyrillic characters with a regular expression

How do I match French and Russian Cyrillic alphabet characters with a regular expression? I only want to do the alpha characters, no numbers or special characters. Right now I have [A-Za-z]
Greg Finzer
  • 6,714
  • 21
  • 80
  • 125
77
votes
6 answers

Python regex matching Unicode properties

Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use \p{Ll} to match an arbitrary lower-case letter, or \p{Zs} for any space separator. I don't see support for this in either…
ThomasH
  • 22,276
  • 13
  • 61
  • 62
33
votes
1 answer

Matching only a unicode letter in Python re

I have a string from which I want to extract 3 groups: '19 janvier 2012' -> '19', 'janvier', '2012' Month name could contain non-ASCII characters, so [A-Za-z] does not work for me: >>> import re >>> re.search(ur'(\d{,2}) ([A-Za-z]+) (\d{4})', u'20…
warvariuc
  • 57,116
  • 41
  • 173
  • 227
32
votes
3 answers

matching unicode characters in python regular expressions

I have read through the other questions on Stack Overflow, but I am still no closer. Sorry if this is already answered, but I didn't get anything proposed there to work. >>> import re >>> m =…
Weholt
  • 1,889
  • 5
  • 22
  • 35
31
votes
4 answers

Regex and unicode

I have a script that parses the filenames of TV episodes (show.name.s01e02.avi for example), grabs the episode name (from the www.thetvdb.com API) and automatically renames them into something nicer (Show Name - [01x02].avi). The script works fine,…
dbr
  • 165,801
  • 69
  • 278
  • 343
27
votes
2 answers

Is There a Way to Match Any Unicode Alphabetic Character?

I have some documents that went through OCR conversion from PDF into HTML. Because of that, they wound up having lots of random Unicode punctuation where the converter messed up (i.e. ellipses, etc.). They also correctly have a bunch of…
Eli
  • 36,793
  • 40
  • 144
  • 207
24
votes
5 answers

How to know the preferred display width (in columns) of Unicode characters?

In different encodings of Unicode, for example UTF-16le or UTF-8, a character may occupy 2 or 3 bytes. Many Unicode applications don't account for the display width of Unicode characters and treat them all as if they were Latin letters. For example, in 80-column…
Lenik
  • 13,946
  • 17
  • 75
  • 103
23
votes
3 answers

Does \w match all alphanumeric characters defined in the Unicode standard?

Does Perl's \w match all alphanumeric characters defined in the Unicode standard? For example, will \w match all (say) Chinese and Russian alphanumeric characters? I wrote a simple test script (see below) which suggests that \w does indeed match "as…
knorv
  • 49,059
  • 74
  • 210
  • 294
19
votes
2 answers

Match any unicode letter?

In .net you can use \p{L} to match any letter, how can I do the same in Python? Namely, I want to match any uppercase, lowercase, and accented letters.
mpen
  • 272,448
  • 266
  • 850
  • 1,236
17
votes
3 answers

Matching (e.g.) a Unicode letter with Java regexps

There are many questions and answers here on Stack Overflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However with Unicode there are many more characters that most people would regard as a letter (all the Greek letters, Cyrillic…
The Archetypal Paul
  • 41,321
  • 20
  • 104
  • 134
17
votes
5 answers

Matching Unicode letter characters in PCRE/PHP

I'm trying to write a reasonably permissive validator for names in PHP, and my first attempt consists of the following pattern: // unicode letters, apostrophe, hyphen, space $namePattern = "/^([\\p{L}'\\- ])+$/"; This is eventually passed to a call…
Jeff Lee
  • 1,306
  • 2
  • 14
  • 20
17
votes
5 answers

Unicode block of a character in python

Is there a way to get the Unicode Block of a character in python? The unicodedata module doesn't seem to have what I need, and I couldn't find an external library for it. Basically, I need the same functionality as Character.UnicodeBlock.of() in…
itsadok
  • 28,822
  • 30
  • 126
  • 171