Questions tagged [character-properties]

Character properties are a set of attributes that the Unicode Standard defines for each character it encodes. Many of these properties are specified in relation to the processes or algorithms that interpret them in order to implement character behavior.

In addition to defining the encoding of characters, the Unicode Standard associates a rich set of semantics with each encoded character: properties that are required for interoperability and correct behavior in implementations, as well as for Unicode conformance. These semantics are cataloged in the Unicode Character Database (UCD), a collection of data files that list the Unicode code points, the character names, and the properties assigned to them.
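As a rough illustration of what such properties look like in practice, the sketch below reads a few of them through Python's standard `unicodedata` module, which exposes part of the UCD. The helper names `describe` and `is_letter` are hypothetical and exist only for this example, and the values returned depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(ch):
    """Return a few UCD-derived properties for a single character.

    `describe` is an illustrative helper, not part of any library.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        # name() raises ValueError for unnamed characters unless a default is given
        "name": unicodedata.name(ch, "<unnamed>"),
        # General_Category, e.g. "Ll" (lowercase letter) or "Mn" (nonspacing mark)
        "category": unicodedata.category(ch),
        # Canonical_Combining_Class; 0 means the character is not a combining mark
        "combining": unicodedata.combining(ch),
        # Bidi_Class, e.g. "L" for left-to-right
        "bidirectional": unicodedata.bidirectional(ch),
        # East_Asian_Width, e.g. "W" for wide
        "east_asian_width": unicodedata.east_asian_width(ch),
    }


def is_letter(ch):
    """True if the General_Category of `ch` is one of the letter categories (L*)."""
    return unicodedata.category(ch).startswith("L")


if __name__ == "__main__":
    for ch in ("é", "Ж", "中", "\u064B", "_"):
        print(describe(ch), is_letter(ch))
```

A check such as `is_letter` is one way to approximate the `\p{L}` property class in environments whose regex engine does not support property escapes.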

More information can be found on Wikipedia, in the official Unicode Standard as well as in this Unicode Technical Report.

92 questions

258

votes

11 answers

How can I use Unicode-aware regular expressions in JavaScript?

There should be something akin to \w that can match any code-point in Letters or Marks category (not just the ASCII ones), and hopefully have filters like [[P*]] for punctuation, etc.

asked Nov 11 '08 at 12:00

Amit

134

votes

3 answers

Unicode equivalents for \w and \b in Java regular expressions?

Many modern regex implementations interpret the \w character class shorthand as "any letter, digit, or connecting punctuation" (usually: underscore). That way, a regex like \w+ matches words like hello, élève, GOÄ_432 or gefräßig. Unfortunately,…

java regex unicode character-properties

asked Nov 29 '10 at 15:00

Tim Pietzcker

328,213
58
503
561

votes

2 answers

Python and regular expression with Unicode

I need to delete some Unicode symbols from the string 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ' I know they exist here for sure. I tried: re.sub('([\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+)', '', 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ') but…

python regex python-2.x character-properties

asked Dec 26 '08 at 14:40

bsn

1,024
1
8
7

votes

11 answers

How to match Cyrillic characters with a regular expression

How do I match French and Russian Cyrillic alphabet characters with a regular expression? I only want to do the alpha characters, no numbers or special characters. Right now I have [A-Za-z]

regex unicode character-properties

asked Nov 11 '09 at 17:01

Greg Finzer

6,714
21
80
125

votes

6 answers

Python regex matching Unicode properties

Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use \p{Ll} to match an arbitrary lower-case letter, or \p{Zs} for any space separator. I don't see support for this in either…

python regex unicode ucd character-properties

asked Dec 02 '09 at 13:25

ThomasH

22,276
13
61
62

votes

1 answer

Matching only a unicode letter in Python re

I have a string from which I want to extract 3 groups: '19 janvier 2012' -> '19', 'janvier', '2012' Month name could contain non-ASCII characters, so [A-Za-z] does not work for me: >>> import re >>> re.search(ur'(\d{,2}) ([A-Za-z]+) (\d{4})', u'20…

python regex unicode character-properties

asked Jan 19 '12 at 09:49

warvariuc

57,116
41
173
227

votes

3 answers

matching unicode characters in python regular expressions

I have read through the other questions on Stack Overflow, but I am still no closer. Sorry if this is already answered, but I didn't get anything proposed there to work. >>> import re >>> m =…

python regex unicode non-ascii-characters character-properties

asked Feb 17 '11 at 12:08

Weholt

1,889
5
22
35

votes

4 answers

Regex and unicode

I have a script that parses the filenames of TV episodes (show.name.s01e02.avi for example), grabs the episode name (from the www.thetvdb.com API) and automatically renames them into something nicer (Show Name - [01x02].avi) The script works fine,…

python regex unicode character-properties

asked Aug 18 '08 at 09:41

dbr

165,801
69
278
343

votes

2 answers

Is There a Way to Match Any Unicode Alphabetic Character?

I have some documents that went through OCR conversion from PDF into HTML. Because of that, they wound up having lots of random Unicode punctuation where the converter messed up (e.g. ellipses, etc.). They also correctly have a bunch of…

regex perl unicode character-properties

asked May 14 '11 at 23:32

Eli

36,793
40
144
207

votes

5 answers

How to know the preferred display width (in columns) of Unicode characters?

In different encodings of Unicode, for example UTF-16le or UTF-8, a character may occupy 2 or 3 bytes. Many Unicode applications don't take the display width of Unicode characters into account and treat them as if they were all Latin letters. For example, in 80-column…

unicode text-formatting character-properties mbcs

asked Sep 03 '10 at 09:54

Lenik

13,946
17
75
103

votes

3 answers

Does \w match all alphanumeric characters defined in the Unicode standard?

Does Perl's \w match all alphanumeric characters defined in the Unicode standard? For example, will \w match all (say) Chinese and Russian alphanumeric characters? I wrote a simple test script (see below) which suggests that \w does indeed match "as…

regex perl unicode internationalization character-properties

asked Apr 05 '11 at 17:04

knorv

49,059
74
210
294

votes

2 answers

Match any unicode letter?

In .net you can use \p{L} to match any letter, how can I do the same in Python? Namely, I want to match any uppercase, lowercase, and accented letters.

python regex character-properties

asked Jun 11 '11 at 07:05

mpen

272,448
266
850
1,236

votes

3 answers

Matching (e.g.) a Unicode letter with Java regexps

There are many questions and answers here on Stack Overflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However with Unicode there are many more characters that most people would regard as a letter (all the Greek letters, Cyrillic…

java regex unicode character-properties character-class

asked Mar 15 '11 at 17:10

The Archetypal Paul

41,321
20
104
134

votes

5 answers

Matching Unicode letter characters in PCRE/PHP

I'm trying to write a reasonably permissive validator for names in PHP, and my first attempt consists of the following pattern: // unicode letters, apostrophe, hyphen, space $namePattern = "/^([\\p{L}'\\- ])+$/"; This is eventually passed to a call…

php regex unicode pcre character-properties

asked Feb 13 '11 at 09:17

Jeff Lee

1,306
2
14
20

votes

5 answers

Unicode block of a character in python

Is there a way to get the Unicode Block of a character in python? The unicodedata module doesn't seem to have what I need, and I couldn't find an external library for it. Basically, I need the same functionality as Character.UnicodeBlock.of() in…

python unicode character-properties

asked Oct 28 '08 at 15:56

itsadok

28,822
30
126
171
