How to write regular expression matching all unicode characters in Python?

Question

I have read the documentation and written hundreds of regular expressions, but I have no idea how to match a sequence of Unicode letters.

# this matches a sequence of English letters
re.compile(r'[a-zA-Z]+')
# this matches a sequence of Unicode letters plus [0-9_]
re.compile(r'\w+', re.UNICODE)
# how do I match a sequence of Unicode letters only (without [0-9_])?
re.compile(r'????', re.UNICODE)

How do I match only Unicode letters, without [0-9_]?

I tested your solutions:

import re
import timeit

def test1():
  regex = re.compile(ur'(?:(?![\d_])\w)+', re.UNICODE)
  return regex.findall(u'Ala ma kota z czarną sierścią - 1halo - halo1.')

def test2():
  regex = re.compile(ur'[^\W\d_]+', re.UNICODE)
  return regex.findall(u'Ala ma kota z czarną sierścią - 1halo - halo1.')

print test1()
print test2()

print timeit.timeit(test1)
print timeit.timeit(test2)

The results and timings are:

[u'Ala', u'ma', u'kota', u'z', u'czarn\u0105', u'sier\u015bci\u0105', u'halo', u'halo']
[u'Ala', u'ma', u'kota', u'z', u'czarn\u0105', u'sier\u015bci\u0105', u'halo', u'halo']
11.0143377108
7.42619199741
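For readers on Python 3, the same comparison can be sketched as below. This is an illustrative rewrite, not the original benchmark: the names `TEXT`, `LOOKAHEAD`, `NEGATED`, `run_lookahead` and `run_negated` are made up here, the patterns are compiled once outside the timed functions, and the repeat count of 20000 is an arbitrary choice. Absolute timings will differ by machine and Python version.

```python
import re
import timeit

TEXT = "Ala ma kota z czarną sierścią - 1halo - halo1."

# In Python 3, str patterns are Unicode-aware by default, so re.UNICODE is not needed.
LOOKAHEAD = re.compile(r"(?:(?![\d_])\w)+")  # \w, but not a digit or underscore
NEGATED = re.compile(r"[^\W\d_]+")           # not (non-word, digit, or underscore)

def run_lookahead():
    return LOOKAHEAD.findall(TEXT)

def run_negated():
    return NEGATED.findall(TEXT)

# Time only the matching step; compilation happened once above.
lookahead_time = timeit.timeit(run_lookahead, number=20000)
negated_time = timeit.timeit(run_negated, number=20000)

print(run_lookahead())
print(run_negated())
print("lookahead:", lookahead_time)
print("negated class:", negated_time)
```

Both patterns should return the same list of words on this sample; only the timing is expected to differ.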

What is your definition of "Unicode character"? "Unicode" covers *all* characters that are part of the Unicode specification. — Ignacio Vazquez-Abrams, Feb 28 '15 at 16:46
You'll have to find all the ranges of desired characters yourself. — Malik Brahimi, Feb 28 '15 at 16:47
Do you mean that you want to match all word characters (used to form words in any language) except the standard Latin characters A-Z and the standard digits 0-9? What about punctuation characters? Whitespace? Control characters? Symbolic characters (such as mathematical symbols)? The clearer you are about your requirements, the more likely you are to receive a good answer. — Bobulous, Feb 28 '15 at 16:49
@IgnacioVazquez-Abrams For me, Unicode means letters such as 'ąćęłóńśżźĄĆĘŁÓŃŚŻŹ', though that should not exclude letters from other languages. — Chameleon, Feb 28 '15 at 16:56
I would check what the docs say about regex constructs in Unicode mode. If it doesn't do properties, you should check the full extent of the `\w` construct, but I don't think that will be enough. — , Feb 28 '15 at 19:57

score 3 · Accepted Answer · answered Feb 28 '15 at 16:53

You can combine a negative lookahead with \w to match "word characters" excluding digits and underscores:

re.compile(r"(?:(?![\d_])\w)+", re.UNICODE)
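A minimal Python 3 sketch of how this pattern behaves on the sample sentence from the question. The names `LETTERS`, `sample` and `words` are illustrative only. The lookahead `(?![\d_])` rejects a position if the next character is a digit or underscore, and `\w` then consumes one word character, so digits adjacent to letters are skipped rather than matched.

```python
import re

# (?![\d_]) : the next character must not be a digit or underscore
# \w        : then consume one word character
# (?:...)+  : repeat for a whole run of such characters
LETTERS = re.compile(r"(?:(?![\d_])\w)+")

sample = "Ala ma kota z czarną sierścią - 1halo - halo1."
words = LETTERS.findall(sample)
print(words)
# ['Ala', 'ma', 'kota', 'z', 'czarną', 'sierścią', 'halo', 'halo']
```

In Python 3 the `re.UNICODE` flag is the default for `str` patterns, so it is omitted here; in Python 2 both the flag and a `u''` subject string are needed.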

Blckknght

Failed `>>> re.findall(r'(?:(?![\d_])\w)+', 'Ala ma kota z czarną sierścią.', re.UNICODE) == ['Ala', 'ma', 'kota', 'z', 'czarn\xb9', 'sier', 'ci\xb9']` – Chameleon Feb 28 '15 at 17:00
I suspect this is an encoding issue with your string. It works for me using Python 3. If you're using Python 2, try putting a `u` before the quote of the string to make it a Unicode literal. – Blckknght Feb 28 '15 at 17:06

score 1 · Answer 2 · edited May 23 '17 at 11:50

Use Unicode strings and declare a source encoding. Python 2.7 doesn't have a shortcut for "Unicode alpha characters", but with the Unicode flag, `[^\W\d_]` matches word characters that are neither digits nor underscores:

# coding: utf8
import re
expr = re.compile(ur'(?u)[^\W\d_]+')
s = u'The quick brown fóx jumped over Łhe laży dog 17 times.'
for i in expr.finditer(s):
    print i.group(0)

Output:

The
quick
brown
fóx
jumped
over
Łhe
laży
dog
times

Also see this answer if you want all of what Unicode considers upper and lowercase Unicode letters.
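As a cross-check against what Unicode itself classifies as letters, the standard `unicodedata` module can be used directly. The sketch below is Python 3 and is not taken from the linked answer; `is_letter` and `letter_runs` are hypothetical helper names. It keeps runs of characters whose general category starts with "L" and compares the result with the `[^\W\d_]+` pattern on the same sentence.

```python
import re
import unicodedata
from itertools import groupby

def is_letter(ch):
    # Every Unicode letter category (Lu, Ll, Lt, Lm, Lo) starts with "L".
    return unicodedata.category(ch).startswith("L")

def letter_runs(text):
    # Compose accents first so that "ó" is one code point, not "o" plus a combining mark.
    text = unicodedata.normalize("NFC", text)
    # Group consecutive characters by is_letter and keep only the letter groups.
    return ["".join(group) for keep, group in groupby(text, key=is_letter) if keep]

s = "The quick brown fóx jumped over Łhe laży dog 17 times."
runs = letter_runs(s)
regex_runs = re.findall(r"[^\W\d_]+", unicodedata.normalize("NFC", s))

print(runs)
print(runs == regex_runs)
```

On this sentence the two approaches agree. They may differ on other input (for example, numeric characters that are not decimal digits), so treat this as a cross-check rather than a drop-in replacement.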

Community

answered Feb 28 '15 at 17:16

Mark Tolonen

Your solution is not a good pattern since it works only for Polish. I think `[^\W\d_]` or `(?:(?![\d_])\w)+` is better, but I need to test them. – Chameleon Feb 28 '15 at 17:31
@Chameleon, also see the linked answer for a full solution. – Mark Tolonen Feb 28 '15 at 17:36
@Chameleon, `[^\W\d_]` works if you add Unicode flag. See updated, but make sure to use Unicode strings. – Mark Tolonen Feb 28 '15 at 17:42
I always use Unicode since I write international programs that use Polish, German and English. – Chameleon Feb 28 '15 at 17:46

score 0 · Answer 3 · answered Feb 28 '15 at 17:13

Try this; it matches any Unicode character except digits:

re.compile(r'\D')

Mohamed Ramzy Helmy

That'll match spaces and symbols as well, and would need the `re.UNICODE` flag. – Mark Tolonen Feb 28 '15 at 17:17
Failure. Matches spaces too. – Chameleon Feb 28 '15 at 17:32
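The objection in these comments can be reproduced with a short Python 3 sketch. The variable names are illustrative. `\D` means "anything that is not a digit", so whitespace and punctuation are matched along with letters, whereas `[^\W\d_]` is limited to word characters that are neither digits nor underscores.

```python
import re

sample = "Ala ma kota 1halo"

# \D+ swallows spaces too: the first match runs up to the digit.
non_digits = re.findall(r"\D+", sample)

# [^\W\d_]+ stops at spaces, digits and underscores.
letters = re.findall(r"[^\W\d_]+", sample)

print(non_digits)  # ['Ala ma kota ', 'halo']
print(letters)     # ['Ala', 'ma', 'kota', 'halo']
```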
