-1

I have read the documentation and written hundreds of regular expressions, but I have no idea how to match a sequence of Unicode letters.

# this matches a sequence of English (ASCII) letters
re.compile(r'[a-zA-Z]+')
# this matches a sequence of Unicode letters plus [0-9_]
re.compile(r'\w+', re.UNICODE)
# how do I match a sequence of Unicode letters only (without [0-9_])?
re.compile(r'????', re.UNICODE)

How do I match only Unicode letters, without [0-9_]?


I tested your solutions:

import re
import timeit

def test1():
  regex = re.compile(ur'(?:(?![\d_])\w)+', re.UNICODE)
  return regex.findall(u'Ala ma kota z czarną sierścią - 1halo - halo1.')

def test2():
  regex = re.compile(ur'[^\W\d_]+', re.UNICODE)
  return regex.findall(u'Ala ma kota z czarną sierścią - 1halo - halo1.')

print test1()
print test2()

print timeit.timeit(test1)
print timeit.timeit(test2)

The results and timings are:

[u'Ala', u'ma', u'kota', u'z', u'czarn\u0105', u'sier\u015bci\u0105', u'halo', u'halo']
[u'Ala', u'ma', u'kota', u'z', u'czarn\u0105', u'sier\u015bci\u0105', u'halo', u'halo']
11.0143377108
7.42619199741
Chameleon
  • What is your definition of "Unicode character"? "Unicode" covers *all* characters that are part of the Unicode specification. – Ignacio Vazquez-Abrams Feb 28 '15 at 16:46
  • maybe `re.compile(r'[^0-9_]',re.UNICODE)` – Aaron Feb 28 '15 at 16:47
  • You'll have to find all the ranges of desired characters yourself. – Malik Brahimi Feb 28 '15 at 16:47
  • Do you mean that you want to match all word characters (used to form words in any language) except the standard Latin characters A-Z and the standard digits 0-9? What about punctuation characters? Whitespace? Control characters? Symbolic characters (such as mathematical symbols)? The clearer you are about your requirements, the more likely you are to receive a good answer. – Bobulous Feb 28 '15 at 16:49
  • @Aaron `[^0-9_]` matches not only letters but spaces too, so it fails. – Chameleon Feb 28 '15 at 16:55
  • @IgnacioVazquez-Abrams For me, Unicode means letters such as 'ąćęłóńśżźĄĆĘŁÓŃŚŻŹ', though that should not exclude other letters. – Chameleon Feb 28 '15 at 16:56
  • Those are "non-ASCII Latin letters". – Ignacio Vazquez-Abrams Feb 28 '15 at 16:58
  • @IgnacioVazquez-Abrams What do you mean? – Chameleon Feb 28 '15 at 17:41
  • I would check what the docs say about regex constructs in Unicode mode. If it doesn't do properties, you should check the full extent of the `\w` construct, but I don't think that will be enough. –  Feb 28 '15 at 19:57

3 Answers

3

You can combine a negative lookahead with \w to match "word characters" excluding digits and underscores:

re.compile(r"(?:(?![\d_])\w)+", re.UNICODE)
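
To illustrate how the lookahead works, here is a minimal sketch written for Python 3, where `str` patterns are Unicode-aware by default and `re.UNICODE` is not required. The names `LETTER_RUN` and `find_letter_runs` are made up for this example; the sample sentence is the one used in the question.

```python
import re

# (?![\d_]) fails if the next character is a digit or an underscore;
# \w then consumes exactly one word character. Repeating the group
# therefore matches runs of word characters that are neither digits
# nor underscores.
LETTER_RUN = re.compile(r"(?:(?![\d_])\w)+")


def find_letter_runs(text):
    """Return all runs of non-digit, non-underscore word characters."""
    return LETTER_RUN.findall(text)


if __name__ == "__main__":
    print(find_letter_runs("Ala ma kota z czarną sierścią - 1halo - halo1."))
```

Because the lookahead is evaluated before every single character, this form does more work per character than a plain character class, which is consistent with the slower timing reported in the question.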
Blckknght
  • Failed `>>> re.findall(r'(?:(?![\d_])\w)+', 'Ala ma kota z czarną sierścią.', re.UNICODE) == ['Ala', 'ma', 'kota', 'z', 'czarn\xb9', 'sier', 'ci\xb9']` – Chameleon Feb 28 '15 at 17:00
  • I suspect this is an encoding issue with your string. It works for me using Python 3. If you're using Python 2, try putting a `u` before the quote of the string to make it a Unicode literal. – Blckknght Feb 28 '15 at 17:06
1

Use Unicode strings and a source encoding, then look for the characters you specified in your comment. Python 2.7 doesn't have a shortcut for "Unicode alpha characters":

# coding: utf8
import re
expr = re.compile(ur'(?u)[^\W\d_]+')
s = u'The quick brown fóx jumped over Łhe laży dog 17 times.'
for i in expr.finditer(s):
    print i.group(0)

Output:

The
quick
brown
fóx
jumped
over
Łhe
laży
dog
times
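
For readers on Python 3, the same character class can be sketched as follows. This is an assumption-laden port rather than the answer's own code: the `ur''` prefix and the `(?u)` flag are dropped because `str` patterns are already Unicode-aware there, and the names `UNICODE_WORD` and `unicode_words` are invented for the example.

```python
import re

# [^\W\d_] is a double negative: "not (a non-word character, a digit,
# or an underscore)". What remains is the set of word characters minus
# digits and underscores.
UNICODE_WORD = re.compile(r"[^\W\d_]+")


def unicode_words(text):
    """Return each run of word characters that excludes digits and underscores."""
    return [match.group(0) for match in UNICODE_WORD.finditer(text)]


if __name__ == "__main__":
    sample = "The quick brown fóx jumped over Łhe laży dog 17 times."
    for word in unicode_words(sample):
        print(word)
```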

Also see this answer if you want all of what Unicode considers upper and lowercase Unicode letters.

Community
Mark Tolonen
0

Try this; it matches any single character that is not a digit:

re.compile(r'\D')
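
A quick check of what `\D` accepts, as a small Python 3 sketch (the helper name `non_digit_runs` is hypothetical): the class excludes digits only, so spaces, underscores, and punctuation are matched along with letters.

```python
import re

# \D matches any character that is not a decimal digit, including
# whitespace, underscores, and punctuation.
NON_DIGITS = re.compile(r"\D+")


def non_digit_runs(text):
    """Return all maximal runs of non-digit characters."""
    return NON_DIGITS.findall(text)


if __name__ == "__main__":
    print(non_digit_runs("abc 12_de."))
```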