1

I want something like [A-z] that counts for all alphabetic characters plus stuff like ö, ä, ü etc.

If i do [A-ü] i get probably all special characters used by latin languages but it also allows other stuff like ¿¿]|{}[¢§øæ¬°µ©¥

Example: https://regex101.com/r/tN9gA5/2

Edit: I need this in python2.

yamm
  • 1,523
  • 1
  • 15
  • 25

2 Answers2

4

Depending on what regular expression engine you are using, you could use the ^\p{L}+$ regular expression. The \p{L} denotes a unicode letter:

In addition to complications, Unicode also brings new possibilities. One is that each Unicode character belongs to a certain category. You can match a single character belonging to the "letter" category with \p{L}

Source

This example should illustrate what I am saying. It seems that the regex engine on Regex101 does support this, you just need to select PCRE (PHP) fromo the top left.

npinti
  • 51,780
  • 5
  • 72
  • 96
  • this is probably the best idea but i am using python and the standard regex library does not support this. – yamm Apr 08 '15 at 09:04
1

When you use [A-z], you are not only capturing letters from "A" to "z", you also capture some more non-letter characters: [ \ ] ^ _ `.

In Python, you can use [^\W\d_] with re.U option to match Unicode characters (see this post).

Here is a sample based on your input string.

Python example:

import re
r = re.search(
    r'(?P<unicode_word>[^\W\d_]*)',
    u'TestöäüéàèÉÀÈéàè',
    re.U
)

print r.group('unicode_word')
>>> TestöäüéàèÉÀÈéàè
Community
  • 1
  • 1
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563