21

How can I match a letter from any language using a regex in Python 3?

re.match('[a-zA-Z]') will match English-language characters, but I want all languages to be supported simultaneously.

I don't wish to match the ' in can't, or underscores, or any other kind of formatting. I do wish my regex to match: c, a, n, t, Å, é, and 中.

Baz
  • I can't think of a logical way to go about this. Most languages do not match the english alphabet. For instance, if you tried to match a 'k' in japanese you wouldn't be able to do it. Their language only contains 'ka' 'ki' 'ku' 'ke' 'ko' but they are represented by symbols so a K would not match to any specific character. For this to work, you would essentially need to take a language and "translate" it into an english equivalent. So if you encountered "good morning" in japanese こんにちは you would have to "translate" that to "konnichiwa" before doing a regex match. – Tony318 Aug 26 '11 at 14:57
  • 3
    @Tony318 I happen to have majored in Japanese ... That approach is wrong on so many levels... – ty812 Aug 26 '11 at 15:02
  • 3
    possible duplicate of http://stackoverflow.com/questions/2039140/python-re-how-do-i-match-an-alpha-character – Marty Aug 26 '11 at 15:04
  • @Martin Despite it being "right" or "wrong" how else could you possibly go about using a regular expression to match an alphabet to a phonetic syllabary? – Tony318 Aug 26 '11 at 15:05
  • @Marty, good catch, the answer there even used `\d` in the negated character class (unlike my puny answer). – Frédéric Hamidi Aug 26 '11 at 15:13
  • @Tony318: This is not only *not* hard, it is trivial with Unicode regular expressions, because Unicode has a derived property called Alphabetic that handles these. Note that the Unicode property Alphabetic is **not the same** as all the letters. Rather, it is GC=Letter + GC=Letter_Number + Other_Alphabetic, which picks up things like the Greek iota subscript and the circled letters. – tchrist Aug 26 '11 at 15:59
  • I think what you want here is not a regex, but Python's built-in [isalpha](https://docs.python.org/2/library/stdtypes.html#str.isalpha). For example: `str([c for c in s if c.isalpha()])` – Adam Bittlingmayer Mar 10 '16 at 12:16

7 Answers

23

For Unicode regex work in Python, I very strongly recommend the following:

  1. Use Matthew Barnett’s regex library instead of standard re, which is not really suitable for Unicode regular expressions.
  2. Use only Python 3, never Python 2. You want all your strings to be Unicode strings.
  3. Use only string literals with logical/abstract Unicode codepoints, not encoded byte strings.
  4. Set your encoding on your streams and forget about it. If you find yourself ever manually calling .encode and such, you’re almost certainly doing something wrong.
  5. Use only a wide build where code points and code units are the same, never ever ever a narrow one — which you might do well to consider deprecated for Unicode robustness.
  6. Normalize all incoming strings to NFD on the way in and then NFC on the way out. Otherwise you can’t get reliable behavior.

Once you do this, you can safely write patterns that include \w or \p{script=Latin} or \p{alpha} and \p{lower} etc. and know that these will all do what the Unicode Standard says they should. I explain all of this Python Unicode regex business in much more detail in this answer. The short story is to always use regex, not re.

For general Unicode advice, I also have several talks from last OSCON about Unicode regular expressions; only the third talk is specifically about Python, but much of the material is adaptable.

Finally, there’s always this answer to put the fear of God (or at least, of Unicode) in your heart.
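Point 6 of the list above can be illustrated with the standard library alone. This is a small sketch (not tchrist's own code) using `unicodedata.normalize` and stdlib `re` to show why unnormalized input gives unreliable matches:

```python
import re
import unicodedata

# NFD decomposes é into 'e' + U+0301 (combining acute accent).
decomposed = unicodedata.normalize("NFD", "café")
print(len(decomposed))                 # 5 code points

# Python's re does not count combining marks as \w, so the accent
# silently truncates the match on decomposed text:
print(re.findall(r"\w+", decomposed))  # ['cafe']

# After recomposing to NFC, the match behaves as expected:
composed = unicodedata.normalize("NFC", decomposed)
print(re.findall(r"\w+", composed))    # ['café']
```

The same pattern gives different answers on canonically equivalent strings, which is exactly why normalizing on the way in matters.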

tchrist
  • yeah, except when you have to do it in a Redshift UDF, which is Python 2 only and doesn't support anything with an embedded C library, so Barnett's regex library is out – user433342 May 25 '23 at 20:43
7

What's wrong with using the \w special sequence?

# -*- coding: utf-8 -*-
# Python 2: the u'' literal plus re.UNICODE makes \w match Unicode letters.
import re
test = u"can't, Å, é, and 中ABC"
print re.findall(r'\w+', test, re.UNICODE)
Björn Lindqvist
  • 2
    `\w` matches also digit `[0-9]` and underscore `_` – Toto Aug 26 '11 at 15:26
  • 1
    Very good, that is the right answer under the standard Python library (although I always use Unicode literals myself). Note that according to [UTS#18](http://unicode.org/reports/tr18/#Categories), a “word” char à la `\w` covers 102,724 code points in Unicode 6.0 and is any GC=L (100,520), GC=M (1,492), GC=Nd (420), GC=Nl (224), or GC=Pc (10) code point. Python’s `re` is a bit dated so hasn’t kept up with the standard, but it is close-ish. You can use Matthew Barnett’s `regex` instead if you want to match the Unicode Standard exactly; it also provides `\p{alpha}`, which is what you want here. – tchrist Aug 26 '11 at 15:29
  • @M42: It’s rather more complicated than that, but yes. Python’s normal `re` library is [not good for Unicode](http://stackoverflow.com/questions/7063420/perl-compatible-regular-expression-pcre-in-python/7066413#7066413), although it’s close to [RL1.2a](http://unicode.org/reports/tr18/#Compatibility_Properties) but lacks basic properties per [RL1.2](http://unicode.org/reports/tr18/#Categories) and full properties per [RL2.7](http://www.unicode.org/reports/tr18/tr18-14.html#Full_Properties). For almost any Unicode regex work in Python you should use Matthew Barnett’s `regex` library instead. – tchrist Aug 26 '11 at 15:36
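As Toto's comment above notes, `\w` also pulls in digits and the underscore. A common stdlib-only workaround, shown here as a sketch, is the double-negated character class `[^\W\d_]`, which keeps Unicode letters while excluding both:

```python
import re

# [^\W\d_] = "word characters, minus digits and underscore",
# i.e. Unicode letters only, with nothing beyond the stdlib re module.
text = "can't, Å, é, 42, foo_bar, 中ABC"
print(re.findall(r"[^\W\d_]+", text))
# ['can', 't', 'Å', 'é', 'foo', 'bar', '中ABC']
```

Note that '42' disappears entirely and 'foo_bar' splits in two, which is the behavior the question asks for.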
4

You can match on

\p{L}

which matches any Unicode code point that represents a letter of a script. That is, assuming you actually have a Unicode-capable regex engine, which I really hope Python would have.
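A caveat worth making explicit: Python's standard `re` module does not support `\p{...}` property escapes at all; compiling such a pattern raises `re.error`. The third-party `regex` module does accept `\p{L}`. A quick stdlib-only check:

```python
import re

# Stdlib re rejects Unicode property escapes such as \p{L}.
try:
    re.compile(r"\p{L}")
except re.error as exc:
    print("re rejects \\p{L}:", exc)
```

So to use `\p{L}` in Python you need the `regex` package (pip install regex), not the standard library.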

Joey
1

Build a character class of all the characters you want to match. This might become very, very large. No, there is no regex shorthand for "All Kanji" ;)

Maybe it is easier to match for what you do not want, but even then, this class would become extremely large.

ty812
  • I didn't realise it would be so tricky... I'll start by making a histogram of the characters in all the text I wish to process... – Baz Aug 26 '11 at 15:03
  • That depends on the regex engine. You can match on the script property in some engines, such as Perl's where you can just select the Han script (those are *Han* characters, even though they're used by the Japanese as well and called Kanji there). – Joey Aug 26 '11 at 15:08
  • Sure, you can do that for *one* script a time - but not for *every* one at the same time (Let's not discusss the Han/Kanji problematics here... that is a long, bloody history, no, many characters are not exactly alike) – ty812 Aug 26 '11 at 15:12
  • My histogram:{'\x91': 3, '¡': 1, ' ': 559754, '£': 18, '$': 111, '&': 5, '.': 75528, '0': 1690, '2': 676, '4': 347, '6': 285, '8': 193, ':': 389, 'á': 2, 'ã': 1, 'b': 33405, 'å': 3, 'd': 75349, 'f': 36969, 'é': 35, 'h': 126063, 'j': 5085, 'í': 1, 'l': 89233, 'n': 141244, 'ñ': 6, 'p': 30443, 'ó': 7, 'r': 111273, 't': 201222, 'v': 20584, 'x': 2350, 'z': 2295, '\x92': 14, '\x94': 1, '!': 10896, '#': 568, '%': 5, "'": 33612, '+': 1, '-': 15667, '/': 12, '®': 1, '1': 968, '3': 332, '5': 520, '´': 3, '7': 232, '¶': 1, '9': 275, '=': 1, '?': 17280, '[': 2, 'a': 163349, 'à': 4, 'c': 47806, 'â': 1, ... – Baz Aug 26 '11 at 15:31
  • 1
    @Baz: I recommend against the histogram. BTW, You forgot to decode your text!!! U+0092 is `PRIVATE USE TWO` and U+0094 is `CANCEL CHARACTER`, which are really never going to be found in legitimate Unicode text. Those are not printable characters. which is why Python escaped them. Somehow you have lied about your decoding, or something lied to you — I’ll bet $100 that it was Microsoft, the usual culprit in these mendacities. Change your decoder from real Latin-1, meaning ISO-8859-1, to the bogus proprietary non-Latin-1 version called Windows-1252, and those should go away. – tchrist Aug 26 '11 at 15:56
  • @Martin Of course there is a regex shorthand for “all Kanji”: it’s `\p{Script=Han}`, or just `\p{Han}` for short. You won’t be able to get the thing’s actual “name” without going through something like the Unihan database, but you certainly know which code points are CJK ideographs! I regularly use pattern that restrict to certain scripts, like `[\p{Latin}\p{Greek}\p{Common}\p{Inherited}]`. – tchrist Aug 26 '11 at 22:41
1
import re

text = "can't, Å, é, and 中ABC"
print(re.findall(r'\w+', text))

This works in Python 3, but it also matches underscores. However, the following seems to do the job as I wish:

import regex

text = "can't, Å, é, and 中ABC _ sh_t"
print(regex.findall(r'\p{alpha}+', text))
Baz
0

As noted by others, it would be very difficult to keep an up-to-date database of all letters in all existing languages. But in most cases you don't actually need that, and it can be perfectly fine for your code to begin by supporting just a few chosen languages and adding others as needed.

The following simple code supports matching for Czech, German and Polish. The character sets can easily be obtained from Wikipedia.

import re

LANGS = [
    'ÁáČčĎďÉéĚěÍíŇňÓóŘřŠšŤťÚúŮůÝýŽž',   # Czech
    'ÄäÖöÜüẞß',                         # German
    'ĄąĆćĘęŁłŃńÓóŚśŹźŻż',               # Polish
    ]

pattern = '[A-Za-z{langs}]'.format(langs=''.join(LANGS))
pattern = re.compile(pattern)
result = pattern.findall('Žluťoučký kůň')

print(result)

# ['Ž', 'l', 'u', 'ť', 'o', 'u', 'č', 'k', 'ý', 'k', 'ů', 'ň']
Jeyekomon
-1

For the Portuguese language, try this one:

[a-zA-ZÀ-ú ]+
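One caveat with this range (a sketch, not part of the original answer): `À-ú` spans U+00C0 through U+00FA, which also includes the multiplication sign `×` (U+00D7) and division sign `÷` (U+00F7). Splitting the range around those two code points excludes them:

```python
import re

# À-ú accidentally includes × (U+00D7) and ÷ (U+00F7):
print(re.findall(r"[a-zA-ZÀ-ú ]+", "ação × pão"))
# ['ação × pão']  -- the × slips through

# Splitting the range around the two signs excludes them:
print(re.findall(r"[a-zA-ZÀ-ÖØ-öø-ú]+", "ação × pão"))
# ['ação', 'pão']
```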
Marcelo Gumiero