RegEx: A way to handle both English and non-English characters (and my solution)

Question

I would like to know whether there is a recommended RegEx pattern for matching both English and non-English characters. So far I have come up with [^\x00-\x7F]+|[a-zA-Z'-]* based on an answer provided at SO. My solution seems to work, but since I am very new to RegEx I would like to ask you to check this pattern and suggest some improvements. I am aware of most solutions that touch on this subject, like this, but I don't think there is already a good RegEx for this.

Andrea Corbellini · Answer 1 · 2016-03-26T19:17:10.513


The answer depends mostly on the language. In general, though, you have to enable the "unicode flag" (this is usually done by prepending (?u) to your regex, or by appending /u) and use unicode strings. This way, \w, \s and similar character classes will correctly match the corresponding unicode characters.

An example in Python 2 (Python 3 uses unicode by default):

>>> import re
>>> re.match('\w', 'è')  # byte string, no unicode flag: no match
>>> re.match('(?u)\w', u'è')  # unicode string and unicode flag: match
<_sre.SRE_Match object at 0x7f258bac07e8>
>>> re.match('\w', u'è', re.UNICODE)  # another way to enable the unicode flag
<_sre.SRE_Match object at 0x7f258bac0850>
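
For comparison, here is a minimal Python 3 sketch of the same idea applied to whole words. It is only an illustration, not a definitive pattern: the names WORD and find_words are made up for this example, and the choice to allow an apostrophe or hyphen inside a word is an assumption taken from the character class in the question. It relies on [^\W\d_], which matches any word character that is neither a decimal digit nor an underscore, so roughly "a letter in any script".

```python
import re

# Python 3: str is Unicode and \w is Unicode-aware by default.
# [^\W\d_] = a word character that is not a decimal digit or underscore.
# Internal apostrophes and hyphens are allowed (assumption from the question).
WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def find_words(text):
    """Return runs of letters, allowing single internal ' or - characters."""
    return WORD.findall(text)


print(find_words("naïve café o'clock well-known 123"))

# re.ASCII restores the ASCII-only behaviour of \w, so 'è' no longer matches.
print(re.match(r"\w", "è") is not None)
print(re.match(r"\w", "è", re.ASCII) is None)
```

Unlike [^\x00-\x7F]+, this does not treat non-ASCII punctuation or symbols as part of a word. Text that uses combining accents rather than precomposed characters may still be split at the accent, so normalizing the input first (for example with unicodedata.normalize) may be necessary.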

edited Mar 26 '16 at 19:17

answered Mar 26 '16 at 19:08


How to use it in regex101.com and in AutoHotKey? – menteith Mar 26 '16 at 19:30
@menteith: I'm not familiar with regex101 and I don't know what AutoHotKey is, sorry! Try googling "AutoHotKey unicode regex" and, by the way, update your question adding the [tag:autohotkey] tag and stating explicitly that your question is about AutoHotKey (otherwise your question might be closed as off-topic) – Andrea Corbellini Mar 26 '16 at 19:33
