I would like to know whether there is a recommended regex pattern that matches both English and non-English characters. So far I have come up with [^\x00-\x7F]+|[a-zA-Z'-]* based on an answer on SO. My solution seems to work, but since I am very new to regex I would like to ask you to check this pattern and suggest improvements. I am aware of most existing answers on this subject, like this one, but I don't think there is a good regex for this yet.
1 Answer
The answer depends mostly on the language. In general, though, you have to enable the "unicode flag" (usually by prepending (?u) to your regex, or by appending /u) and use unicode strings. This way, \w, \s and others will correctly match the corresponding unicode characters.
An example in Python 2 (in Python 3, str patterns are Unicode-aware by default):
>>> re.match('\w', 'è') # byte string, no unicode flag: no match
>>> re.match('(?u)\w', u'è') # unicode string and unicode flag: match
<_sre.SRE_Match object at 0x7f258bac07e8>
>>> re.match('\w', u'è', re.UNICODE) # another way to enable the unicode flag
<_sre.SRE_Match object at 0x7f258bac0850>
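For Python 3, the same idea can be sketched as follows. This is only an illustration, assuming that "word" means a run of letters from any script, optionally joined by an apostrophe or a hyphen as in the pattern from the question. The names WORD and find_words are made up for this example, and [^\W\d_] is only an approximation of "any letter" (it means "a \w character that is neither a decimal digit nor an underscore").

```python
import re

# Hypothetical pattern for this sketch: one or more letters, optionally
# followed by groups of (apostrophe or hyphen) + more letters.
# In Python 3, \w and \d are Unicode-aware by default for str patterns.
WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def find_words(text):
    """Return every run of letters (any script) found in text."""
    return WORD.findall(text)


def find_ascii_words(text):
    """Same pattern, but restricted to ASCII via the re.ASCII flag."""
    return re.findall(WORD.pattern, text, re.ASCII)


if __name__ == "__main__":
    sample = "caf\u00e9 don't 123 \u00fcber-cool"
    print(find_words(sample))        # Unicode-aware matching
    print(find_ascii_words(sample))  # accented letters are no longer letters
```

Note that this relies on the text using precomposed characters. If the input may contain combining marks, normalizing it first with unicodedata.normalize("NFC", text) is probably a good idea.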

Andrea Corbellini
How to use it in regex101.com and in AutoHotKey? – menteith Mar 26 '16 at 19:30
@menteith: I'm not familiar with regex101 and I don't know what AutoHotKey is, sorry! Try googling "AutoHotKey unicode regex" and, by the way, update your question adding the [tag:autohotkey] tag and stating explicitly that your question is about AutoHotKey (otherwise your question might be closed as off-topic) – Andrea Corbellini Mar 26 '16 at 19:33