I would like to know whether there is a recommended RegEx pattern that matches both English and non-English characters. So far I have come up with [^\x00-\x7F]+|[a-zA-Z'-]* based on an answer on SO. My solution seems to work, but since I am very new to RegEx I would like to ask you to check this pattern and suggest improvements. I am aware of most existing solutions on this subject, like this one, but I don't think a good RegEx for this exists yet.

menteith

1 Answer

The answer depends mostly on the language. In general, though, you have to enable the "unicode flag" (usually by prepending (?u) to your regex, or by appending /u) and use unicode strings. That way \w, \s and the other character classes will correctly match the corresponding unicode characters.

An example in Python 2 (Python 3 uses unicode by default):

>>> import re
>>> re.match('\w', 'è')  # byte string, no unicode flag: no match
>>> re.match('(?u)\w', u'è')  # unicode string and unicode flag: match
<_sre.SRE_Match object at 0x7f258bac07e8>
>>> re.match('\w', u'è', re.UNICODE)  # another way to enable the unicode flag
<_sre.SRE_Match object at 0x7f258bac0850>
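
For comparison, here is a minimal Python 3 sketch of the same idea. It is only an illustration, not a drop-in answer for other engines: the WORD pattern and the find_words helper are hypothetical names introduced here, and the character class [^\W\d_] is an approximation of "any letter" (it can also admit a few non-decimal numeric characters). The apostrophe and hyphen handling mirrors the ['-] part of the pattern in the question.

```python
import re

# Python 3: str patterns are Unicode-aware by default, so \w already
# covers accented and non-Latin letters without any flag.
# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# which is roughly "a letter in any script".
WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def find_words(text):
    """Return runs of letters, allowing apostrophes and hyphens inside a word."""
    return WORD.findall(text)


print(find_words("naïve café, don't stop"))

# re.ASCII switches \w back to [a-zA-Z0-9_], so "è" no longer matches.
print(re.findall(r"\w+", "è e", flags=re.ASCII))
```

Whether this is better than [^\x00-\x7F]+|[a-zA-Z'-]* depends on the input: the original pattern also accepts non-ASCII punctuation and symbols, while this sketch rejects them.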
Andrea Corbellini
  • How do I use it on regex101.com and in AutoHotKey? – menteith Mar 26 '16 at 19:30
  • @menteith: I'm not familiar with regex101 and I don't know what AutoHotKey is, sorry! Try googling "AutoHotKey unicode regex" and, by the way, update your question adding the [tag:autohotkey] tag and stating explicitly that your question is about AutoHotKey (otherwise your question might be closed as off-topic) – Andrea Corbellini Mar 26 '16 at 19:33