I am new to Python regex and am trying to match non-whitespace ASCII characters in Python.

The following is my code:

import re

p = re.compile(r"[\S]{2,3}", re.ASCII)

p.search('1234')  # returns a match

p.search('你好吗')  # also returns a match, but why?

I have specified ASCII mode in re.compile, but p.search('你好吗') still returns a match. What am I doing wrong here?

jdhao
  • You have to use `\u`; see https://stackoverflow.com/questions/393843/python-and-regular-expression-with-unicode – Dickens A S Apr 14 '20 at 04:08
  • re.ASCII is not what you think, in this case – Nikos M. Apr 14 '20 at 04:21
  • @NikosM. so what does it mean? The doc says it will enforce ASCII mode if I understand correctly. – jdhao Apr 14 '20 at 05:27
  • @jdhao Non-space is still non-space, even in enforced ASCII mode. It does not matter whether the character is Unicode: it is not an ASCII whitespace character, so \S matches it. – Nikos M. Apr 14 '20 at 05:28
  • @NikosM. I kind of get the point. In ASCII mode, \S is equivalent to `[^ \t\n\r\f\v]`, so Unicode characters should match. Thanks!!! – jdhao Apr 14 '20 at 05:32

1 Answer

The re.A flag only affects what shorthand character classes match.

In Python 3.x, shorthand character classes are Unicode-aware: the behavior enabled by re.UNICODE/re.U in Python 2.x is ON by default. That means:

  • \d: Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])
  • \D: Matches any character which is not a decimal digit. (So, all characters other than those in the Nd Unicode category).
  • \w - Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. (So, \w+ matches each word in the string My name is Виктор)
  • \W - Matches any character which is not a word character. This is the opposite of \w. (So, it will not match any Unicode letter or digit.)
  • \s - Matches Unicode whitespace characters (it will match NEL, hard spaces, etc.)
  • \S - Matches any character which is not a whitespace character. (So, no match for NEL, hard space, etc.)
  • \b - word boundaries match locations between Unicode letters/digits and non-letters/digits or start/end of string.
  • \B - non-word boundaries match locations between two Unicode letters/digits, two non-letters/digits or between a Unicode non-letter/digit and start/end of string.
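
The default behavior listed above can be checked with a short sketch. The sample strings and variable names below are illustrative choices, not part of the original answer:

```python
import re

# Default (Unicode-aware) matching for str patterns in Python 3.
text = "My name is Виктор"
words = re.findall(r"\w+", text)  # \w also matches Cyrillic letters

# U+0969 is DEVANAGARI DIGIT THREE, a character in category Nd.
devanagari_three = "\u0969"
digit_match = re.fullmatch(r"\d", devanagari_three) is not None

# A hard (no-break) space and NEL both count as whitespace by default.
nbsp, nel = "\xa0", "\x85"
space_matches = [re.fullmatch(r"\s", ch) is not None for ch in (nbsp, nel)]

# Consequently \S finds nothing in a string made only of those characters.
non_space = re.search(r"\S", nbsp + nel)

print(words, digit_match, space_matches, non_space)
```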

If you want to disable this behavior, you use re.A or re.ASCII:

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).
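
As a rough illustration of the quoted documentation, the sketch below compares the re.ASCII flag with the inline (?a) flag and shows that byte patterns are unaffected. The sample text and variable names are assumptions made for the example:

```python
import re

text = "Wiktor Виктор"

# Default: Unicode-aware \w matches both words.
unicode_result = re.findall(r"\w+", text)

# The compile-time flag and the inline flag should behave the same way.
flag_result = re.compile(r"\w+", re.ASCII).findall(text)
inline_result = re.compile(r"(?a)\w+").findall(text)

# Byte patterns are ASCII-only already, so the flag changes nothing there.
data = text.encode("utf-8")
bytes_plain = re.findall(rb"\w+", data)
bytes_flagged = re.findall(rb"\w+", data, re.ASCII)

print(unicode_result, flag_result, inline_result, bytes_plain, bytes_flagged)
```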

That means that:

  • \d = [0-9] - and no longer matches Hindi, Bengali, etc. digits
  • \D = [^0-9] - and matches any character other than an ASCII digit, so it now additionally matches the non-ASCII digits, i.e. whatever (?u)(?![0-9])\d matches
  • \w = [A-Za-z0-9_] - and it only matches ASCII word characters now: Wiktor is matched with \w+, but Виктор is not
  • \W = [^A-Za-z0-9_] - it matches any char but ASCII letters/digits/_ (i.e. it matches 你好吗, Виктор, etc.)
  • \s = [ \t\n\r\f\v] - matches a regular space, tab, linefeed, carriage return, form feed and a vertical tab
  • \S = [^ \t\n\r\f\v] - matches any char other than a space, tab, linefeed, carriage return, form feed and a vertical tab, so it matches all Unicode letters, digits and punctuation, as well as Unicode (non-ASCII) whitespace. E.g., re.sub(r'\S+', r'{\g<0>}', '\xA0 ', flags=re.A) returns '{\xa0} ': the \S now matches the hard space.
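
The ASCII-mode equivalences above can be sketched as follows. The last pattern, `[!-~]`, is not from the original answer: it is one possible way to match only printable, non-whitespace ASCII characters (U+0021 to U+007E), assuming that is what the question is after. All variable names are illustrative.

```python
import re

# With re.ASCII, \S means [^ \t\n\r\f\v], so CJK characters still match.
ascii_mode = re.compile(r"\S{2,3}", re.ASCII)
cjk_match = ascii_mode.search("你好吗")

# A non-ASCII digit is "not a digit" in ASCII mode, so \D matches it.
non_ascii_digit_is_D = re.fullmatch(r"\D", "\u0969", re.ASCII) is not None

# The hard space is not ASCII whitespace, so \S+ wraps it in braces.
wrapped = re.sub(r"\S+", r"{\g<0>}", "\xa0 ", flags=re.A)

# Assumption: an explicit range restricts matches to printable ASCII.
printable_ascii = re.compile(r"[!-~]{2,3}")
ascii_hit = printable_ascii.search("1234")
cjk_miss = printable_ascii.search("你好吗")
mixed = printable_ascii.findall("ab 你好 cd!")

print(cjk_match, non_ascii_digit_is_D, repr(wrapped), ascii_hit, cjk_miss, mixed)
```

An explicit character range does not depend on re.ASCII at all, which avoids the surprise described in the question; whether control characters should also count as "non-whitespace ASCII" is left open here.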
Wiktor Stribiżew