I have a regex which works perfectly in Python 2:

parts = re.split(r'\s*', re.sub(r'^\s+|\s*$', '', expression)) # split expression into 5 parts

This regex splits an expression into 5 parts. For example,

'a * b   =     c' will be split into ['a', '*', 'b', '=', 'c'],
'11 + 12 = 23' will be split into ['11', '+', '12', '=', '23'],
'ab   - c = d' will be split into ['ab', '-', 'c', '=', 'd'],

etc.

But in Python 3 this regex behaves quite differently:

'a * b   =     c' will be split into ['', 'a', '', '*', '', 'b', '', '=', '', 'c', ''],
'11 + 12 = 23' will be split into ['', '1', '1', '', '+', '', '1', '2', '', '=', '', '2', '3', ''],
'ab   - c = d' will be split into ['', 'a', 'b', '', '-', '', 'c', '', '=', '', 'd', ''],

In general, in Python 3, each character of a part becomes a separate part, and each run of removed whitespace (including the nonexistent leading and trailing ones) becomes an empty part ('') that is added to the list of parts.

This Python 3 behavior differs considerably from Python 2. Could anyone tell me why Python 3 changed this so much, and what the correct regex is to split an expression into 5 parts as in Python 2?

DYZ
dguan
  • Splitting on a potentially-zero-length pattern is wrong regardless. Not sure how Python 2 does what you're saying. Use `r'\s+'` instead. – o11c Dec 03 '18 at 01:53
  • It works similarly to Python 2 in version 3.6, though it warns about the non-empty pattern match, but yeah, use split(r'^\s+', – Harald Brinkhof Dec 03 '18 at 01:56
  • @o11c Splitting on zero-length patterns is not "wrong", it's quite a useful tool. The fact that Python has not supported this was a poor design decision above anything else, which [luckily got revised in Python 3.7](https://docs.python.org/3/library/re.html#re.split). – Tomalak Dec 03 '18 at 02:12

1 Answer

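The ability to split on zero-length matches was added to re.split() in Python 3.7. Because \s* can match the empty string, in 3.7+ it also matches at every position between non-whitespace characters, which is where the extra parts come from. When you change your split pattern to \s+ instead of \s*, the behavior will be as expected in 3.7+ (and unchanged in Python < 3.7):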

import re

def parts(string):
    return re.split(r'\s+', re.sub(r'^\s+|\s*$', '', string))

test:

>>> print(parts('a * b   =     c'))
['a', '*', 'b', '=', 'c']
>>> print(parts('ab   - c = d'))
['ab', '-', 'c', '=', 'd']
>>> print(parts('11 + 12 = 23'))
['11', '+', '12', '=', '23']
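
If the goal is only to break the expression into whitespace-separated tokens, two alternatives avoid the zero-length-match question entirely and should give the same result on Python 2 and 3. This is a minimal sketch, not part of the original answer; the function names parts_plain and parts_findall are made up for illustration.

```python
import re

def parts_plain(expression):
    # Hypothetical helper: str.split() with no arguments splits on runs
    # of whitespace and ignores leading/trailing whitespace, so neither
    # a regex nor a separate stripping step is needed.
    return expression.split()

def parts_findall(expression):
    # Hypothetical helper: instead of splitting on the separators,
    # collect the runs of non-whitespace characters directly.
    return re.findall(r'\S+', expression)

for text in ['a * b   =     c', '11 + 12 = 23', '  ab   - c = d  ']:
    print(parts_plain(text), parts_findall(text))
```

Both helpers tolerate leading and trailing whitespace, so the re.sub() call that strips the input first is not required with either of them.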

The regex module, a drop-in replacement for re, has a "V1" mode that makes existing patterns behave like they did before Python 3.7 (see this answer).

Tomalak