Python 3.7 vs Python 2.7 RegEx Matching Behavior

Question

I searched Stack Overflow for an answer and nothing stood out. It's still not obvious to me after reading the link provided, but I understand now. Perhaps keeping this post will help future readers who think the way I do.

I have reduced my 3.7 vs 2.7 issue down to a very simple code snippet:

import re
myStr = "Mary   had a little lamb.\n"
reg_exp = re.compile('[ \\n\\r]*')
reg_exp.split(myStr)

['', 'M', 'a', 'r', 'y', '', 'h', 'a', 'd', '', 'a', '', 'l', 'i', 't', 't', 'l', 'e', '', 'l', 'a', 'm', 'b', '.', '', '']

In Python 2.7 I get full word tokens. I would like to keep the greedy `*` in the compile line without splitting on individual characters.

If I drop the greedy `*`, I get extra empty strings.

reg_exp = re.compile('[ \\n\\r]')
reg_exp.split(myStr)

['Mary', '', 'had', 'a', 'little', 'lamb.', '']

I would like to have my cake and eat it too! This is what I want:

['Mary', 'had', 'a', 'little', 'lamb.']

I've tried all sorts of things like various flags. I'm missing something very basic. Can you help? Thanks!

Is it Python 3.7? Actually, what output do you want to get in all cases? — Wiktor Stribiżew, Aug 17 '18 at 12:15
Perhaps you want `+` instead of `*`? As it is, you're allowing the split to occur wherever there are 0 or more spaces, which is *everywhere*. — jasonharper, Aug 17 '18 at 12:29
I tried and tried again after your marking as duplicate. If you would be so kind as to provide the link I could make some progress correctly classifying this question. I got my answer so I'm glad I asked anyway.....Thanks! — Dave G, Aug 17 '18 at 15:30

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

`[ \\n\\r]*` matches empty string

So the correct behavior is to split at every position, including between each pair of letters, because the empty match succeeds there. Python versions prior to 3.7 ignored empty matches in `re.split`, but version 3.7 fixes that.
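To see the empty match directly, here is a minimal sketch. The names `pattern` and `m` are only for illustration, and the raw string `r"[ \n\r]*"` is assumed to be equivalent to the `'[ \\n\\r]*'` pattern from the question.

```python
import re

# Same character class as in the question, written as a raw string.
pattern = re.compile(r"[ \n\r]*")

# "Mary" starts with a letter, so the class matches zero characters.
# The match still succeeds; it is simply empty.
m = pattern.match("Mary")
print(m.span())         # (0, 0)
print(repr(m.group()))  # ''
```

Because the pattern succeeds without consuming anything, every position in the string is a candidate split point.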

You want to replace `*` with `+`, so that a match needs at least one whitespace character:

reg_exp = re.compile('[ \\n\\r]+')
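A short sketch of that fix applied to the string from the question may help. The names `splitter`, `tokens` and `words` are made up for this example. Note that `+` alone still leaves one empty string at the end, because the trailing newline is itself a separator; the filter below is one possible way to drop it, not the only one.

```python
import re

text = "Mary   had a little lamb.\n"

# '+' needs at least one whitespace character, so the pattern can no
# longer match the empty string between letters.
splitter = re.compile(r"[ \n\r]+")
tokens = splitter.split(text)
print(tokens)  # ['Mary', 'had', 'a', 'little', 'lamb.', '']

# The trailing "\n" is a separator with nothing after it, which
# produces the final ''. Filtering out empty strings removes it.
words = [t for t in tokens if t]
print(words)   # ['Mary', 'had', 'a', 'little', 'lamb.']

# Alternative: str.split() with no arguments splits on runs of any
# whitespace (not only space, \n and \r) and never returns empties.
print(text.split())  # ['Mary', 'had', 'a', 'little', 'lamb.']
```

If tabs or other whitespace should not count as separators, the regex version with the explicit character class is probably the safer choice.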

See the Python 3.6 docs and the Python 3.7 docs.

answered Aug 17 '18 at 12:32 by pacholik; edited Jun 20 '20 at 09:12 by Community

Thank you. Works. – Dave G Aug 17 '18 at 15:30

score 0 · Answer 2 · answered Aug 17 '18 at 12:29

Use `+` instead of `*`.

`*` repeats 0 or more times, so it also matches the empty string "" and splits between every character.

`+` repeats 1 or more times, so it only matches when at least one separator character is found.
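The difference between the two quantifiers can be checked without `split` at all. This is a small sketch assuming Python 3.4 or later (for `fullmatch`); the variable names are illustrative only.

```python
import re

star = re.compile(r"[ \n\r]*")  # zero or more separator characters
plus = re.compile(r"[ \n\r]+")  # one or more separator characters

# Only the '*' version accepts the empty string.
star_on_empty = star.fullmatch("")
plus_on_empty = plus.fullmatch("")
print(star_on_empty is not None)  # True
print(plus_on_empty is not None)  # False

# Both accept a real run of separators.
star_on_run = star.fullmatch(" \n")
plus_on_run = plus.fullmatch(" \n")
print(star_on_run is not None)    # True
print(plus_on_run is not None)    # True
```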

answered Aug 17 '18 at 12:29 by matejcik
