I searched Stack Overflow quite a bit for an answer and nothing popped out. It's still not obvious after reading the link provided, but I understand. Perhaps keeping this post will help future people who think like I do.

I have reduced my 3.7 vs 2.7 issue down to a very simple code snippet:

import re
myStr = "Mary   had a little lamb.\n"
reg_exp = re.compile('[ \\n\\r]*')
reg_exp.split(myStr)

['', 'M', 'a', 'r', 'y', '', 'h', 'a', 'd', '', 'a', '', 'l', 'i', 't', 't', 'l', 'e', '', 'l', 'a', 'm', 'b', '.', '', '']

In Python 2.7 I get full word tokens. I would like to keep the greedy * in the compile line without splitting on individual characters.

If I don't include the greedy * I get extra empty strings.

reg_exp = re.compile('[ \\n\\r]')
reg_exp.split(myStr)

['Mary', '', '', 'had', 'a', 'little', 'lamb.', '']

I would like to have my cake and eat it too! This is what I want:

['Mary', 'had', 'a', 'little', 'lamb.']

I've tried all sorts of things like various flags. I'm missing something very basic. Can you help? Thanks!

Dave G
  • Is it Python 3.7? Actually, what output do you want to get in all cases? – Wiktor Stribiżew Aug 17 '18 at 12:15
  • Perhaps you want `+` instead of `*`? As it is, you're allowing the split to occur wherever there are 0 or more spaces, which is *everywhere*. – jasonharper Aug 17 '18 at 12:29
  • I tried and tried again after your marking as duplicate. If you would be so kind as to provide the link I could make some progress correctly classifying this question. I got my answer so I'm glad I asked anyway.....Thanks! – Dave G Aug 17 '18 at 15:30

2 Answers

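[ \\n\\r]* can match the empty string.

So the correct behavior is to split at every position where the pattern matches, which includes the gap between each pair of letters. Python versions prior to 3.7 ignored empty matches when splitting, but version 3.7 fixes that.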
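To see the empty match directly, here is a minimal sketch that uses the same character class as the question, written as a raw string. The variable names are only illustrative.

```python
import re

# Same character class as in the question, written as a raw string.
pattern = re.compile(r'[ \n\r]*')

# "Mary" does not start with whitespace, yet the match still succeeds,
# because * is allowed to consume zero characters.
m = pattern.match("Mary")
print(m is not None)    # True
print(repr(m.group()))  # ''
print(m.span())         # (0, 0)
```

Because a zero-width match like this is possible at every position, a split that honors empty matches cuts the string between every pair of characters.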

You want to replace * with +:

reg_exp = re.compile('[ \\n\\r]+')
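A short sketch of how this behaves on the string from the question. Note that the input ends with "\n", so the split still leaves one trailing empty string. Filtering it out, or using str.split() (which splits on any whitespace, a slightly broader set than the original character class), are two possible ways to get the list the question asks for. The names my_str, tokens, and words are only illustrative.

```python
import re

my_str = "Mary   had a little lamb.\n"

# + requires at least one whitespace character, so no empty matches occur.
tokens = re.split(r'[ \n\r]+', my_str)
print(tokens)  # ['Mary', 'had', 'a', 'little', 'lamb.', '']

# The trailing '' comes from the final "\n"; drop empty strings if unwanted.
words = [t for t in tokens if t]
print(words)   # ['Mary', 'had', 'a', 'little', 'lamb.']

# Alternative: str.split() with no arguments splits on runs of any
# whitespace and never returns empty strings.
print(my_str.split())  # ['Mary', 'had', 'a', 'little', 'lamb.']
```

If tabs or other whitespace should not count as separators, the filtered re.split version stays closer to the original pattern.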

See the re.split documentation for Python 3.6 and Python 3.7.

pacholik

Use + instead of *.

* repeats 0 or more times, so it also matches "" and the string gets split between every character.

+ repeats 1 or more times, so it only matches when at least one of the listed characters is found.

matejcik