1

I know this is probably a really easy question, but I'm struggling to split a string in Python. My regex has a capturing group around the separator, like this:

myRegex = r"(\W+)"

And I want to parse this string into words:

import re

testString = "This is my test string, hopefully I can get the word i need"
testAgain = re.split(r"(\W+)", testString)

Here are the results:

['This', ' ', 'is', ' ', 'my', ' ', 'test', ' ', 'string', ', ', 'hopefully', ' ', 'I', ' ', 'can', ' ', 'get', ' ', 'the', ' ', 'word', ' ', 'i', ' ', 'need']

This isn't what I expected. I expected the list to contain:

['This','is','my','test']......etc

Now I know it's something to do with the grouping in my regex, and I can fix the issue by removing the brackets. But how can I keep the brackets and get the result above?

Sorry about this question. I have read the official Python documentation on regex splitting with groups, but I still don't understand why the empty spaces are in my list.

Andrew

2 Answers

2

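As described in this answer, "How to split but ignore separators in quoted strings, in python?", you can simply slice the list once it's split. Because the pattern contains a capturing group, re.split also returns the matched separators, so words and separators alternate in the result. You want every other element, starting with the first one (the 1st, 3rd, 5th, 7th, and so on, i.e. indices 0, 2, 4, 6, ...).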

You can use the [start:end:step] slice notation, as shown below:

import re

testString = "This is my test string, hopefully I can get the word i need"
testAgain = re.split(r"(\W+)", testString)
testAgain = testAgain[0::2]  # keep the even indices (the words), drop the separators

Also, I must point out that \W matches any non-word character, including punctuation. If you want to keep your punctuation, you'll need to change your regex.
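If the parentheses are only there for grouping, a non-capturing group may be another option: re.split returns separators only when they are matched by a capturing group, so (?:\W+) keeps the parentheses without adding the separators to the list. A minimal sketch, reusing the question's testString (the name words is just for illustration):

```python
import re

testString = "This is my test string, hopefully I can get the word i need"

# (?:...) groups without capturing, so re.split discards the separators.
words = re.split(r"(?:\W+)", testString)
print(words)
# ['This', 'is', 'my', 'test', 'string', 'hopefully', 'I', 'can', 'get', 'the', 'word', 'i', 'need']
```

Note that if the string starts or ends with a non-word character, the result will contain an empty string at that end.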

ventsyv
0

You can simply do:

testAgain = testString.split()  # built-in split on runs of whitespace

Different regex ways of doing this:

import re
testAgain = re.split(r"\s+", testString)   # split on runs of whitespace
testAgain = re.findall(r"\w+", testString) # find all words (punctuation is dropped)
testAgain = re.findall(r"\S+", testString) # find all runs of non-whitespace characters
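These variants are not interchangeable on the test string: the whitespace-based ones leave the comma attached to "string", while \w+ drops it. A quick check, assuming the testString from the question (the by_* names are only for illustration):

```python
import re

testString = "This is my test string, hopefully I can get the word i need"

by_space = testString.split()
by_regex_space = re.split(r"\s+", testString)
by_word = re.findall(r"\w+", testString)
by_nonspace = re.findall(r"\S+", testString)

print(by_space[4])  # string,
print(by_word[4])   # string
# The three whitespace-based variants agree on this input.
print(by_space == by_regex_space == by_nonspace)  # True
```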
lycuid