3

I am splitting a string using the approach from "Python strings split with multiple separators":

import re
DATA = "Hey, you - what are you doing here!?"
print(re.findall(r'\w+', DATA))
# Prints ['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']

I want to get a separate list of what's in between the matched words:

[", ", " - ", " ", " ", " ", " ", "!?"]

How do I do this?

jedierikb

3 Answers

5
print(re.findall(r'\W+', DATA))  # note the upper-case "W"

yields the list you are looking for:

[', ', ' - ', ' ', ' ', ' ', ' ', '!?']

I used \W+ rather than \w+; \W is the negation of the character class you were using.

   \w  Matches word characters, i.e., letters, digits, and underscores.
   \W  Matches non-word characters, i.e., the negated version of \w
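
Since every character is matched by exactly one of \w and \W, the two findall results alternate and together cover the whole string. A minimal sketch of that relationship, assuming (as in the question's DATA) that the string starts with a word and ends with a separator; the variable names are only illustrative:

```python
import re

DATA = "Hey, you - what are you doing here!?"

words = re.findall(r'\w+', DATA)       # runs of word characters
separators = re.findall(r'\W+', DATA)  # runs of everything else

# Interleave the two lists; for this input that rebuilds the original string.
rebuilt = "".join(word + sep for word, sep in zip(words, separators))
print(rebuilt == DATA)
```

For input that starts with a separator or ends with a word, the two lists differ in length and this simple interleaving would no longer line up.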

This Regular Expression Reference Sheet might be helpful in selecting the best character classes and metacharacters for your regular expression searches. Also, see this tutorial for more information (especially the reference section toward the bottom of the page).

Levon
3

How about using \W, the complement of \w? Also, instead of building a separate list, it's probably more efficient to capture both in one pass. (Although of course it depends on what you intend to do with the result.)

>>> re.findall(r'(\w+)(\W+)', DATA)
[('Hey', ', '), ('you', ' - '), ('what', ' '), ('are', ' '), ('you', ' '), ('doing', ' '), ('here', '!?')]

If you really want separate lists, just zip it:

>>> list(zip(*re.findall(r'(\w+)(\W+)', DATA)))
[('Hey', 'you', 'what', 'are', 'you', 'doing', 'here'), (', ', ' - ', ' ', ' ', ' ', ' ', '!?')]
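
One caveat with the paired pattern: (\w+)(\W+) needs at least one non-word character after each word, so a word at the very end of the string is not captured. A small sketch of that behaviour, using a hypothetical input TEXT that ends with a word, and \W* as one possible way to keep the last word:

```python
import re

TEXT = "Hey, you"  # hypothetical input that ends with a word, unlike DATA

# \W+ requires a separator after every word, so the trailing "you" is lost.
strict = re.findall(r'(\w+)(\W+)', TEXT)

# \W* allows an empty separator, so the trailing word is kept.
lenient = re.findall(r'(\w+)(\W*)', TEXT)

print(strict)
print(lenient)
```

With \W*, the final pair carries an empty string as its separator, which you may want to drop afterwards.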
kojiro
0

You can also use re.split:

import re
DATA = "Hey, you - what are you doing here!?"
print(re.split(r'\w+', DATA))
# Prints ['', ', ', ' - ', ' ', ' ', ' ', ' ', '!?']

You might also want to filter out empty strings to match what you asked for exactly.
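
The filtering step could look like the sketch below. The list comprehension is just one way to do it, and the names parts and separators are illustrative:

```python
import re

DATA = "Hey, you - what are you doing here!?"

# Splitting on word runs leaves an empty string wherever the text
# starts or ends with a word.
parts = re.split(r'\w+', DATA)

# Keep only the non-empty pieces.
separators = [part for part in parts if part]
print(separators)
```

The leading empty string appears because DATA starts with a word; a string that also ends with a word would produce a trailing one as well.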

Steven