3

I am expecting a user input string which I need to split into separate words. The user may input text delimited by commas or spaces.

So for instance the text may be:

hello world this is John. or

hello world this is John or even

hello world, this, is John

How can I efficiently parse that text into the following list?

['hello', 'world', 'this', 'is', 'John']

Thanks in advance.

stratis
  • Tried `r'/\s+/g'` yet? – Mr. Polywhirl Apr 29 '14 at 10:22
  • possible duplicate of [Split string on whitespace in python](http://stackoverflow.com/questions/8113782/split-string-on-whitespace-in-python) – Robin Apr 29 '14 at 10:23
  • Problem is I don't know if the user will use commas or whitespaces. Therefore I need a solution to cover it all. – stratis Apr 29 '14 at 10:25
  • My bad, didn't see the commas. The title is kind of misleading. Have you looked into `re.split`? Where is your current attempt failing? – Robin Apr 29 '14 at 10:28

3 Answers

4

Use the regular expression `r'[\s,]+'` to split on one or more whitespace characters (`\s`) or commas (`,`).

import re

s = 'hello world,    this, is       John'
print(re.split(r'[\s,]+', s))

['hello', 'world', 'this', 'is', 'John']
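One caveat with this pattern: if the input starts or ends with a delimiter, `re.split` returns empty strings at the edges. A minimal sketch of one way to filter them out, assuming empty tokens are never wanted (`split_words` is a hypothetical helper name, not part of the `re` module):

```python
import re

def split_words(text):
    # Hypothetical helper: split on runs of whitespace and/or commas,
    # then drop the empty strings that re.split leaves at the edges.
    return [word for word in re.split(r'[\s,]+', text) if word]

# Plain re.split keeps an empty string for the leading space
# and another one for the trailing ", ".
print(re.split(r'[\s,]+', ' hello world, '))

# The helper returns only the non-empty words.
print(split_words(' hello world, '))
```

Stripping the string before splitting would handle surrounding whitespace, but not a leading or trailing comma, which is why the sketch filters after the split instead.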

Mr. Polywhirl
3

Since you need to split on spaces and other special characters, the best regex would be `\W+`. Quoting the Python `re` documentation:

\W

When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]. With LOCALE, it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.

For example:

import re
data = "hello world,    this, is       John"
# Raw string so that \W reaches the regex engine unchanged.
print(re.split(r"\W+", data))
# ['hello', 'world', 'this', 'is', 'John']

Or, if you know the exact set of characters the string has to be split on, you can list them explicitly:

print(re.split(r"[\s,]+", data))

This splits on any run of whitespace characters (`\s`) and commas (`,`).
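The two patterns only agree when the input contains nothing but words, spaces, and commas. A small sketch of where they differ, using a made-up sample sentence that is not from the question:

```python
import re

# Made-up input with an apostrophe and a trailing period.
text = "don't stop, this is John."

# \W+ treats every non-word character as a delimiter, so the apostrophe
# splits "don't" and the trailing period leaves an empty string at the end.
by_non_word = re.split(r"\W+", text)

# [\s,]+ only splits on whitespace and commas; other punctuation stays
# attached to the neighbouring word.
by_space_or_comma = re.split(r"[\s,]+", text)

print(by_non_word)
print(by_space_or_comma)
```

Which behaviour is preferable depends on whether punctuation such as apostrophes and periods should survive in the resulting words.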

Community
thefourtheye
  • Thank you. Clean and effective solution. However, note that only `print re.split("[\s,]+", data)` worked. Maybe it's because I'm on Windows. – stratis Apr 29 '14 at 10:42
  • Yes. The `\W+` method returned an empty list for me. However, the re.split method worked perfectly well. – stratis Apr 29 '14 at 12:00
  • @Konos5 I actually tested it before posting here. So, if you could help me reproduce the problem with some sample data, it would be good :) – thefourtheye Apr 29 '14 at 12:09
1
>>> s = "hello      world this     is            John"
>>> s.split()
['hello', 'world', 'this', 'is', 'John']
>>> s = "hello world, this, is John"
>>> s.split()
['hello', 'world,', 'this,', 'is', 'John']

The first one is correctly parsed by split with no arguments ;)

For the second one, you can strip the trailing comma from each piece:

>>> s = "hello world, this, is John"
>>> def notcoma(ss):
...     # Drop a single trailing comma, if present.
...     if ss.endswith(','):
...         return ss[:-1]
...     return ss
...
>>> list(map(notcoma, s.split()))
['hello', 'world', 'this', 'is', 'John']
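The trailing-comma approach misses a comma with no space after it, such as "hello,world". A regex-free variation, sketched here under the assumption that commas should simply act like spaces (`split_on_commas_and_spaces` is a hypothetical name):

```python
def split_on_commas_and_spaces(text):
    # Turn every comma into a space, then let str.split() with no
    # arguments split on runs of whitespace and discard empty pieces.
    return text.replace(',', ' ').split()

print(split_on_commas_and_spaces("hello world, this, is John"))
print(split_on_commas_and_spaces("hello,world"))
```

Because `split()` with no arguments ignores leading and trailing whitespace, this version never returns empty strings.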
Guestar