Python regex splitting on multiple whitespaces

Question

I am expecting a user input string which I need to split into separate words. The user may input text delimited by commas or spaces.

So for instance the text may be:

hello world this is John. or

hello world this is John or even

hello world, this, is John

How can I efficiently parse that text into the following list?

['hello', 'world', 'this', 'is', 'John']

Thanks in advance.

possible duplicate of [Split string on whitespace in python](http://stackoverflow.com/questions/8113782/split-string-on-whitespace-in-python) — Robin, Apr 29 '14 at 10:23
Problem is I don't know if the user will use commas or whitespaces. Therefore I need a solution to cover it all. — stratis, Apr 29 '14 at 10:25
My bad, didn't see the commas. The title is kind of misleading. Have you looked into `re.split`? Where is your current attempt failing? — Robin, Apr 29 '14 at 10:28

Mr. Polywhirl · Answer 1 · 2014-04-29T10:30:43.893

4

Use the regular expression r'[\s,]+' to split on one or more whitespace characters (\s) or commas (,).

import re

s = 'hello world,    this, is       John'
print(re.split(r'[\s,]+', s))

['hello', 'world', 'this', 'is', 'John']
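One edge case worth knowing, shown here as a minimal sketch: if the input starts or ends with a separator, re.split returns empty strings at the ends of the list. The helper name split_words is hypothetical and not part of the original answer; it uses re.findall with the complementary character class to collect the words directly.

```python
import re

def split_words(text):
    """Return the runs of characters that are neither whitespace nor commas."""
    # [^\s,]+ is the complement of the separator class [\s,]+
    return re.findall(r'[^\s,]+', text)

# Sample input with leading and trailing separators (made up for illustration)
s = '  hello world,    this, is       John, '

print(re.split(r'[\s,]+', s))
# ['', 'hello', 'world', 'this', 'is', 'John', '']

print(split_words(s))
# ['hello', 'world', 'this', 'is', 'John']
```

Stripping the separators from both ends before calling re.split, or filtering out empty strings afterwards, would presumably work as well.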

edited Apr 29 '14 at 10:30

answered Apr 29 '14 at 10:24

Mr. Polywhirl


score 3 · Accepted Answer · edited Jun 20 '20 at 09:12

3

Since you need to split on spaces and other special characters, the best regex would be \W+. Quoting from the Python 2 re documentation:

\W

When the LOCALE and UNICODE flags are not specified, matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_]. With LOCALE, it will match any character not in the set [0-9_], and not defined as alphanumeric for the current locale. If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.

For example:

import re

data = "hello world,    this, is       John"
print(re.split(r"\W+", data))
# ['hello', 'world', 'this', 'is', 'John']

Or, if you know the exact set of characters to split on, you can list them in a character class:

print(re.split(r"[\s,]+", data))

This splits on any run of whitespace characters (\s) and commas (,).
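The two patterns agree on the sample string but differ on input that contains other punctuation. The sketch below uses a made-up input to show the difference: \W+ also splits on apostrophes and hyphens, while [\s,]+ leaves them inside the words.

```python
import re

# Hypothetical input containing an apostrophe and a hyphen
data = "John's e-mail, hello"

by_non_word = re.split(r"\W+", data)
by_space_comma = re.split(r"[\s,]+", data)

print(by_non_word)     # ['John', 's', 'e', 'mail', 'hello']
print(by_space_comma)  # ["John's", 'e-mail', 'hello']
```

Which one fits depends on whether punctuation inside a word should count as a delimiter for your input.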

edited Jun 20 '20 at 09:12


answered Apr 29 '14 at 10:26

thefourtheye


Thank you. Clean and effective solution. However, note that only `print re.split("[\s,]+", data)` worked. Maybe it's the fact that I'm under Windows. – stratis Apr 29 '14 at 10:42
Yes. The \W+ method returned an empty list for me. However, the re.split method worked perfectly well. – stratis Apr 29 '14 at 12:00
@Konos5 I actually tested it before posting here. So, if you could help me reproduce the problem with some sample data, it would be good :) – thefourtheye Apr 29 '14 at 12:09

Guestar · Answer 3 · 2014-04-29T10:35:11.153

1

>>> s = "hello      world this     is            John"
>>> s.split()
['hello', 'world', 'this', 'is', 'John']
>>> s = "hello world, this, is John"
>>> s.split()
['hello', 'world,', 'this,', 'is', 'John']

The first one is correctly parsed by split with no arguments ;)

Then you can strip the trailing commas:

>>> s = "hello world, this, is John"
>>> def notcoma(ss):
...     # Drop a single trailing comma, if present
...     if ss[-1] == ',':
...         return ss[:-1]
...     else:
...         return ss
...
>>> list(map(notcoma, s.split()))
['hello', 'world', 'this', 'is', 'John']
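A shorter regex-free variant, offered as a sketch rather than as this answer's method: replace every comma with a space first, then let split() with no arguments handle the runs of whitespace. The function name split_on_commas_and_spaces is hypothetical. Unlike stripping a trailing comma per word, this also handles commas that have no space after them.

```python
def split_on_commas_and_spaces(text):
    """Treat commas as whitespace, then split on runs of whitespace."""
    return text.replace(',', ' ').split()

print(split_on_commas_and_spaces("hello world, this, is John"))
# ['hello', 'world', 'this', 'is', 'John']

# A comma without a following space is still treated as a separator
print(split_on_commas_and_spaces("hello,world ,this"))
# ['hello', 'world', 'this']
```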

edited Apr 29 '14 at 10:35

answered Apr 29 '14 at 10:27

Guestar


He has to split based on special characters as well – thefourtheye Apr 29 '14 at 10:29
