How to split string by space and treat special characters as a separate word in Python?

Question

Assume I have a string,

"I want that one, it is great."

I want to split up this string to be

["I", "want", "that", "one", ",", "it", "is", "great", "."]

Special characters such as ",.:;" (and possibly others) should each be treated as a separate word.

Is there any easy way to do this with Python 2.7?

Update

For an example such as "I don't.", it should be ["I", "don", "'", "t", "."]. Ideally it would also work with non-English punctuation such as ؛ and others.

how would you handle words like `"don't"`? Would you have `['don', ''', 't']`? — R Nar, May 25 '16 at 18:46
I am not very experienced in Python, but in C# you would just use the string.Split() method with a character array containing a space and then the special characters — Brady W, May 25 '16 at 18:54

score 1 · Answer 1 · edited May 23 '17 at 11:44

See here for a similar question. The answer there applies to you as well:

import re

# The capturing group makes re.split keep each non-word character it splits on.
print(re.split(r'(\W)', "I want that one, it is great."))
print(re.split(r'(\W)', "I don't."))

You can remove the spaces and empty strings returned by re.split using a filter:

s = "I want that one, it is great."
# Drop the single spaces and empty strings; punctuation tokens are kept.
print(list(filter(lambda tok: tok not in (' ', ''), re.split(r'(\W)', s))))
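For reuse, the two steps above (split on a captured \W, then filter) can be wrapped in a small helper. This is only a sketch: the name split_keep_punct is made up for illustration, and it drops any whitespace token via isspace() rather than only single spaces, which is a small deviation from the filter shown above. re.UNICODE is passed so that Python 2.7 also treats non-ASCII letters as word characters when given a unicode string.

```python
import re

def split_keep_punct(text):
    """Split text into words, with each non-word, non-space character as its own token."""
    # The capturing group makes re.split return the separators as well.
    pieces = re.split(r'(\W)', text, flags=re.UNICODE)
    # Drop empty strings and whitespace separators; keep punctuation.
    return [p for p in pieces if p and not p.isspace()]

print(split_keep_punct("I want that one, it is great."))
print(split_keep_punct("I don't."))
```

The first call should give the nine tokens from the question, and the second should give I, don, ', t and . as separate items.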

edited May 23 '17 at 11:44

Community

answered May 25 '16 at 18:52

Greg Sadetsky

Dan · Answer 2 · 2016-05-25T18:56:18.807

1

You can do this with a regex and a simple list comprehension. The regex splits the string into words and runs of non-word characters, and the list comprehension strips the surrounding whitespace and drops the empty strings.

import re

s = "I want that one, it is great. Don't do it."
# (\W+) captures each run of non-word characters; strip() trims the spaces around it.
new_s = [c.strip() for c in re.split(r'(\W+)', s) if c.strip() != '']
print(new_s)

The output of new_s will be:

['I', 'want', 'that', 'one', ',', 'it', 'is', 'great', '.', 'Don', "'", 't', 'do', 'it', '.']
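One thing to watch for with (\W+): adjacent punctuation marks end up in a single token, because the pattern captures the whole run. If each mark should be separate, one possible alternative (not part of this answer) is re.findall with \w+|[^\w\s]. A quick comparison, with function names invented for the example:

```python
import re

def split_runs(text):
    # The approach above: each run of non-word characters is one token.
    return [c.strip() for c in re.split(r'(\W+)', text) if c.strip() != '']

def split_each(text):
    # Alternative sketch: a word, or one non-word, non-space character.
    return re.findall(r'\w+|[^\w\s]', text, flags=re.UNICODE)

s = 'He said, "no!"'
print(split_runs(s))
print(split_each(s))
```

With this input the first version keeps the comma, space and opening quote together as one token (and likewise the exclamation mark and closing quote), while the second returns every mark on its own.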

edited May 25 '16 at 18:56

answered May 25 '16 at 18:53

Dan

This is promising but is there a way to trim strings to avoid extra spaces in `", "` – Mohamed Taher Alrefaie May 25 '16 at 18:55
@M-T-A yes, just fixed – Dan May 25 '16 at 18:56
@EoinS I just fixed that – Dan May 25 '16 at 18:59

unutbu · Accepted Answer · 2016-05-25T19:52:04.913

1

In [70]: re.findall(r"[^,.:;' ]+|[,.:;']", "I want that one, it is great.")
Out[70]: ['I', 'want', 'that', 'one', ',', 'it', 'is', 'great', '.']

In [76]: re.findall(r"[^,.:;' ]+|[,.:;']", "I don't.")
Out[76]: ['I', 'don', "'", 't', '.']

The regex [^,.:;' ]+|[,.:;'] matches either (a run of one or more characters other than ,, ., :, ;, ' or a literal space), or (a single one of the literal characters ,, ., :, ;, or ').
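If the set of special characters needs to change, the pattern can be built from a string instead of being typed by hand. The following is a sketch under that assumption; make_tokenizer and its specials parameter are hypothetical names, and the string passed in is assumed to be non-empty.

```python
import re

def make_tokenizer(specials=",.:;'"):
    """Return a function that splits on spaces and isolates each char in specials."""
    # Escape each character so that ones like ']' or '^' are safe inside [...].
    cls = "".join(re.escape(ch) for ch in specials)
    pattern = re.compile(r"[^{0} ]+|[{0}]".format(cls))
    return pattern.findall

tokenize = make_tokenizer()
print(tokenize("I want that one, it is great."))

# Adding more characters only requires extending the string.
tokenize_more = make_tokenizer(",.:;'!?\"")
print(tokenize_more('He said, "I don\'t!"'))
```

Any character not listed in specials stays glued to its neighbours, so with the default set "great!" would remain a single token.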

Or, using the regex module, you could easily expand this to include all punctuation and symbols by using the [:punct:] character class:

In [77]: import regex

In Python2:

In [4]: regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u"""A \N{ARABIC SEMICOLON} B""")
Out[4]: [u'A', u'\u061b', u'B']

In [6]: regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u"""He said, "I don't!" """)
Out[6]: [u'He', u'said', u',', u'"', u'I', u'don', u"'", u't', u'!', u'"']

In Python3:

In [105]: regex.findall(r"[^[:punct:] ]+|[[:punct:]]", """A \N{ARABIC SEMICOLON} B""")
Out[105]: ['A', '؛', 'B']

In [83]: regex.findall(r"[^[:punct:] ]+|[[:punct:]]", """He said, "I don't!" """)
Out[83]: ['He', 'said', ',', '"', 'I', 'don', "'", 't', '!', '"']

Note that you must pass a unicode object, not a byte str, as the second argument to regex.findall if you want [:punct:] to match Unicode punctuation or symbols.

In Python2:

import regex
print(regex.findall(r"[^[:punct:] ]+|[[:punct:]]", 'help؛'))
print(regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u'help؛'))

prints

['help\xd8\x9b']
[u'help', u'\u061b']
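The same str-versus-unicode pitfall can be reproduced with only the standard library. The sketch below is written for Python 3, where bytes plays the role of Python 2's str and str the role of unicode. It uses re with \w+|[^\w\s] as a rough stand-in for the [:punct:] pattern (an approximation, since \w also counts the underscore as a word character), so the exact tokens differ from the regex output above.

```python
import re

raw = u"help\u061b".encode("utf-8")   # bytes, like a Python 2 str read from a file
text = raw.decode("utf-8")            # decoded text, like a Python 2 unicode

# On bytes, the two UTF-8 bytes of the Arabic semicolon are matched one by one.
print(re.findall(br"\w+|[^\w\s]", raw))

# On decoded text, the semicolon is a single punctuation token.
print(re.findall(r"\w+|[^\w\s]", text))
```

Either way the lesson is the same: decode the input first, then tokenize.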

edited May 25 '16 at 19:52

answered May 25 '16 at 18:58

unutbu

Will the second solution work for non-English punctuation such as `،` and `؛`? – Mohamed Taher Alrefaie May 25 '16 at 19:12
Indeed it does. When applied to unicode, `[:punct:]` matches any unicode character in the [Punctuation \p{P} or Symbol \p{S} category](http://www.regular-expressions.info/unicode.html). See also, http://www.regular-expressions.info/posixbrackets.html. – unutbu May 25 '16 at 19:21
This is going well. But the regex failed on this one `help؛`, the output was just the same string. Any ideas? – Mohamed Taher Alrefaie May 25 '16 at 19:41
Are you passing a `unicode` as the second argument to `regex.findall`? I've added an example above showing what happens when a `str` (such as `'help؛'`) is passed instead of a `unicode` (such as `u'help؛'`). If this is the problem, then you can fix it by decoding the `str` with the appropriate encoding (e.g. `'help؛'.decode('utf-8')` ) to make a `unicode`. – unutbu May 25 '16 at 19:50
That worked. Thank you very much. How could you be so knowledgeable, patient and funny (I read your "About me") all at the same time? – Mohamed Taher Alrefaie May 25 '16 at 20:35

score 0 · Answer 4 · answered May 25 '16 at 18:54

I don't know of any functions that can do this but you could use a for loop.

Something like this:

tokens = []
word = ""
for ch in stringName:
    if ch.isalnum():
        word += ch                # still inside a word
    else:
        if word:
            tokens.append(word)   # the current word just ended
            word = ""
        if ch != " ":
            tokens.append(ch)     # a special character becomes its own word
if word:
    tokens.append(word)           # flush the last word

Hope this works...sorry if it is not the best way
