
Assume I have a string,

"I want that one, it is great."

I want to split this string into

["I", "want", "that", "one", ",", "it", "is", "great", "."]

Special characters such as ",.:;" (and possibly others) should each be treated as a separate word.

Is there any easy way to do this with Python 2.7?

Update

For an example such as "I don't.", the result should be ["I", "don", "'", "t", "."]. Ideally it would also work with non-English punctuation such as ؛ and others.

Mohamed Taher Alrefaie

4 Answers


See here for a similar question. The answer there applies to you as well:

import re
print re.split(r'(\W)', "I want that one, it is great.")
print re.split(r'(\W)', "I don't.")

You can remove the spaces and empty strings returned by re.split using a filter:

s = "I want that one, it is great."
print filter(lambda _: _ not in [' ', ''], re.split(r'(\W)', s))
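A reusable form of the same idea, sketched under a few assumptions: the helper name `split_keep_punct` is made up for illustration, every whitespace-only piece is dropped (not just a single space), and `re.UNICODE` is passed so that `\W` is Unicode-aware for `unicode` input on Python 2 (Python 3 `str` patterns already behave that way). A list comprehension replaces `filter` because `filter` returns a lazy iterator on Python 3.

```python
import re

def split_keep_punct(text):
    """Split text into words and single non-word characters (hypothetical helper)."""
    # The capturing group makes re.split return the separators as well.
    parts = re.split(r"(\W)", text, flags=re.UNICODE)
    # Drop empty strings and whitespace-only pieces.
    return [p for p in parts if p.strip()]

print(split_keep_punct("I want that one, it is great."))
print(split_keep_punct("I don't."))
```

Note that `_` counts as a word character for `\w`, so a token such as `my_var` stays in one piece.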
Community
Greg Sadetsky

You can use a regex and a simple list comprehension to do this. The regex pulls out the words and separates the punctuation, and the list comprehension removes the blank strings.

import re
s = "I want that one, it is great. Don't do it."
new_s = [c.strip() for c in re.split(r'(\W+)', s) if c.strip() != '']
print new_s

The output of new_s will be:

['I', 'want', 'that', 'one', ',', 'it', 'is', 'great', '.', 'Don', "'", 't', 'do', 'it', '.']
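One caveat worth checking before relying on `\W+`: because it matches a whole run of non-word characters, adjacent punctuation marks come back as a single token. The comparison below is a small sketch of that difference; the function names `split_runs` and `split_single` are hypothetical and used only for illustration.

```python
import re

def split_runs(text):
    # \W+ captures each run of non-word characters as one piece.
    return [c.strip() for c in re.split(r"(\W+)", text) if c.strip()]

def split_single(text):
    # \W captures every non-word character on its own.
    return [c for c in re.split(r"(\W)", text) if c.strip()]

sample = 'He said, "No!"'
print(split_runs(sample))    # the comma and the opening quote stay together
print(split_single(sample))  # every punctuation mark is its own token
```

If each punctuation mark has to be a separate list element, the single-character pattern is probably the safer choice.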
Dan
In [70]: re.findall(r"[^,.:;' ]+|[,.:;']", "I want that one, it is great.")
Out[70]: ['I', 'want', 'that', 'one', ',', 'it', 'is', 'great', '.']

In [76]: re.findall(r"[^,.:;' ]+|[,.:;']", "I don't.")
Out[76]: ['I', 'don', "'", 't', '.']

The regex [^,.:;' ]+|[,.:;'] matches (1-or-more characters other than ,, ., :, ;, ' or a literal space), or (the literal characters ,, ., :, ;, or ').
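If the set of separator characters needs to change, the same two-branch pattern can be built from a string of punctuation. This is only a sketch: `make_tokenizer` and its `punctuation` parameter are made-up names, the string is assumed to be non-empty, and `\s` is used in place of the literal space so that tabs and newlines are skipped too.

```python
import re

def make_tokenizer(punctuation=",.:;'"):
    """Return a function that splits text into words and punctuation marks."""
    # Escape the characters so that ones like ']' or '-' are safe inside [...].
    escaped = re.escape(punctuation)
    # Branch 1: a run of characters that are neither punctuation nor whitespace.
    # Branch 2: exactly one punctuation character.
    pattern = re.compile(r"[^%s\s]+|[%s]" % (escaped, escaped))
    return pattern.findall

tokenize = make_tokenizer()
print(tokenize("I want that one, it is great."))
print(tokenize("I don't."))
```

Characters that are not listed in `punctuation` remain part of the surrounding word, so `a-b` stays one token with the default set.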


Or, using the third-party regex module (installable with pip install regex), you can extend this to all punctuation and symbols by using the [:punct:] character class:

In [77]: import regex

In Python2:

In [4]: regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u"""A \N{ARABIC SEMICOLON} B""")
Out[4]: [u'A', u'\u061b', u'B']

In [6]: regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u"""He said, "I don't!" """)
Out[6]: [u'He', u'said', u',', u'"', u'I', u'don', u"'", u't', u'!', u'"']

In Python3:

In [105]: regex.findall(r"[^[:punct:] ]+|[[:punct:]]", """A \N{ARABIC SEMICOLON} B""")
Out[105]: ['A', '؛', 'B']

In [83]: regex.findall(r"[^[:punct:] ]+|[[:punct:]]", """He said, "I don't!" """)
Out[83]: ['He', 'said', ',', '"', 'I', 'don', "'", 't', '!', '"']

Note that you must pass a unicode string as the second argument to regex.findall if you want [:punct:] to match unicode punctuation or symbols.

In Python2:

import regex
print(regex.findall(r"[^[:punct:] ]+|[[:punct:]]", 'help؛'))
print(regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u'help؛'))

prints

['help\xd8\x9b']
[u'help', u'\u061b']
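For readers who cannot install `regex`, a rough standard-library alternative is to classify each character with `unicodedata.category`. This is a different technique from the `[:punct:]` class above, not the same implementation: the function name `tokenize_unicode` and the `encoding` parameter are assumptions made for this sketch, and it treats every character whose category starts with `P` (punctuation) or `S` (symbol) as its own token. Byte strings are decoded first, which is the same fix as calling `.decode('utf-8')` by hand.

```python
import unicodedata

def tokenize_unicode(text, encoding="utf-8"):
    """Split text into words and single punctuation/symbol characters."""
    if isinstance(text, bytes):
        # Byte strings carry no character information; decode them first.
        text = text.decode(encoding)
    tokens, word = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in "PS":
            if word:                      # a word ends at the punctuation mark
                tokens.append("".join(word))
                word = []
            tokens.append(ch)             # the mark becomes its own token
        elif ch.isspace():
            if word:                      # whitespace ends a word and is dropped
                tokens.append("".join(word))
                word = []
        else:
            word.append(ch)
    if word:                              # flush the final word
        tokens.append("".join(word))
    return tokens

print(tokenize_unicode(u"help\u061b"))                  # text input
print(tokenize_unicode(u"help\u061b".encode("utf-8")))  # byte input, decoded first
```

On Python 2, `bytes` is an alias for `str`, so the same `isinstance` check covers the undecoded case described above.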
unutbu
  • Will the second solution work for non-English punctuation such as `،` and `؛`? – Mohamed Taher Alrefaie May 25 '16 at 19:12
  • Indeed it does. When applied to unicode, `[:punct:]` matches any unicode character in the [Punctuation \p{P} or Symbol \p{S} category](http://www.regular-expressions.info/unicode.html). See also http://www.regular-expressions.info/posixbrackets.html. – unutbu May 25 '16 at 19:21
  • This is going well. But the regex failed on this one `help؛`, the output was just the same string. Any ideas? – Mohamed Taher Alrefaie May 25 '16 at 19:41
  • Are you passing a `unicode` as the second argument to `regex.findall`? I've added an example above showing what happens when a `str` (such as `'help؛'`) is passed instead of a `unicode` (such as `u'help؛'`). If this is the problem, then you can fix it by decoding the `str` with the appropriate encoding (e.g. `'help؛'.decode('utf-8')` ) to make a `unicode`. – unutbu May 25 '16 at 19:50
  • That worked. Thank you very much. How could you be so knowledgeable, patient and funny (I read your "About me") all at the same time? – Mohamed Taher Alrefaie May 25 '16 at 20:35

I don't know of any functions that can do this but you could use a for loop.

Something like this:

def split_words(text):
    words = []   # finished tokens
    word = ""    # token currently being built
    for ch in text:
        if ch.isalnum():
            word += ch              # still inside a word
        else:
            if word:                # a word has just ended
                words.append(word)
                word = ""
            if ch != " ":
                words.append(ch)    # keep punctuation as its own token
    if word:
        words.append(word)          # flush the last word
    return words

Hope this works... sorry if it is not the best way.