I want to split a sentence on multiple delimiters:

.?!\n

However, I want to keep the comma attached to the word it follows. For example, for the string

'Hi, How are you?'

I want the result

['Hi,', 'How', 'are', 'you', '?']

I tried the following, but I am not getting the required result:

words = re.findall(r"\w+|\W+", text)
timgeb
user3535492
  • I think the clue might be in your question - try using `re.split`, e.g. `re.split(r'\s+', text)`? – AChampion Feb 28 '16 at 00:40
  • Are you looking to keep only commas attached to each word? What are your criteria for punctuation? When do you split and when do you not? – idjaw Feb 28 '16 at 00:41
  • If you want to keep the comma, maybe you can try this: re.findall(r"\w+[,]*", t) – ar-ms Feb 28 '16 at 00:45
  • I want to split the sentence on whitespace. Since a comma is attached to the word "hi", it should be displayed along with "hi". However, delimiters like "." "!" "?" and newline, which occur at the end of the sentence, should be split off and treated as words of their own. – user3535492 Feb 28 '16 at 00:47
  • That information should be in your question. – idjaw Feb 28 '16 at 00:49
  • words = re.split(r'\s+', text) gives the following as the output: ['hi,', 'how', 'are', 'you?'] It doesn't split the "?" at the end. – user3535492 Feb 28 '16 at 00:50

3 Answers

Use re.split with a capturing group so that the delimiters are kept in the result, then filter out the strings that are empty or contain only whitespace.

>>> import re
>>> s = 'Hi, How are you?'
>>> [x for x in re.split(r'(\s|!|\.|\?|\n)', s) if x.strip()]
['Hi,', 'How', 'are', 'you', '?']
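
The same idea can be wrapped in a small function. This is only a sketch: the name split_keep_terminators is made up for illustration, and the pattern uses a character class instead of the alternation above. Since \s already matches a newline, the separate \n alternative is not needed.

```python
import re

def split_keep_terminators(text):
    """Split on whitespace; keep '.', '!' and '?' as separate tokens.

    Hypothetical helper name, not part of the re module.
    """
    # The capturing group makes re.split return the delimiters too.
    parts = re.split(r'(\s|[.!?])', text)
    # Drop empty strings and pieces that are only whitespace.
    return [p for p in parts if p.strip()]

print(split_keep_terminators('Hi, How are you?'))
```

Commas stay attached to the preceding word simply because the comma is not one of the delimiters.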
timgeb

If using re.findall:

>>> ss = """
... Hi, How are
...
... yo.u
... do!ing?
... """
>>> [w for w in re.findall(r'(\w+,?|[.?!]?)?\s*', ss) if w]
['Hi,', 'How', 'are', 'yo', '.', 'u', 'do', '!', 'ing', '?']
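
A possible variation on the findall approach, offered as a sketch rather than as a replacement for the pattern above: match either a run of characters that are neither whitespace nor one of ".", "!", "?", or a single one of those three characters. No filtering step is needed, because nothing in the pattern can match an empty string. The function name tokenize is an assumption made for this example.

```python
import re

def tokenize(text):
    """Return words (with attached commas) and sentence terminators.

    Hypothetical helper; the pattern is an alternative to the one above.
    """
    # First alternative: one or more characters that are not whitespace
    # and not a terminator. Second alternative: a single terminator.
    return re.findall(r'[^\s.!?]+|[.!?]', text)

sample = "\nHi, How are\n\nyo.u\ndo!ing?\n"
print(tokenize(sample))
```

Unlike \w+,? this also keeps other attached punctuation, such as an apostrophe inside a word, so whether it fits depends on what should count as part of a word.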
Quinn

You can use:

re.findall(r'(.*?)([\s\.\?!\n])', text)

With a bit of itertools magic and list comprehensions:

[i.strip() for i in itertools.chain.from_iterable(re.findall(r'(.*?)([\s\.\?!\n])', text)) if i.strip()]

And a somewhat more readable version:

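import itertools
import re

words = []
# findall returns (token, delimiter) tuples; chain flattens them into one sequence
found = itertools.chain.from_iterable(re.findall(r'(.*?)([\s\.\?!\n])', text))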
for i in found:
    w = i.strip()
    if w:
        words.append(w)
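
One caveat with this pattern: every match must end in a delimiter, so a final word that is not followed by whitespace or punctuation is silently dropped. A hedged sketch of one way around that is to also accept the end of the string as a delimiter. The |$ alternative and the function name tokens_via_findall are additions for this example, not part of the pattern above.

```python
import itertools
import re

def tokens_via_findall(text):
    """Flatten (token, delimiter) pairs and drop blank pieces.

    Hypothetical helper; "|$" lets the last token match even when
    no delimiter follows it.
    """
    pairs = re.findall(r'(.*?)([\s.?!]|$)', text)
    flat = itertools.chain.from_iterable(pairs)
    # strip() removes whitespace delimiters; empty results are skipped.
    return [t.strip() for t in flat if t.strip()]

print(tokens_via_findall('Hi, How are you'))
print(tokens_via_findall('Hi, How are you?'))
```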
hruske
  • With re.findall('(.*?)([\s\.\?!\n])', text), I get the following output: [('hi,', ' '), ('how', ' '), ('are', ' '), ('you', '?')] – user3535492 Feb 28 '16 at 00:56
  • Yes, you have to filter the output next, but see timgeb's answer for a much nicer version. – hruske Feb 28 '16 at 00:57