2

I want to split a string into words while also keeping the symbols as separate items.

s = "Hello world. This-is-foo! I love you"

The output should be:

out: ["Hello", "world", ".", "This", "-", "is", "-", "foo", "!", "I", "love", "you"]

I tried:

re.split(r'(\W)', s)

But this is the output:

['Hello',
 ' ',
 'world',
 '.',
 '',
 ' ',
 'This',
 '-',
 'is',
 '-',
 'foo',
 '!',
 '',
 ' ',
 'I',
 ' ',
 'love',
 ' ',
 'you']

As you can see, the spaces are still there. How can I solve this?

user12195705

5 Answers

3

You can use this regex with re.findall in Python:

>>> import re
>>> s = "Hello world. This-is-foo! I love you"
>>> print(re.findall(r'\w+|[^\s\w]+', s))
['Hello', 'world', '.', 'This', '-', 'is', '-', 'foo', '!', 'I', 'love', 'you']

RegEx Details:

  • \w+: Match 1 or more word characters
  • |: OR
  • [^\s\w]+: Match 1 or more non-word and non-whitespace characters
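
As a minimal sketch of the details above, the pattern can be compiled once and wrapped in a small helper. The names `TOKEN_RE` and `tokenize` and the sample string are made up for illustration. Note that `\w` also covers digits and the underscore, so identifiers such as `snake_case` stay in one piece.

```python
import re

# Hypothetical helper; the pattern itself is the one described above.
TOKEN_RE = re.compile(r"\w+|[^\s\w]+")

def tokenize(text):
    """Return words and runs of symbols, skipping whitespace."""
    return TOKEN_RE.findall(text)

print(tokenize("snake_case x2, ok?"))
# ['snake_case', 'x2', ',', 'ok', '?']
```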
anubhava
2

This regex should work:

re.findall(r'\w+|\S', s)

This matches either a run of word characters or a single non-whitespace character.

Code:

import re
s = "Hello world. This-is-foo! I love you"
print(re.findall(r"\w+|[^\w\s]+", s))

Output:

['Hello', 'world', '.', 'This', '-', 'is', '-', 'foo', '!', 'I', 'love', 'you']
Nandu Raj
1

You can match the words \w+ or the non-words \W+ (notice the uppercase):

import re

s = "Hello world. This-is-foo! I love you"

print(re.findall(r"\w+|\W+", s))

You get:

['Hello', ' ', 'world', '. ', 'This', '-', 'is', '-', 'foo', '! ', 'I', ' ', 'love', ' ', 'you']

EDIT

If you want to leave out the whitespace, you can do:

import re

s = "Hello world. This-is-foo! I love you"

print(re.findall(r"\w+|[^\w\s]+", s))

You get:

['Hello', 'world', '.', 'This', '-', 'is', '-', 'foo', '!', 'I', 'love', 'you']
Laurent LAPORTE
1

Match all words and non-whitespace characters:

re.findall(r'\w+|\S', s)
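
A short sketch of how this pattern differs from `\w+|[^\w\s]+`: the two agree on the string from the question, but `\S` matches one character at a time, so consecutive symbols come out as separate items. The sample string here is made up for illustration.

```python
import re

s = "Wait... what?!"  # made-up sample with consecutive symbols

single = re.findall(r"\w+|\S", s)         # one symbol per item
grouped = re.findall(r"\w+|[^\w\s]+", s)  # runs of symbols kept together

print(single)   # ['Wait', '.', '.', '.', 'what', '?', '!']
print(grouped)  # ['Wait', '...', 'what', '?!']
```

Which behavior is preferable depends on whether "..." should count as one token or three.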
AlexMTX
0

You can keep your re.split approach and filter the result with a list comprehension. The filter has to drop the empty strings as well as the spaces:
tokens = [x for x in re.split(r'(\W)', s) if x.strip()]
Testing this solution with the %%timeit magic showed that it is almost as fast as the most popular answer.
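
Outside a notebook, a comparison along these lines can be sketched with the standard `timeit` module. The function names and the repetition count are assumptions for illustration, and the timings will vary by machine, so no figures are shown. The filter uses `x.strip()`, which is falsy for both `''` and `' '`, so the empty strings produced by `re.split` are dropped too.

```python
import re
import timeit

s = "Hello world. This-is-foo! I love you"

def with_findall():
    return re.findall(r"\w+|[^\w\s]+", s)

def with_split():
    # x.strip() is falsy for '' and for whitespace-only strings
    return [x for x in re.split(r"(\W)", s) if x.strip()]

# Both approaches should produce the same tokens for this input.
assert with_findall() == with_split()

for fn in (with_findall, with_split):
    seconds = timeit.timeit(fn, number=10_000)
    print(f"{fn.__name__}: {seconds:.4f} s for 10,000 calls")
```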

Ivan Calderon