20

I'm trying to divide a string into words, removing spaces and punctuation marks.

I tried using the split() method, passing all the punctuation at once, but my results were incorrect:

>>> test='hello,how are you?I am fine,thank you. And you?'
>>> test.split(' ,.?')
['hello,how are you?I am fine,thank you. And you?']

I actually know how to do this with regexes already, but I'd like to figure out how to do it using split(). Please don't give me a regex solution.
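For context, `str.split` treats its argument as one whole separator string, not as a set of characters, which is why the call above returns the input unchanged. A minimal sketch of that behavior (the second input string is made up purely for illustration):

```python
test = 'hello,how are you?I am fine,thank you. And you?'

# ' ,.?' is matched as a single four-character substring.
# It never occurs in `test`, so nothing is split.
unsplit = test.split(' ,.?')

# The same separator does split a string in which it occurs verbatim.
pieces = 'a ,.?b ,.?c'.split(' ,.?')

print(unsplit)
print(pieces)
```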

jscs
leisurem
  • 2
    So you insist on using a wrench to drive a nail, while the hammer is at hand. Why? – Sven Marnach Mar 21 '12 at 01:24
  • Without meaning any disrespect to the OP I think there should be a tag for these kind of questions in which the adequate tool is snubbed for whatever reason (sometimes valid), they come up from time to time. Perhaps `luddism`? – Eduardo Ivanec Mar 21 '12 at 01:35
  • try C# "hello,how are you?I am fine,thank you. And you?".Split(",? .".ToCharArray(), StringSplitOptions.RemoveEmptyEntries); – Ray Cheng Mar 21 '12 at 01:43
  • 8
    Don't let anyone discourage you from exploring non-regex approaches for simple text manipulation. Using string methods, itertools.groupby, and actually writing functions (!), some of us manage to get by almost never using regexes, and in exchange for a few more keystrokes we get to write nice, clean, easy-to-debug Python. – DSM Mar 21 '12 at 01:56
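As a rough illustration of the `itertools.groupby` idea mentioned in the last comment, here is one possible sketch. It assumes that "word" means a run of alphabetic characters, which is an assumption and not something the comment specifies:

```python
from itertools import groupby

test = 'hello,how are you?I am fine,thank you. And you?'

# Group consecutive characters by whether they are alphabetic,
# then keep only the alphabetic runs.
words = [
    ''.join(run)
    for is_alpha, run in groupby(test, key=str.isalpha)
    if is_alpha
]
print(words)
```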

7 Answers

24

If you want to split a string based on multiple delimiters, as in your example, you're going to need to use the re module despite your bizarre objections, like this:

>>> import re
>>> re.split('[?.,]', test)
['hello', 'how are you', 'I am fine', 'thank you', ' And you', '']

It's possible to get a similar result using split, but you need to call split once for every character, and you need to iterate over the results of the previous split. This works but it's u-g-l-y:

>>> sum([z.split() 
... for z in sum([y.split('?') 
... for y in sum([x.split('.') 
... for x in test.split(',')],[])], [])], [])
['hello', 'how', 'are', 'you', 'I', 'am', 'fine', 'thank', 'you', 'And', 'you']

This uses sum() to flatten the list returned by the previous iteration.

larsks
  • Please don't use `sum()` to flatten lists of lists -- [it's the wrong tool for this purpose](http://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python/952952#952952). In this particular case even more so, since a [single list comprehension using a nested loop](http://ideone.com/xEXX7) would eliminate the necessity to flatten in the first place. – Sven Marnach Mar 21 '12 at 12:39
  • You are more than welcome to post an alternate solution if you believe it to be more suitable to the problem. – larsks Mar 21 '12 at 13:04
  • As long as the OP does not explain why `re` shouldn't be used, I won't post an answer, since I don't understand the purpose of the question yet. The second link in my last comment shows an alternate solution, though. – Sven Marnach Mar 21 '12 at 13:25
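One possible reading of the "single list comprehension using a nested loop" suggestion is sketched below. The linked snippet is not reproduced here, so this is an approximation of the idea, not the commenter's exact code:

```python
test = 'hello,how are you?I am fine,thank you. And you?'

# Each `for` clause splits the pieces produced by the previous one,
# so the result is already flat and no sum() is needed.
words = [
    w
    for x in test.split(',')
    for y in x.split('.')
    for z in y.split('?')
    for w in z.split()
]
print(words)
```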
20

This is the best way I can think of without using the re module:

"".join((char if char.isalpha() else " ") for char in test).split()
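To make the one-liner easier to follow, it can be unpacked into two steps. The function name `split_alpha_words` is hypothetical and used only for this sketch:

```python
def split_alpha_words(text):
    # Step 1: replace every non-alphabetic character with a space.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text)
    # Step 2: split() with no argument collapses runs of whitespace.
    return cleaned.split()


test = 'hello,how are you?I am fine,thank you. And you?'
words = split_alpha_words(test)
print(words)
```

Note that digits and apostrophes also count as separators here, so `"I'm"` would come back as `['I', 'm']`.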
Elias Zamaria
10

Since you don't want to use the re module, you can use this:

 test.replace(',',' ').replace('.',' ').replace('?',' ').split()
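If the list of delimiters grows, chaining `replace` calls gets long. A sketch of the same idea using `str.translate`, assuming Python 3 (where `str.maketrans` accepts a dict):

```python
test = 'hello,how are you?I am fine,thank you. And you?'

# Map each delimiter character to a space in a single pass.
table = str.maketrans({c: ' ' for c in ',.?'})
words = test.translate(table).split()
print(words)
```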
Thanasis Petsas
  • test='hello,how are you?I am fine,thank you. And you?'
    for x in test:
        if not x.isalpha(): test = test.replace(x, ' ')
    test = test.split()
    print(test)
    – leisurem Mar 23 '12 at 06:07
7

A modified version of larsks' answer, where you don't need to type all punctuation characters yourself:

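>>> import re, string
>>> # re.escape keeps characters such as ']' and '\' literal inside the character class
>>> re.split("[" + re.escape(string.punctuation) + "]+", test)
['hello', 'how are you', 'I am fine', 'thank you', ' And you', '']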
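The result above still contains multi-word pieces and a trailing empty string. A hedged variation that also treats whitespace as a delimiter and drops empty pieces might look like this:

```python
import re
import string

test = 'hello,how are you?I am fine,thank you. And you?'

# Treat punctuation and whitespace alike as delimiters.
pattern = "[" + re.escape(string.punctuation) + r"\s]+"

# Discard the empty string left behind by the trailing '?'.
words = [w for w in re.split(pattern, test) if w]
print(words)
```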
MERose
6

You can write a function to extend usage of .split():

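def multi_split(s, separators):
    buf = [s]
    for sep in separators:
        # Rebuild the list on each pass; mutating buf while enumerating it
        # can skip an element when a piece splits into nothing.
        buf = [piece for text in buf for piece in text.split(sep) if piece]
    return buf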

And try it:

>>> multi_split('hello,how are you?I am fine,thank you. And you?', ' ,.?')
['hello', 'how', 'are', 'you', 'I', 'am', 'fine', 'thank', 'you', 'And', 'you']

This will be clearer and can be used in other situations.
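As an example of reuse, a helper of this kind can be combined with `string.punctuation` and `string.whitespace` so the delimiters need not be typed out. The helper below rebuilds the list on each pass, which is an assumption about the intended behavior, and the sample sentence is made up:

```python
import string


def multi_split(s, separators):
    # Split on each single-character separator in turn, dropping empty pieces.
    pieces = [s]
    for sep in separators:
        pieces = [p for text in pieces for p in text.split(sep) if p]
    return pieces


delimiters = string.punctuation + string.whitespace
words = multi_split("Wait... what? Yes, really!", delimiters)
print(words)
```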

Reorx
0

Apologies for necroing - this thread comes up as the first result for non-regex splitting of a sentence. Seeing as I had to come up with a non Python-specific method for my students, and that this thread didn't answer my question, I thought I would share just in case.

The point of the code is to use no libraries (and it's quick on large files):

sentence = "George Bernard-Shaw was a fine chap, I'm sure - who can really say?"
alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
words = []
word = ""
mode = 0
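for ch in sentence:
    if mode == 1:
        # A separator was seen: flush the current word, skipping empty
        # strings produced by consecutive separators such as ", ".
        if word:
            words.append(word)
        word = ""
        mode = 0
    if ch in alpha or ch == "'" or ch == "-":
        word += ch
    else:
        mode = 1
# Flush the final word, if any.
if word:
    words.append(word)
print(words)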

Output:

['George', 'Bernard-Shaw', 'was', 'a', 'fine', 'chap', "I'm", 'sure', '-', 'who', 'can', 'really', 'say']

I have literally just written this in about half an hour, so I'm sure the logic could be cleaned up. I also acknowledge that it may need additional logic to handle caveats such as hyphens correctly, since their use is less consistent than that of something like an inverted comma. Is there any module, indeed, that can do this correctly anyway?
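One way the logic might be tidied up is sketched below. The function name `split_words` and the `extra` parameter are hypothetical, and `str.isalpha` is swapped in for the hand-written alphabet, so non-ASCII letters are accepted as well:

```python
def split_words(sentence, extra="'-"):
    words = []
    current = []
    for ch in sentence:
        if ch.isalpha() or ch in extra:
            # Still inside a word: keep collecting characters.
            current.append(ch)
        elif current:
            # Hit a separator: flush the word collected so far.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


sentence = "George Bernard-Shaw was a fine chap, I'm sure - who can really say?"
words = split_words(sentence)
print(words)
```

As in the loop above, a free-standing hyphen is still returned as its own "word", so that caveat remains.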

0

A simple way to keep punctuation or other delimiters is:

import re

test='hello,how are you?I am fine,thank you. And you?'

re.findall('[^.?,]+.?', test)

Result:

['hello,', 'how are you?', 'I am fine,', 'thank you.', ' And you?']

Maybe this can help someone.

kareem20