Converting a String to a List of Words?

Question

I'm trying to convert a string to a list of words using Python. I want to take something like the following:

string = 'This is a string, with words!'

Then convert it to something like this:

list = ['This', 'is', 'a', 'string', 'with', 'words']

Notice the omission of punctuation and spaces. What would be the fastest way of going about this?

score 108 · Answer 1 · answered Dec 06 '12 at 00:22

I think this is the simplest way for anyone else stumbling on this post given the late response:

>>> string = 'This is a string, with words!'
>>> string.split()
['This', 'is', 'a', 'string,', 'with', 'words!']

gilgamar

27

You need to separate and eliminate the punctuation from the words (e.g., "string," and "words!"). As is, this does not meet OP's requirements. – Levon Dec 06 '12 at 00:31

score 106 · Accepted Answer · edited Apr 11 '18 at 15:51

Try this:

import re

mystr = 'This is a string, with words!'
wordList = re.sub(r"[^\w]", " ", mystr).split()

How it works:

From the docs:

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function.

So in our case:

The pattern [^\w] matches any character that is not a word character. \w matches letters, digits and the underscore. For ASCII text (or with the re.ASCII flag) it is equivalent to the character set [a-zA-Z0-9_]; in Python 3 it also matches Unicode letters and digits by default.

We match every non-word character and replace it with a space, then call split(), which splits the string on whitespace and returns a list.

So 'hello-world' becomes 'hello world' after re.sub, and then ['hello', 'world'] after split().
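
As a minimal sketch of these two steps (the names spaced, words, unicode_words and ascii_words are only for illustration), including how \w treats non-ASCII letters in Python 3:

```python
import re

mystr = 'This is a string, with words!'

# Step 1: replace every non-word character with a space.
spaced = re.sub(r"[^\w]", " ", mystr)
# Step 2: split on runs of whitespace.
words = spaced.split()
print(words)  # ['This', 'is', 'a', 'string', 'with', 'words']

# In Python 3, \w matches Unicode letters by default;
# re.ASCII restricts it to [a-zA-Z0-9_].
sample = "na\u00efve caf\u00e9"  # "naive cafe" written with accented letters
unicode_words = re.sub(r"[^\w]", " ", sample).split()
ascii_words = re.sub(r"[^\w]", " ", sample, flags=re.ASCII).split()
print(unicode_words)  # the two accented words, intact
print(ascii_words)    # ['na', 've', 'caf']
```

Whether the accented letters should count as part of a word depends on the input, so the flag is a choice rather than a fix.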

Let me know if any doubts come up.

Daniel Sam

answered May 31 '11 at 00:13

Bryan

Remember to handle apostrophes and hyphens, too, since they're not included in `\w`. – Brōtsyorfuzthrāx Jul 30 '14 at 05:29
2

You may want to handle formatted apostrophes and non-breaking hyphens, too. – Brōtsyorfuzthrāx Jul 30 '14 at 05:57
string.split() is much easier – Ege Apr 19 '21 at 16:33

score 38 · Answer 3 · answered May 31 '11 at 00:15

To do this properly is quite complex. For your research, it is known as word tokenization. You should look at NLTK if you want to see what others have done, rather than starting from scratch:

>>> import nltk
>>> paragraph = "Hi, this is my first sentence. And this is my second."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> for sentence in sentences:
...     nltk.word_tokenize(sentence)
...
['Hi', ',', 'this', 'is', 'my', 'first', 'sentence', '.']
['And', 'this', 'is', 'my', 'second', '.']

score 21 · Answer 4 · answered May 31 '11 at 02:19

The simplest way:

>>> import re
>>> string = 'This is a string, with words!'
>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'string', 'with', 'words']
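
Since the question asks about speed, here is a rough way to compare this approach with the re.sub-and-split approach from the accepted answer, using the standard timeit module. The helper names, the sample text and the repeat count are arbitrary choices for this sketch. Timings vary by machine and Python version, so none are shown.

```python
import re
import timeit

text = 'This is a string, with words!' * 100

# Precompiling avoids the pattern-cache lookup on each call.
WORD_RE = re.compile(r'\w+')
NON_WORD_RE = re.compile(r'[^\w]')

def with_findall(s):
    return WORD_RE.findall(s)

def with_sub_split(s):
    return NON_WORD_RE.sub(' ', s).split()

# Both approaches should agree before their timings are compared.
assert with_findall(text) == with_sub_split(text)

for func in (with_findall, with_sub_split):
    seconds = timeit.timeit(lambda: func(text), number=1000)
    print(f'{func.__name__}: {seconds:.4f} s for 1000 runs')
```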

JBernardo

score 15 · Answer 5 · edited May 31 '11 at 00:29

Using string.punctuation for completeness:

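import re
import string

s = 'This is a string, with words!'
# re.escape keeps characters such as ], \ and ^ from being read as regex syntax
x = re.sub('[' + re.escape(string.punctuation) + ']', '', s).split()

This handles newlines as well, because split() with no arguments splits on any whitespace.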
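
Note that removing punctuation, rather than replacing it with a space, joins words that are separated only by punctuation. A small sketch of the difference (the sample string and the names punct_class, removed and replaced are made up for illustration):

```python
import re
import string

s = 'hello-world, again'
punct_class = '[' + re.escape(string.punctuation) + ']'

# Deleting punctuation glues 'hello' and 'world' together.
removed = re.sub(punct_class, '', s).split()
# Replacing it with a space keeps them apart.
replaced = re.sub(punct_class, ' ', s).split()

print(removed)   # ['helloworld', 'again']
print(replaced)  # ['hello', 'world', 'again']
```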

answered May 31 '11 at 00:24

mtrw

score 9 · Answer 6 · edited May 31 '11 at 00:26

Well, you could use

import re
list = re.sub(r'[.!,;?]', ' ', string).split()

Note that list is the name of a built-in type and string is the name of a standard-library module, so you probably don't want to use those as your variable names.

martineau

answered May 31 '11 at 00:10

Cameron

score 6 · Answer 7 · answered Jun 08 '17 at 09:55

Inspired by @mtrw's answer, but improved to strip out punctuation at word boundaries only:

import re
import string

def extract_words(s):
    # re.escape keeps punctuation characters from being read as regex syntax
    pattern = '^[{0}]+|[{0}]+$'.format(re.escape(string.punctuation))
    return [re.sub(pattern, '', w) for w in s.split()]

>>> text = 'This is a string, with words!'
>>> extract_words(text)
['This', 'is', 'a', 'string', 'with', 'words']

>>> text = '''I'm a custom-built sentence with "tricky" words like https://stackoverflow.com/.'''
>>> extract_words(text)
["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words', 'like', 'https://stackoverflow.com']

score 4 · Answer 8 · answered May 18 '18 at 05:47

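Personally, I think this is slightly cleaner than the answers provided:

import re

def split_to_words(sentence):
    # re.split leaves empty strings when the text starts or ends with a non-word character; filter them out
    return list(filter(lambda w: len(w) > 0, re.split(r'\W+', sentence)))  # use sentence.lower() if needed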

Akhil Cherian Verghese

score 3 · Answer 9 · answered May 31 '11 at 00:14

A regular expression for words would give you the most control. You would want to carefully consider how to deal with words with dashes or apostrophes, like "I'm".
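
As a rough sketch of such a pattern, assuming a word is a run of word characters that may contain single internal apostrophes or hyphens (the name WORD_RE and this definition of a word are assumptions for illustration, not a general tokenizer):

```python
import re

# \w+ followed by any number of (apostrophe-or-hyphen + \w+) groups.
# \u2019 is the typographic apostrophe and \u2011 the non-breaking hyphen.
WORD_RE = re.compile(r"\w+(?:['\u2019\u2011-]\w+)*")

text = "I'm a custom-built sentence, isn't it?"
print(WORD_RE.findall(text))
# ["I'm", 'a', 'custom-built', 'sentence', "isn't", 'it']

# Quotes around a word are dropped because they are not followed or preceded by \w.
print(WORD_RE.findall("some 'quoted' words"))
# ['some', 'quoted', 'words']
```

Requiring a word character on both sides of the apostrophe or hyphen is what keeps leading and trailing punctuation out of the result.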

tofutim

score 1 · Answer 10 · edited Aug 11 '15 at 15:24

# Same result as mystr.split(" "): punctuation stays attached to the words
words = mystr.split(" ", mystr.count(" "))

josliber

answered Aug 11 '15 at 15:14

sanchit

BenyaR · Answer 11 · 2017-08-12T18:32:07.190

This way you eliminate every special char outside of the alphabet:

def wordsToList(strn):
    L = strn.split()
    cleanL = []
    abc = 'abcdefghijklmnopqrstuvwxyz'
    ABC = abc.upper()
    letters = abc + ABC
    for e in L:
        word = ''
        for c in e:
            if c in letters:
                word += c
        if word != '':
            cleanL.append(word)
    return cleanL

s = 'She loves you, yea yea yea! '
L = wordsToList(s)
print(L)  # ['She', 'loves', 'you', 'yea', 'yea', 'yea']

I'm not sure if this is fast or optimal or even the right way to program.
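
For comparison, the same letter-by-letter filtering can be sketched more compactly with string.ascii_letters and comprehensions. The function name words_to_list is made up for this sketch. Like the version above, it drops digits and removes apostrophes from inside words.

```python
import string

def words_to_list(text):
    """Keep only ASCII letters in each whitespace-separated chunk; drop empty results."""
    cleaned = (
        ''.join(c for c in chunk if c in string.ascii_letters)
        for chunk in text.split()
    )
    return [word for word in cleaned if word]

print(words_to_list('She loves you, yea yea yea! '))
# ['She', 'loves', 'you', 'yea', 'yea', 'yea']
```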

score 1 · Answer 12 · answered Feb 04 '22 at 12:43

def split_string(string):
    return string.split()

This function will return the list of words of a given string. In this case, if we call the function as follows,

string = 'This is a string, with words!'
split_string(string)

The function returns:

['This', 'is', 'a', 'string,', 'with', 'words!']
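
The punctuation is still attached to 'string,' and 'words!'. One way to remove it without regular expressions is to strip punctuation from both ends of each piece. This is a minimal sketch; the function name split_and_strip is hypothetical.

```python
import string

def split_and_strip(text):
    # str.strip removes the listed characters from both ends only,
    # so internal punctuation (as in "I'm") is left alone.
    stripped = (word.strip(string.punctuation) for word in text.split())
    # Drop pieces that consisted of punctuation only.
    return [word for word in stripped if word]

print(split_and_strip('This is a string, with words!'))
# ['This', 'is', 'a', 'string', 'with', 'words']
```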

score 0 · Answer 13 · answered May 28 '15 at 06:30

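This is from my attempt at a coding challenge where regex wasn't allowed:

# split() with no argument avoids the empty strings that split(' ') leaves between consecutive spaces
outputList = "".join((c if c.isalnum() or c == "'" else ' ') for c in inputStr).split()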

The role of apostrophe seems interesting.

guest201505281433

score 0 · Answer 14 · answered Mar 15 '21 at 20:03

Probably not very elegant, but at least you know what's going on.

my_str = "Simple sample, test! is, olny".lower()
my_lst = []
temp = ""
for ch in my_str:
    if ch in [',', '.', '!', '(', ')', ':', ';', '-']:
        continue  # skip punctuation
    if ch == ' ':
        # keep only words longer than 3 characters
        if len(temp) > 3:
            my_lst.append(temp)
        temp = ""  # reset for the next word, even when this one was too short
    else:
        temp += ch
# handle the last word, which is not followed by a space
if len(temp) > 3:
    my_lst.append(temp)
print(my_lst)

What's the point of this solution if there exists a more optimal solution? — Coddy, Feb 21 '22 at 23:53

score -1 · Answer 15 · edited Jun 08 '17 at 09:06

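You can try this (in Python 3, maketrans is a static method of str; string.maketrans exists only in Python 2):

text = "This is a string, with words!"
tryTrans = str.maketrans(",!", "  ")  # map ',' and '!' to spaces
text = text.translate(tryTrans)
listOfWords = text.split()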
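
To cover every ASCII punctuation character rather than just the comma and the exclamation mark, the translation table can be built from string.punctuation. A minimal sketch, assuming Python 3; the names table and words are only for illustration:

```python
import string

text = "This is a string, with words!"

# Map each ASCII punctuation character to a space.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
words = text.translate(table).split()
print(words)  # ['This', 'is', 'a', 'string', 'with', 'words']
```

Because apostrophes and hyphens are also punctuation, this splits "I'm" into 'I' and 'm'.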

Paulo Freitas

answered Aug 12 '13 at 13:49

user2675185
