If you want to split a string into a list of words and the string contains punctuation, it's probably advisable to remove it. For example, str.split() splits the following string as
s = "Hi, these are words; these're, also, words."
words = s.split()
# ['Hi,', 'these', 'are', 'words;', "these're,", 'also,', 'words.']
where 'Hi,', 'words;', 'also,' etc. have punctuation attached to them. Python has a built-in string module that has a string of punctuation characters as an attribute (string.punctuation).
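For reference, string.punctuation is a constant string of the ASCII punctuation characters, so printing it shows exactly what gets removed:

import string
print(string.punctuation)
# !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~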
One way to get rid of the punctuation is to simply strip it from each word:
import string
words = [w.strip(string.punctuation) for w in s.split()]
# ['Hi', 'these', 'are', 'words', "these're", 'also', 'words']
Another is to build a translation table that maps every punctuation character to None and apply it with str.translate:
table = str.maketrans('', '', string.punctuation)
words = s.translate(table).split()
# ['Hi', 'these', 'are', 'words', 'thesere', 'also', 'words']
The translate approach doesn't handle words like these're well (it becomes thesere), so to handle that case nltk.word_tokenize could be used, as tgray suggested. Then filter out the tokens that consist entirely of punctuation:
import nltk
words = [w for w in nltk.word_tokenize(s) if w not in string.punctuation]
# ['Hi', 'these', 'are', 'words', 'these', "'re", 'also', 'words']
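Note that word_tokenize needs NLTK's tokenizer data, so you may have to download it once first (e.g. nltk.download('punkt')). Also, w not in string.punctuation only catches single-character tokens; if the tokenizer ever emits multi-character punctuation tokens (such as ... or ``), a stricter filter checks every character. This is a small variation on the above, not part of the original answer:

import string
import nltk

# nltk.download('punkt')  # uncomment on first use to fetch the tokenizer data
words = [w for w in nltk.word_tokenize(s)
         if not all(c in string.punctuation for c in w)]
# ['Hi', 'these', 'are', 'words', 'these', "'re", 'also', 'words']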