636

How do I split a sentence and store each word in a list? e.g.

"these are words"   ⟶   ["these", "are", "words"]

To split on other delimiters, see Split a string by a delimiter in python.

To split into individual characters, see How do I split a string into a list of characters?.

cottontail
Thanx
    As it is, you will be printing the full list of words for each word in the list. I think you meant to use `print(word)` as your last line. – tgray Apr 13 '09 at 14:08

10 Answers

545

Given a string sentence, this stores each word in a list called words:

words = sentence.split()
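
For example, with the string from the question:

>>> sentence = "these are words"
>>> sentence.split()
['these', 'are', 'words']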
Mateen Ulhaq
nstehr
478

To split the string text on any consecutive runs of whitespace:

words = text.split()      

To split the string text on a custom delimiter such as ",":

words = text.split(",")   

The words variable will be a list and contain the words from text split on the delimiter.
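
A quick illustration of the custom-delimiter case, using an example value for text:

>>> text = "red,green,blue"
>>> text.split(",")
['red', 'green', 'blue']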

Mateen Ulhaq
zalew
92

Use str.split():

Return a list of the words in the string, using sep as the delimiter ... If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.

>>> line = "a sentence with a few words"
>>> line.split()
['a', 'sentence', 'with', 'a', 'few', 'words']
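
The difference from passing an explicit separator shows up with leading, trailing, or repeated whitespace, for example:

>>> "  a sentence  ".split()      # runs of whitespace collapsed, no empty strings
['a', 'sentence']
>>> "  a sentence  ".split(' ')   # every single space is a separator
['', '', 'a', 'sentence', '', '']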
wovano
gimel
61

Depending on what you plan to do with your sentence-as-a-list, you may want to look at the Natural Language Toolkit (NLTK). It deals heavily with text processing and evaluation. You can also use it to solve your problem:

import nltk
words = nltk.word_tokenize(raw_sentence)

This has the added benefit of splitting out punctuation.

Example:

>>> import nltk
>>> s = "The fox's foot grazed the sleeping dog, waking it."
>>> words = nltk.word_tokenize(s)
>>> words
['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', ',', 
'waking', 'it', '.']

This allows you to filter out any punctuation you don't want and use only words.

Please note that the other solutions using str.split() are better if you don't plan on doing any complex manipulation of the sentence.
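
Note that word_tokenize depends on NLTK's Punkt tokenizer models; if they are not installed yet, a one-time download along these lines is usually needed (the exact resource name can vary between NLTK versions):

import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize; newer releases may name this 'punkt_tab'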

tgray
    `split()` relies on whitespace as the separator, so it will fail to separate hyphenated words, and long-dash separated phrases will fail to split too. And if the sentence contains any punctuation without spaces, that punctuation will stay stuck to the adjacent words. For any real-world text parsing (like for this comment), your nltk suggestion is much better than `split()`. – hobs Dec 14 '11 at 13:10
    Potentially useful, although I wouldn't characterise this as splitting into "words". By any plain English definition, `','` and `"'s"` are not words. Normally, if you wanted to split the sentence above into "words" in a punctuation-aware way, you'd want to strip out the comma and get `"fox's"` as a single word. – Mark Amery Jan 25 '16 at 17:52
    Python 2.7+ as of April 2016. – AnneTheAgile Sep 20 '16 at 20:57
38

How about this algorithm? Split text on whitespace, then trim punctuation. This carefully removes punctuation from the edge of words, without harming apostrophes inside words such as we're.

>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"

>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]

>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']
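
One refinement worth considering: a token that is nothing but punctuation (a spaced-out dash, say) strips down to an empty string, which can then be filtered out:

>>> [w.strip(string.punctuation) for w in "Wait -- what?".split() if w.strip(string.punctuation)]
['Wait', 'what']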
Colonel Panic
    Nice, but some English words truly contain trailing punctuation. For example, the trailing dots in `e.g.` and `Mrs.`, and the trailing apostrophe in the possessive `frogs'` (as in `frogs' legs`) are part of the word, but will be stripped by this algorithm. Handling abbreviations correctly can be *roughly* achieved by detecting dot-separated initialisms plus using a dictionary of special cases (like `Mr.`, `Mrs.`). Distinguishing possessive apostrophes from single quotes is dramatically harder, since it requires parsing the grammar of the sentence in which the word is contained. – Mark Amery Jan 29 '16 at 00:02
    @MarkAmery You're right. It's also since occurred to me that some punctuation marks—such as the em dash—can separate words without spaces. – Colonel Panic Sep 30 '16 at 08:57
17

I want my python function to split a sentence (input) and store each word in a list

The str.split() method does this: it takes a string and splits it into a list:

>>> the_string = "this is a sentence"
>>> words = the_string.split(" ")
>>> print(words)
['this', 'is', 'a', 'sentence']
>>> type(words)
<class 'list'>  # (<type 'list'> in Python 2)
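
One caveat: passing " " explicitly treats every single space as a separator, so repeated spaces produce empty strings, whereas split() with no argument collapses runs of whitespace:

>>> "this  is   spaced".split(" ")
['this', '', 'is', '', '', 'spaced']
>>> "this  is   spaced".split()
['this', 'is', 'spaced']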
Karl Knechtel
dbr
16

If you want all the chars of a word/sentence in a list, do this:

print(list("word"))
#  ['w', 'o', 'r', 'd']


print(list("some sentence"))
#  ['s', 'o', 'm', 'e', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e']
BlackBeard
  • This answer belongs on https://stackoverflow.com/q/4978787 instead, although it's probably a duplicate of existing answers there. – Karl Knechtel Aug 04 '22 at 01:34
15

shlex has a .split() function. It differs from str.split() in that it does not preserve quotes and treats a quoted phrase as a single word:

>>> import shlex
>>> shlex.split("sudo echo 'foo && bar'")
['sudo', 'echo', 'foo && bar']

NB: it works well for Unix-like command line strings. It doesn't work for natural-language processing.
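
If you do want the quote characters kept in the tokens, shlex.split also accepts a posix flag; a small sketch of the non-POSIX mode:

>>> shlex.split("sudo echo 'foo && bar'", posix=False)  # quotes are preserved in the token
['sudo', 'echo', "'foo && bar'"]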

Vladimir Obrizan
Tarwin
    Use with caution, especially for NLP. It will crash on single quote strings like `"It's good."` with `ValueError: No closing quotation` – Igor Aug 09 '20 at 18:09
2

If you want to split a string into a list of words and the string contains punctuation, it's probably advisable to remove it. For example, str.split() splits the following string as

s = "Hi, these are words; these're, also, words."
words = s.split()
# ['Hi,', 'these', 'are', 'words;', "these're,", 'also,', 'words.']

where tokens such as `Hi,`, `words;`, and `also,` have punctuation attached to them. Python has a built-in string module that exposes a string of punctuation characters as an attribute (string.punctuation). One way to get rid of the punctuation is to simply strip it from each word:

import string
words = [w.strip(string.punctuation) for w in s.split()]
# ['Hi', 'these', 'are', 'words', "these're", 'also', 'words']

Another is to build a translation table that deletes the punctuation characters from the whole string before splitting:

table = str.maketrans('', '', string.punctuation)
words = s.translate(table).split() 
# ['Hi', 'these', 'are', 'words', 'thesere', 'also', 'words']

This doesn't handle words like these're well (the internal apostrophe is removed too), so to handle that case nltk.word_tokenize could be used, as tgray suggested. Then just filter out the tokens that consist entirely of punctuation:

import nltk
words = [w for w in nltk.word_tokenize(s) if w not in string.punctuation]
# ['Hi', 'these', 'are', 'words', 'these', "'re", 'also', 'words']
cottontail
1

Split the string into words without harming apostrophes inside words such as Moore's law (see input_1 and input_2 below):

def split_into_words(line):
    import re
    word_regex_improved = r"(\w[\w']*\w|\w)"
    word_matcher = re.compile(word_regex_improved)
    return word_matcher.findall(line)

#Example 1

input_1 = "computational power (see Moore's law) and "
split_into_words(input_1)

# output 
['computational', 'power', 'see', "Moore's", 'law', 'and']

#Example 2

input_2 = """Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad."""

split_into_words(input_2)
#output
['Oh',
 'you',
 "can't",
 'help',
 'that',
 'said',
 'the',
 'Cat',
 "we're",
 'all',
 'mad',
 'here',
 "I'm",
 'mad',
 "You're",
 'mad']
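
Be aware that, because the pattern only allows word characters and apostrophes inside a token, hyphenated words come apart at the hyphen:

>>> split_into_words("a state-of-the-art model")
['a', 'state', 'of', 'the', 'art', 'model']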
thrinadhn