If you want to split a string into a list of words and the string contains punctuation, it's probably advisable to remove it. For example, str.split() splits the following string as
s = "Hi, these are words; these're, also, words."
words = s.split()
# ['Hi,', 'these', 'are', 'words;', "these're,", 'also,', 'words.']
where 'Hi,', 'words;', 'also,' etc. have punctuation attached to them. Python has a built-in string module that has a string of punctuation characters as an attribute (string.punctuation).
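For reference, string.punctuation is a constant string of the ASCII punctuation characters, so printing it shows exactly what gets removed:

import string
print(string.punctuation)
# !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~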
One way to get rid of the punctuation is to simply strip it from each word:
import string
words = [w.strip(string.punctuation) for w in s.split()]
# ['Hi', 'these', 'are', 'words', "these're", 'also', 'words']
Another is to build a translation table that maps every punctuation character to None and apply it with str.translate:
table = str.maketrans('', '', string.punctuation)
words = s.translate(table).split()
# ['Hi', 'these', 'are', 'words', 'thesere', 'also', 'words']
The translate approach doesn't handle words like these're well (it becomes thesere), so to handle that case nltk.word_tokenize could be used, as tgray suggested. Then filter out the tokens that consist entirely of punctuation:
import nltk
words = [w for w in nltk.word_tokenize(s) if w not in string.punctuation]
# ['Hi', 'these', 'are', 'words', 'these', "'re", 'also', 'words']
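Note that word_tokenize needs NLTK's tokenizer data, so you may have to download it once first (e.g. nltk.download('punkt')). Also, w not in string.punctuation only catches single-character tokens; if the tokenizer ever emits multi-character punctuation tokens (such as ... or ``), a stricter filter checks every character. This is a small variation on the above, not part of the original answer:

import string
import nltk

# nltk.download('punkt')  # uncomment on first use to fetch the tokenizer data
words = [w for w in nltk.word_tokenize(s)
         if not all(c in string.punctuation for c in w)]
# ['Hi', 'these', 'are', 'words', 'these', "'re", 'also', 'words']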