
Using Python 3.x, I need to split a sentence up into individual words and punctuation.

e.g. "This is a sentence."

split up into

["This", "is", "a", "sentence", "."]

I'm trying to match words using a for loop. However, when I try to match the word "sentence", it doesn't match: calling .split() on whitespace produces "sentence." instead of "sentence", so the trailing punctuation breaks the comparison. What would be the best way to go about this?

  • Add punctuation to the list of tokens to split on. Should be a regex: http://stackoverflow.com/questions/10974932/python-split-string-based-on-regular-expression – duffymo Jul 18 '16 at 12:18
  • Use a tokenizer: http://www.nltk.org/api/nltk.tokenize.html – ayhan Jul 18 '16 at 12:19
  • Indeed, don't try to reinvent the wheel: the [nltk toolkit's Punkt tokenizer](http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt) works fairly well. – Nander Speerstra Jul 18 '16 at 12:20
  • And check out this [question/answer](http://stackoverflow.com/questions/15057945/how-do-i-tokenize-a-string-sentence-in-nltk) which gives an example of the nltk tokenizer. – Nander Speerstra Jul 18 '16 at 12:21
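
A minimal sketch of the regex approach suggested in the comments above, using only the standard library `re` module. The helper name `tokenize` and the pattern are illustrative assumptions, not taken from any of the linked answers:

```python
import re

def tokenize(sentence):
    # \w+ matches a run of word characters (letters, digits, underscore);
    # [^\w\s] matches one character that is neither a word character
    # nor whitespace, i.e. a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("This is a sentence."))
# ['This', 'is', 'a', 'sentence', '.']
```

Note that this simple pattern also splits contractions such as "don't" into three tokens; a dedicated tokenizer like the NLTK one mentioned above is likely the better choice for real text.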

1 Answer


Use split(" .,:") and whatever other separators you'd like.

Clusty
  • with `s = "this is also a sentence, really"`, `s.split(" .,;")` will result in `['this is also a sentence, really']`, because you split on the combination " .,;". – Nander Speerstra Jul 18 '16 at 12:35
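
To illustrate the point made in the comment above: `str.split` treats its argument as one literal separator string, not as a set of characters. A rough sketch of the difference, assuming `re.split` with a character class is an acceptable substitute (the variable names are illustrative):

```python
import re

s = "this is also a sentence, really"

# str.split looks for the exact substring " .,;", which never occurs in s,
# so the whole string comes back as a single element.
whole = s.split(" .,;")

# A regex character class splits on any ONE of the listed characters;
# the + collapses runs such as ", " into a single split point.
# The filter drops empty strings left by leading or trailing separators.
words = [w for w in re.split(r"[ .,:;]+", s) if w]

print(whole)
print(words)
```

This variant discards the punctuation, so it does not by itself produce the "." token asked for in the question.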