
I am currently fiddling around with Regular Expressions and NLTK (Natural Language Toolkit). I want to tokenize sentences into words and punctuation. Contractions like "can't", "I'll" and so on should be recognised as words as well. I can't seem to find a regular expression that does this.

\w+(\'\w+)?|[!-~]

Why doesn't this regex work? I only get bad results like:

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

For sentences like these:

This is a test. Lulz another sentence. This can't be real.

Have I misunderstood something about regular expressions?

EDIT:

Code:

import re

# Raw string so the backslashes reach the regex engine unchanged
print(re.findall(r"\w+('\w+)?|[!-~]", "This is a test. Lulz another sentence. This can't be real."))
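A possible explanation, sketched under the assumption that the empty strings come from the parentheses in the pattern: when a pattern contains a capturing group, `re.findall` returns the text of that group rather than the whole match, and a group that did not take part in a match shows up as an empty string. Switching to a non-capturing group `(?:...)` should give the full tokens. The variable names below are only illustrative.

```python
import re

sentence = "This is a test. Lulz another sentence. This can't be real."

# Capturing group: findall returns only what ('\w+) matched,
# which is '' for every token without an apostrophe.
capturing = re.findall(r"\w+('\w+)?|[!-~]", sentence)

# Non-capturing group (?:...): findall returns the whole match.
tokens = re.findall(r"\w+(?:'\w+)?|[!-~]", sentence)

print(capturing)
print(tokens)
```

With the non-capturing version, "can't" stays together as one token and each full stop becomes its own token.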
Dwagner
