I am currently fiddling around with regular expressions and NLTK (Natural Language Toolkit). I want to tokenize sentences into words and punctuation. Contractions like "can't", "I'll" and so on should be recognised as single words as well. I can't seem to find a regular expression that does this. My current attempt is:
\w+(\'\w+)?|[!-~]
Why doesn't this regex work? All I get back is a list of empty strings like this:
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
For sentences like these:
This is a test. Lulz another sentence. This can't be real.
I am afraid that I haven't understood Regular Expressions?
EDIT:
Code:
import re
# Raw string so that \w reaches the regex engine unchanged
re.findall(r"\w+('\w+)?|[!-~]", "This is a test. Lulz another sentence. This can't be real.")
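
A possible explanation, offered as a sketch rather than a definitive answer: when a pattern contains a capturing group, re.findall returns what the group matched instead of the whole match, so every token without an apostrophe part shows up as an empty string. Assuming that is the cause here, making the group non-capturing with (?:...) should give the full tokens. The variable names text, captured and tokens below are only for illustration.

```python
import re

text = "This is a test. Lulz another sentence. This can't be real."

# With a capturing group, findall returns only the text matched by group 1,
# which is empty whenever the optional ('\w+) part did not take part in the match.
captured = re.findall(r"\w+('\w+)?|[!-~]", text)
print(captured)

# With a non-capturing group (?:...), findall returns the whole match instead.
tokens = re.findall(r"\w+(?:'\w+)?|[!-~]", text)
print(tokens)
```

The second pattern is otherwise the same as the one in the question: a word with an optional apostrophe suffix, or else any single printable ASCII character other than a space.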