I'm tokenizing text with nltk, just sentences fed to wordpunct_tokenize. This splits contractions (e.g. "don't" becomes 'don', "'", 't'), but I want to keep them as one word. I'd like more precise control over tokenization, so I need to dig further into the nltk tokenize module than the basic tokenizers.
I'm guessing this is a common problem, and I'd like feedback from others who have dealt with it before.
EDIT:
Yeah, this is a general, scattershot question, I know.
Also, as a novice to NLP, do I need to worry about contractions at all?
EDIT:
The SExprTokenizer or TreebankWordTokenizer seems to do what I need for now.
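For anyone hitting the same thing, here is a minimal sketch of the splitting behaviour and one possible regex workaround. It uses only the standard library `re` module. The first pattern is the one wordpunct_tokenize is built on; the second pattern and the `tokenize` helper are my own assumptions for illustration, not part of nltk. It only handles the straight apostrophe ('), not the curly one.

```python
import re

# The pattern behind wordpunct_tokenize: runs of word characters,
# or runs of non-space punctuation. The apostrophe is punctuation,
# so a contraction is cut into three tokens.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

# Hypothetical variant (an assumption, not an nltk default):
# allow apostrophes inside a word as long as word characters follow.
CONTRACTION_AWARE = re.compile(r"\w+(?:'\w+)*|[^\w\s]+")


def tokenize(text, pattern=CONTRACTION_AWARE):
    """Return all non-overlapping matches of pattern in text."""
    return pattern.findall(text)


sentence = "I don't think it's broken."
print(tokenize(sentence, WORDPUNCT))  # contractions split apart
print(tokenize(sentence))             # contractions kept whole
```

The same pattern string should also work with nltk's RegexpTokenizer if you would rather stay inside the nltk API.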