I have the following code using scikit-learn to count ngram frequencies:
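import pandas
from sklearn.feature_extraction.text import CountVectorizer

c = ["data. format", "data are format hello world"]
# Count unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(c)
terms = vectorizer.get_feature_names_out()
# One row per document, one column per n-gram
dense = X.toarray()
df = pandas.DataFrame(dense, columns=terms)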
The problem is that "data format" is registered as a token even though there is a period in the string ("data. format"). How can I get CountVectorizer to use punctuation to separate tokens? The documentation says punctuation will be used as a separator by default, but that is not happening.
The answer to "How to use sklearn's CountVectorizerand() to get ngrams that include any punctuation as separate tokens?" suggests using a tokenizer from nltk, passing tokenizer=TreebankWordTokenizer().tokenize to CountVectorizer, but this actually keeps the punctuation as tokens. I want punctuation to separate tokens (so that no n-gram spans it) but not to be part of any token.