
I have the following code using scikit-learn to count ngram frequencies:

c = ["data. format", "data are format hello world"]
vectorizer = CountVectorizer(ngram_range=(1,2))
X = vectorizer.fit_transform(c)
terms = vectorizer.get_feature_names_out()
dense = X.todense()
df = pandas.DataFrame(dense, columns=terms)

The problem is that "data format" is registered as a token even though there is a period in the string ("data. format"). How can I get CountVectorizer to use punctuation to separate tokens? The documentation says punctuation is treated as a token separator by default, but that's not what I'm seeing.
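
For reference, printing the learned vocabulary from the snippet above makes the problem visible:

print(vectorizer.get_feature_names_out())
# Contains 'data format' alongside 'data are', 'are format', etc.,
# because the period is stripped rather than treated as a boundary.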

The answer to How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens? suggests using a tokenizer from NLTK, passing tokenizer=TreebankWordTokenizer().tokenize to CountVectorizer, but this actually keeps punctuation in the tokens. I want punctuation to separate tokens but not be part of any token.
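
For context, that suggestion looks roughly like the sketch below (assuming nltk is installed); the tokenizer keeps the period around, so punctuation still ends up in the features instead of acting purely as a separator:

from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer

vectorizer = CountVectorizer(ngram_range=(1, 2),
                             tokenizer=TreebankWordTokenizer().tokenize)
X = vectorizer.fit_transform(["data. format"])
print(vectorizer.get_feature_names_out())
# The period is kept by the tokenizer, so punctuation shows up inside the
# resulting n-gram features rather than being used purely as a boundary.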

  • I'm having trouble understanding. You don't want `"data format"` as a token. But it's not a token. It's a bi-gram of two tokens, which you've explicitly asked for (`ngram_range=(1,2)`). Is your goal to get all bi-grams except for ones which are separated by a punctuation mark? – Nick ODell May 09 '23 at 01:34
  • @NickODell I want punctuation marks such as a period to signal that the things on either side should never be combined, even as bigrams. If only "x. y" occurs in the corpus but never "x y", then it shouldn't be used – user20c May 09 '23 at 02:18

1 Answer


You need to use a custom tokenizer. With the tokenizer below, "x" and "y" are counted separately, but the bigram "x y" is never produced, since the two words are not part of the same sentence:

from sklearn.feature_extraction.text import CountVectorizer
import pandas
import re
from nltk.util import ngrams

# Define a custom tokenizer.
def custom_tokenizer(text):
    # Split the text into sentences.
    sentences = re.split(r"\.\s|\.\n", text)
    
    # Initialize an empty list to store the tokens.
    tokens = []
    
    # For each sentence...
    for sentence in sentences:
        # Split the sentence into words.
        words = sentence.split()
        
        # Add the words (1-grams) to the list of tokens.
        tokens.extend(words)
        
        # Add the bigrams to the list of tokens.
        bigrams = ngrams(words, 2)
        tokens.extend([' '.join(bigram) for bigram in bigrams])
    
    return tokens

c = ["x. y", "a b c d"]

# Create a CountVectorizer with the custom tokenizer.
vectorizer = CountVectorizer(tokenizer=custom_tokenizer)

# Apply the CountVectorizer to the data.
X = vectorizer.fit_transform(c)

# Get the feature names.
terms = vectorizer.get_feature_names_out()

# Convert the result to a dense format.
dense = X.todense()

# Convert the result to a DataFrame.
df = pandas.DataFrame(dense, columns=terms)

print(df)

This gives:

   a  a b  b  b c  c  c d  d  x  y
0  0    0  0    0  0    0  0  1  1
1  1    1  1    1  1    1  1  0  0
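
As a quick sanity check with the corpus from the question (reusing custom_tokenizer from above), the unwanted bigram no longer appears:

# Same vectorizer applied to the corpus from the question.
c = ["data. format", "data are format hello world"]
vectorizer = CountVectorizer(tokenizer=custom_tokenizer)
X = vectorizer.fit_transform(c)
print(vectorizer.get_feature_names_out())
# 'data format' is absent because the period ends the first sentence;
# in-sentence bigrams like 'data are' and 'are format' are still counted.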