I'm trying to extract a vocabulary of unigrams, bigrams, and trigrams using scikit-learn's TfidfVectorizer. This is my current code:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

max_df_param = .003
use_idf = True

vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(1, 1), max_features=2000, use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
unigrams = vectorizer.get_feature_names()

vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(2, 2), max_features=max(1, int(len(unigrams) / 10)), use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
bigrams = vectorizer.get_feature_names()

vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(3, 3), max_features=max(1, int(len(unigrams) / 10)), use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
trigrams = vectorizer.get_feature_names()

vocab = np.concatenate((unigrams, bigrams, trigrams))
However, I would like to exclude numbers and words that contain numbers, and the current output contains terms such as "0 101 110 12 15th 16th 180c 180d 18th 190 1900 1960s 197 1980 1b 20 200 200a 2d 3d 416 4th 50 7a 7b".
I tried to include only words with alphabetical characters by passing the following regex to the token_pattern parameter:
vectorizer = TfidfVectorizer(max_df=max_df_param,
                             token_pattern=u'(?u)\b\^[A-Za-z]+$\b',
                             stop_words='english', ngram_range=(1, 1), max_features=2000, use_idf=use_idf)
but this raises: ValueError: empty vocabulary; perhaps the documents only contain stop words
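Probing the pattern directly (with a made-up test string) suggests it never matches anything, which would explain the empty vocabulary:

```python
import re

# The exact pattern from the call above, as a plain (non-raw) string.
pattern = u'(?u)\b\^[A-Za-z]+$\b'

# Without an r'' prefix, Python interprets each '\b' as a backspace
# character (\x08) before the regex engine ever sees it, so the
# compiled pattern cannot match ordinary words.
print(repr(pattern))

# TfidfVectorizer tokenizes by calling findall() with token_pattern;
# here it finds nothing, hence the empty-vocabulary ValueError.
print(re.findall(pattern, "only plain alphabetical words here"))
```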
I have also tried a pattern that only removes numbers, but I still get the same error.
Is my regex incorrect, or am I using TfidfVectorizer incorrectly? (I have also tried removing the max_features argument.)
Thank you!