
I'm trying to extract a vocabulary of unigrams, bigrams, and trigrams using SkLearn's TfidfVectorizer. This is my current code:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

max_df_param = .003
use_idf = True

# unigrams
vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(1, 1), max_features=2000, use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
unigrams = vectorizer.get_feature_names()

# bigrams, capped at a tenth of the unigram vocabulary size
vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(2, 2), max_features=max(1, int(len(unigrams)/10)), use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
bigrams = vectorizer.get_feature_names()

# trigrams, same cap
vectorizer = TfidfVectorizer(max_df=max_df_param, stop_words='english', ngram_range=(3, 3), max_features=max(1, int(len(unigrams)/10)), use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
trigrams = vectorizer.get_feature_names()

vocab = np.concatenate((unigrams, bigrams, trigrams))

However, I would like to exclude numbers and words that contain numbers, and the current output contains terms such as "0 101 110 12 15th 16th 180c 180d 18th 190 1900 1960s 197 1980 1b 20 200 200a 2d 3d 416 4th 50 7a 7b".

I tried to include only words with alphabetic characters by using the token_pattern parameter with the following regex:

vectorizer = TfidfVectorizer(max_df = max_df_param, 
                            token_pattern=u'(?u)\b\^[A-Za-z]+$\b', 
                            stop_words='english', ngram_range=(1,1), max_features=2000, use_idf=use_idf)

but this returns: ValueError: empty vocabulary; perhaps the documents only contain stop words

I have also tried only removing numbers, but I still get the same error.

Is my regex incorrect, or am I using TfidfVectorizer incorrectly? (I have also tried removing the max_features argument.)

Thank you!

Matt

1 Answer


That's because your regex is wrong.

1) You are using ^ and $, which anchor a match to the start and end of the string. That means the pattern can only match a document that consists entirely of a single run of alphabetic characters (no numbers, no spaces, no other characters), so on real documents it matches nothing. You don't want that, so remove the anchors (see the first snippet after these points).

See the details about special characters here: https://docs.python.org/3/library/re.html#regular-expression-syntax

2) You are passing the pattern as a plain (non-raw) string, so Python applies its own escape processing before the regex engine ever sees it: each \b becomes a backspace character instead of a word-boundary token, which is not the pattern you intended. Either escape the backslashes yourself (write \\b) or, more simply, use the r prefix (see the second snippet after these points).

3) The u prefix is for unicode string literals. Unless your regex pattern contains special unicode characters, it is not needed either. See more about that here: Python regex - r prefix
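
To see points 1 and 2 concretely: the vectorizer's tokenizer essentially runs re.findall with your token_pattern over each document, so an anchored pattern finds nothing in any document longer than one word (the document below is made up for illustration):

import re

doc = "the 1980s saw 3d films"

# Anchored pattern: has to match the entire document at once, so it finds nothing
print(re.findall(r'^[A-Za-z]+$', doc))    # []

# Word-boundary pattern: picks out just the purely alphabetic tokens
print(re.findall(r'\b[A-Za-z]+\b', doc))  # ['the', 'saw', 'films']

And comparing the plain and raw string literals shows how Python consumes the backslashes before the regex engine ever sees them:

# Each \b in the plain string becomes a single backspace character
print(len('\b[A-Za-z]+\b'))   # 11
# The raw string keeps backslash + 'b' for the regex engine
print(len(r'\b[A-Za-z]+\b'))  # 13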

So finally your correct token_pattern should be:

token_pattern=r'(?u)\b[A-Za-z]+\b'
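
For example, here is a minimal sketch with made-up documents standing in for dataframe[column] (the data and names are only for illustration):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for dataframe[column]
docs = pd.Series(["the 1980s saw 3d films",
                  "films from 1900 to 2000",
                  "the 18th century"])

vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b[A-Za-z]+\b',
                             stop_words='english', ngram_range=(1, 1))
X = vectorizer.fit_transform(docs)

# Only purely alphabetic terms survive
# (newer scikit-learn versions use get_feature_names_out())
print(vectorizer.get_feature_names())  # ['century', 'films', 'saw']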
Vivek Kumar