NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted python

Question

Goal: Predict labels on my original data

Background: I constructed an SVM classifier

I am using the following code:

0) Import modules

    import numpy as np
    from sklearn import datasets
    from sklearn import svm
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import precision_score, recall_score, accuracy_score
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import precision_recall_fscore_support

1) X_list and y

    type(X_list)  # list of strings
    len(X_list)   # 2163
    type(y)       # numpy.ndarray
    len(y)        # 2163

2) Convert X_list from strings to numeric features using TF-IDF

    tfidf = TfidfVectorizer()
    X_vec = tfidf.fit_transform(X_list)
    X = X_vec.toarray()

3) X shape

    X.shape  # (2163, 8753)

4) 10-fold cross-validation and SVM

    skf = StratifiedKFold(n_splits=10)
    clf = svm.SVC(kernel='linear', C=1)

5) Loop through 10 folds

    precision_scores = []
    recall_scores = []
    f_scores = []

    for train_index, test_index in skf.split(X, y):
        X_train = X[train_index]
        X_test = X[test_index]
        y_train = y[train_index]
        y_test = y[test_index]

        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)

        # `scores` was not defined in the posted snippet; average='weighted' is an assumption
        scores = precision_recall_fscore_support(y_test, y_pred, average='weighted')
        precision_scores.append(scores[0])
        recall_scores.append(scores[1])
        f_scores.append(scores[2])

6) Predict on original dataset X_original

    type(X_original)  # list of strings
    len(X_original)   # 2163

7) Convert X_original from strings to numeric features

    tfidf = TfidfVectorizer()
    X_original_transform = tfidf.transform(X_original)

But when I do so, I get the following error:

`NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.`

SO has a similar question, "NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted", but it seems different from my issue.

8) How do I fix this error?

Can you post the complete code? Not just snippets. Are you initializing the `tfidf` again anywhere? — Vivek Kumar, Mar 01 '18 at 06:23
I have updated my code to show the almost full version (Except for the actual `X` or `y`) — , Mar 01 '18 at 22:10

score 1 · Answer 1 · answered Mar 02 '18 at 06:35

In point (7) above, you initialize tfidf again, which creates a new TfidfVectorizer instance that has no fitted vocabulary. You then call transform() on it without fitting it first, hence the error. You need to call fit() on it, the same way you did in point (2).

Change point (7) to:

    tfidf = TfidfVectorizer()
    # fit_transform should be used here.
    X_original_transform = tfidf.fit_transform(X_original)
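
One caveat worth noting: a vectorizer re-fitted on X_original builds its own vocabulary, so its columns only line up with what clf was trained on if both texts happen to produce the same vocabulary. If the aim is to predict with the already-trained classifier, an alternative is to keep the vectorizer that was fitted on the training text and call only transform() on the new text. A minimal sketch of that idea, using made-up toy documents (train_docs, train_labels and new_docs are hypothetical stand-ins, not the asker's data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy data standing in for X_list, y and X_original.
train_docs = ["good movie", "great film", "bad movie", "awful film"]
train_labels = [1, 1, 0, 0]
new_docs = ["great movie", "awful movie"]

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_docs)  # learns the vocabulary and IDF weights

clf = SVC(kernel="linear", C=1)
clf.fit(X_train, train_labels)

# Reuse the SAME fitted vectorizer: transform() only, no second fit.
X_new = tfidf.transform(new_docs)
predictions = clf.predict(X_new)

print(X_train.shape[1] == X_new.shape[1])  # both matrices share one feature space
```

Because the second call is transform() rather than fit_transform(), the new matrix has exactly the columns the classifier expects, and no NotFittedError is raised.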

Also, in point (2) you fit the TfidfVectorizer on the whole dataset and only then split it into train and test sets. This is not recommended because it leaks information about the test data into the model during training. Consider how this works in a real-world situation: do you have all the information about the data you want to predict in advance? No. You train the model on the available data and then use it on unseen data. Your current code in point (2) breaks this.

Always split into train and test first, then fit (fit()) only on the training data, and use what was learned there to transform (transform()) the test data.

Change it like this:

1) First, remove the code in point (2). We will do that step inside the fold loop instead.

2) Change point (5) to:

    for train_index, test_index in skf.split(X_list, y):
        X_train = X_list[train_index]
        X_test = X_list[test_index]
        y_train = y[train_index]
        y_test = y[test_index]

        tfidf = TfidfVectorizer()

        # Fit the vectorizer on the training fold only
        X_train = tfidf.fit_transform(X_train)
        clf.fit(X_train, y_train)

        # Only call transform() here
        X_test = tfidf.transform(X_test)
        y_pred = clf.predict(X_test)

        # `scores` was not defined in the posted snippet; average='weighted' is an assumption
        scores = precision_recall_fscore_support(y_test, y_pred, average='weighted')
        precision_scores.append(scores[0])
        recall_scores.append(scores[1])
        f_scores.append(scores[2])
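
The same fit-on-train, transform-on-test pattern can also be expressed with a scikit-learn Pipeline, which re-fits the vectorizer on the training part of every fold automatically. The following is only a sketch under assumptions: the toy docs and labels are hypothetical, n_splits is reduced to 2 so the tiny example can run, and macro averaging is an arbitrary choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for X_list and y.
docs = ["good movie", "great film", "nice plot", "fine cast",
        "bad movie", "awful film", "dull plot", "poor cast"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier are fitted together, fold by fold,
# so the test part of a fold never influences the vocabulary.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1))

skf = StratifiedKFold(n_splits=2)  # the question uses n_splits=10
results = cross_validate(
    model, docs, labels, cv=skf,
    scoring=["precision_macro", "recall_macro", "f1_macro"],
)

# One score per fold for each requested metric.
precision_scores = results["test_precision_macro"]
recall_scores = results["test_recall_macro"]
f_scores = results["test_f1_macro"]
```

A side benefit of this form is that the raw list of strings is passed in directly, so no manual indexing of X_list is needed.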

Thank you for the detailed pointers and code. However, I get the following error when I try what you have listed above `TypeError Traceback (most recent call last) in () 1 #loop through 10 folds 2 for train_index, test_index in skf.split(X_list, y): ----> 3 X_train = X_list[train_index] 4 X_test = X_list[test_index] 5 y_train = y[train_index]` **TypeError: only integer scalar arrays can be converted to a scalar index** — , Mar 02 '18 at 14:47
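
The TypeError in the comment above comes from indexing a plain Python list with a NumPy index array: skf.split() yields arrays of positions, and a list accepts only a single integer or a slice. Two possible workarounds are sketched below; the sample documents and the train_index values are made up for illustration.

```python
import numpy as np

# Hypothetical stand-ins for X_list and one fold's train_index.
X_list = ["doc a", "doc b", "doc c", "doc d"]
train_index = np.array([0, 2, 3])

# Option 1: pick the elements one by one and keep a plain list.
X_train = [X_list[i] for i in train_index]

# Option 2: convert once to an object array, which supports array indexing.
X_array = np.asarray(X_list, dtype=object)
X_train_alt = X_array[train_index]

print(X_train)
```

Either result is an iterable of strings, which is what TfidfVectorizer.fit_transform() and transform() expect as input.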
