I have the following data frame for a text classification problem:
X = df['text'].apply(clean_text)
clean_text does some of the basic cleaning required for this particular case, like removal of special characters and conversion of numeric values in the text to buckets. As a next step, I create a pipeline as shown below:
text_clsf = Pipeline([('tfidf', TfidfVectorizer(use_idf=True, max_df=max_df, min_df=min_df,
                                                stop_words=stop_words, ngram_range=(1, 2))),
                      ('clsf', LinearSVC(C=C))])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
text_clsf.fit(X_train,y_train)
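For reference, a minimal self-contained version of this training step. The corpus, labels, and hyperparameter values below are dummy placeholders (my real max_df, min_df, stop_words, and C come from elsewhere in my code):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Dummy corpus and labels standing in for df['text'].apply(clean_text) and y
X = ["good product works well", "terrible quality broke fast",
     "great value very happy", "awful waste of money"] * 5
y = [1, 0, 1, 0] * 5

# Placeholder hyperparameter values, not my real ones
text_clsf = Pipeline([
    ('tfidf', TfidfVectorizer(use_idf=True, max_df=0.95, min_df=1,
                              stop_words='english', ngram_range=(1, 2))),
    ('clsf', LinearSVC(C=1.0)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# The vectorizer is fitted inside the pipeline as part of fit()
text_clsf.fit(X_train, y_train)
print(text_clsf.score(X_test, y_test))
```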
The model was tested on the test data and saved using the pickle library (pickle.dump, since I am writing to a file object):
pickle.dump(text_clsf, open(file_path, 'wb'))
To use this model on real-world data, I load it with pickle again (pickle.load, since I am reading from a file object), and apply the same clean_text function to my real-world text:
text_value = 'this is my real world sample text to predict'
text_cleaned = clean_text(text_value)
text_clsf = pickle.load(open(file_path, 'rb'))
My concern is with the next step: can I call predict directly, as shown below? I haven't created any bigrams here, even though the TfidfVectorizer was given ngram_range=(1, 2); text_cleaned contains just the cleaned text.
text_clsf.predict([text_cleaned])
Or does the text_clsf pipeline itself take care of creating the bigrams?
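To make the question concrete, here is a minimal round trip I would expect to work, using in-memory pickle bytes instead of a file and a tiny dummy corpus in place of my real data. If the pipeline re-applies the fitted TfidfVectorizer at predict time, then passing only the raw cleaned string should be enough:

```python
import pickle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny dummy corpus standing in for the real cleaned training text
X_train = ["cheap flimsy case", "sturdy well made case"] * 10
y_train = [0, 1] * 10

text_clsf = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clsf', LinearSVC(C=1.0)),
])
text_clsf.fit(X_train, y_train)

# Serialize and restore the whole fitted pipeline, vectorizer included
restored = pickle.loads(pickle.dumps(text_clsf))

# predict() receives only the raw cleaned string; the fitted
# TfidfVectorizer step inside the pipeline does the tokenization and
# unigram/bigram extraction before the classifier sees anything
print(restored.predict(["sturdy well made case"]))
```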