I have the following data frame for a text classification problem:
X = df['text'].apply(clean_text)
clean_text does some of the basic cleaning required for this particular case, like removal of special characters and conversion of numeric values in the text to buckets. As a next step, I create a pipeline as shown below:
text_clsf = Pipeline([('tfidf', TfidfVectorizer(use_idf=True, max_df=max_df, min_df=min_df,
                                                stop_words=stop_words, ngram_range=(1, 2))),
                      ('clsf', LinearSVC(C=C))])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
text_clsf.fit(X_train,y_train)
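For reference, a minimal self-contained version of this training step. The corpus, labels, and hyperparameter values below are dummy placeholders (my real max_df, min_df, stop_words, and C come from elsewhere in my code):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Dummy corpus and labels standing in for df['text'].apply(clean_text) and y
X = ["good product works well", "terrible quality broke fast",
     "great value very happy", "awful waste of money"] * 5
y = [1, 0, 1, 0] * 5

# Placeholder hyperparameter values, not my real ones
text_clsf = Pipeline([
    ('tfidf', TfidfVectorizer(use_idf=True, max_df=0.95, min_df=1,
                              stop_words='english', ngram_range=(1, 2))),
    ('clsf', LinearSVC(C=1.0)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# The vectorizer is fitted inside the pipeline as part of fit()
text_clsf.fit(X_train, y_train)
print(text_clsf.score(X_test, y_test))
```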
The model was tested on the test data and saved using the pickle library (pickle.dump, since I am writing to a file object):
pickle.dump(text_clsf, open(file_path, 'wb'))
To use this model on real-world data, I load it with pickle again (pickle.load, since I am reading from a file object), and apply the same clean_text function to my real-world text:
text_value = 'this is my real world sample text to predict'
text_cleaned = clean_text(text_value)
text_clsf = pickle.load(open(file_path, 'rb'))
My concern is with the next step: can I call predict directly, as shown below? I haven't created any bigrams here, even though the TfidfVectorizer was given ngram_range=(1, 2); text_cleaned contains just the cleaned text.
text_clsf.predict([text_cleaned])
Or does the text_clsf pipeline itself take care of creating the bigrams?
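To make the question concrete, here is a minimal round trip I would expect to work, using in-memory pickle bytes instead of a file and a tiny dummy corpus in place of my real data. If the pipeline re-applies the fitted TfidfVectorizer at predict time, then passing only the raw cleaned string should be enough:

```python
import pickle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny dummy corpus standing in for the real cleaned training text
X_train = ["cheap flimsy case", "sturdy well made case"] * 10
y_train = [0, 1] * 10

text_clsf = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clsf', LinearSVC(C=1.0)),
])
text_clsf.fit(X_train, y_train)

# Serialize and restore the whole fitted pipeline, vectorizer included
restored = pickle.loads(pickle.dumps(text_clsf))

# predict() receives only the raw cleaned string; the fitted
# TfidfVectorizer step inside the pipeline does the tokenization and
# unigram/bigram extraction before the classifier sees anything
print(restored.predict(["sturdy well made case"]))
```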