How to predict a new document's category in Scikit Learn

Question

I am trying to build document classification software that can assign a document to a category such as Financial, Political, or Entertainment.

I am using the BBC data set. I built TF-IDF vectors from it, trained a RandomForestClassifier on them, and saved the trained model to a pickle file.

Now I can't figure out how to use the saved pickle file to predict the category of a new document. I have written the code that opens a new document, does all the preprocessing, and returns the preprocessed text. How do I use this text to classify the document with the saved model? I can't figure out how to add this document to my existing TF-IDF vectors.

I have a documents array holding the text of the files, and here is how I trained the model:

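import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# documents: the preprocessed texts; Y: their category labels
# Bag-of-words counts, fitted on all documents
vectorizer = CountVectorizer(max_features=1000, min_df=5, max_df=0.8)
X = vectorizer.fit_transform(documents).toarray()

# Re-weight the counts as TF-IDF
tfidfConverter = TfidfTransformer()
X = tfidfConverter.fit_transform(X).toarray()

X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size=0.3, random_state=0)

classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_Train, Y_Train)

Y_Predict = classifier.predict(X_Test)

# Only the classifier is saved here, not the vectorizer or the TF-IDF transformer
with open('text_classifier', 'wb') as pickleFile:
    pickle.dump(classifier, pickleFile)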

It's a bit unclear what you are asking. Are you having trouble with the feature extraction when loading the model? It might be because you didn't save the `TfidfTransformer` as well. Maybe using a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) would be more appropriate. - Zaccharie Ramzi, May 18 '19 at 11:02
I also noticed that you fitted your transformation on the test data, which is generally not good practice. - Zaccharie Ramzi, May 18 '19 at 11:03
@ZaccharieRamzi Thank you for your response. Can you give me a hint on how to save the `TfidfTransformer` with the model? I am sorry, I am new to this. I also did not understand your second point; can you please explain it a little? - Nuraj Chaminda, May 18 '19 at 11:46
Have you read the link on pipelines? It will give you some insight into how to save whole models (maybe also see [this question](https://stackoverflow.com/questions/34143829/sklearn-how-to-save-a-model-created-from-a-pipeline-and-gridsearchcv-using-jobli)). In any case, try to reformulate your question to make what you are asking clearer. The second point is just the basic "don't fit on your test data". The TF-IDF transformation is also part of your model, so it shouldn't "see" the test data during training. - Zaccharie Ramzi, May 18 '19 at 14:11
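
The pipeline suggestion in the comments could look roughly like the sketch below. It is only an illustration under stated assumptions, not the asker's actual setup: the tiny corpus, its labels, and the file name `text_classifier.pkl` are made up, `TfidfVectorizer` stands in for the `CountVectorizer` + `TfidfTransformer` pair, and the vectorizer uses default parameters because `min_df=5` would discard every term in a corpus this small. The split happens before fitting, so the TF-IDF step never sees the test documents, and the whole pipeline is pickled so the fitted vocabulary travels with the classifier.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical stand-in corpus; replace with the real preprocessed documents and labels.
documents = [
    "shares rise as bank profits beat market forecasts",
    "central bank holds interest rates amid inflation fears",
    "oil prices fall and stock markets slide",
    "company reports record quarterly earnings and revenue",
    "parliament votes on new election reform bill",
    "prime minister announces cabinet reshuffle",
    "opposition party leader calls for early election",
    "government ministers debate immigration policy",
    "new film tops the weekend box office",
    "singer wins best album at music awards",
    "actor cast in lead role of television drama",
    "band announces world tour and new album",
]
labels = ["business"] * 4 + ["politics"] * 4 + ["entertainment"] * 4

# Split the raw texts first so the vectorizer is fitted on training data only.
X_train, X_test, y_train, y_test = train_test_split(
    documents, labels, test_size=0.25, random_state=0, stratify=labels
)

# One object holds both the feature extraction and the classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(X_train, y_train)

# Save the whole fitted pipeline (a temporary directory is used here for illustration).
model_path = os.path.join(tempfile.mkdtemp(), "text_classifier.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later: load the pipeline and classify a new, already preprocessed document.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

new_document = "the central bank raised interest rates"
predicted_category = loaded_model.predict([new_document])[0]
print(predicted_category)
```

Note that `predict` expects a list of strings, even for a single document, and the new text should go through the same preprocessing as the training documents. Nothing is "added" to the existing TF-IDF matrix: the loaded pipeline applies the vocabulary and IDF weights learned during training to the new text.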
