Random Forest algorithm as an input in Python

Question

I've built, trained, and saved a Random Forest (RF) model in Python with the following features:

Number of deleted files (integer)
Path (string)
Severity (integer)

Since scikit-learn doesn't handle strings directly, I converted the data using CountVectorizer. How can I take the path (string) entered by the user and convert it to the same format the saved model was trained on, in order to predict Severity? Note that predicting directly on a string, print(clf.predict([[5, '/some/path']])), results in an error:

ValueError: Iterable over raw text documents expected, string object received.

No, both solutions produce another error "TypeError: float() argument must be a string or a number, not 'CountVectorizer'" — Ray, Dec 17 '21 at 01:10
Then please open a new question with a full [mre], explaining that these solutions do not work (and link also here). — desertnaut, Dec 17 '21 at 09:06

score 0 · Answer 1 · answered Dec 17 '21 at 00:40

If the model was trained on the path after it was transformed with CountVectorizer, you need to apply the same fitted transformation at inference time as well. It should look something like this:

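from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Reuse the vectorizer that was fitted during training (e.g. saved and loaded
# alongside the model); a fresh CountVectorizer() has no vocabulary yet.
# vectorizer = CountVectorizer()
# vectorizer.fit(X_train)

# transform() expects an iterable of documents, so wrap the path in a list.
path_features = vectorizer.transform(['/some/path'])

# Assumes the count was the first column during training; adjust if not.
X_new = hstack([csr_matrix([[5]]), path_features], format='csr')
print(clf.predict(X_new))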
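For reference, here is a minimal end-to-end sketch of the same idea: fit the vectorizer once, save it together with the model, and reuse it on the user input. The training data, the build_features helper, the use of pickle, and the "count first, then path columns" layout are all assumptions made for illustration; they are not taken from the question, so adapt them to however the original model was trained and saved.

```python
import pickle

from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training data, for illustration only.
deleted_counts = [1, 50, 3, 80]
paths = ["/home/user/tmp", "/etc/passwd", "/home/user/cache", "/etc/shadow"]
severity = [0, 1, 0, 1]


def build_features(vectorizer, counts, path_strings):
    """Hypothetical helper: numeric column first, then the vectorized paths."""
    count_column = csr_matrix([[c] for c in counts])
    path_matrix = vectorizer.transform(path_strings)  # needs a list of strings
    return hstack([count_column, path_matrix], format="csr")


# Training: fit the vectorizer once, on the training paths.
vectorizer = CountVectorizer()
vectorizer.fit(paths)
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(build_features(vectorizer, deleted_counts, paths), severity)

# Persist the fitted vectorizer together with the model.
saved = pickle.dumps({"vectorizer": vectorizer, "model": clf})

# Inference: restore both and reuse the fitted vectorizer on the user input.
restored = pickle.loads(saved)
X_new = build_features(restored["vectorizer"], [5], ["/some/path"])
prediction = restored["model"].predict(X_new)
print(prediction)
```

The key point is that the vectorizer is part of the model's state: if only clf is saved, the vocabulary (and therefore the column layout) cannot be reproduced at prediction time.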
