
I want to use RandomForestClassifier for sentiment classification. x contains the data as text strings, so I used LabelEncoder to convert the strings. y contains numeric data. My code is this:

import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv('data.csv')

x = data['Reviews']
y = data['Ratings']

le = LabelEncoder()
x_encoded = le.fit_transform(x)

x_train, x_test, y_train, y_test = train_test_split(x_encoded, y, test_size=0.2)

x_train = x_train.reshape(-1,1)
x_test = x_test.reshape(-1,1)

clf = RandomForestClassifier(n_estimators=100)

clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)

Then I printed out the accuracy like below:

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

And here's the output:

Accuracy: 0.5975

I have read that random forests have high accuracy because of the number of decision trees participating in the process. But I think the accuracy here is much lower than it should be. I have looked for similar questions on Stack Overflow, but I couldn't find a solution for my problem.

Is there a problem in my code that uses the random forest library? Or are there cases where random forests do not work well?

deponovo
Hally
  • You should focus on preprocessing the `Reviews` column. – Prakash Dahal Jan 12 '22 at 07:10
  • sklearn doc for [`LabelEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) mentions: *"This transformer should be used to encode target values, i.e. y, and not the input X"*. You should familiarize yourself with text representation methods if you want to train decent models for text data. Btw if you look at the features after encoding you will understand the problem. – Erwan Jan 12 '22 at 16:50

1 Answer


It is not a problem with Random Forests or the library; it is rather a problem with how you transform your text input into a feature or feature vector.

What LabelEncoder does is this: given some labels like ["a", "b", "c"], it transforms those labels into numeric values between 0 and n-1, with n being the number of distinct input labels. However, I assume Reviews contains free text and not pure labels, so to say. This means all your reviews (if not 100% identical) are transformed into different labels. Eventually, this leads to your classifier doing essentially random things given that input. So you need something different to transform your textual input into a numeric input that Random Forests can work on.
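To make the effect described above concrete, here is a minimal sketch. The toy reviews are hypothetical stand-ins for the `Reviews` column; only `LabelEncoder` itself comes from the question.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy reviews standing in for data['Reviews'].
reviews = ["great product", "terrible product", "great value", "great product"]

le = LabelEncoder()
codes = le.fit_transform(reviews)

# Each distinct string becomes one integer, assigned in sorted order.
print(list(le.classes_))  # ['great product', 'great value', 'terrible product']
print(codes.tolist())     # [0, 2, 1, 0]
```

The integers only identify whole strings. "great product" (0) and "great value" (1) share a word, but nothing in the codes reflects that, so the classifier gets a single column of near-unique IDs with no sentiment signal.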

As a simple start, you can try something like TF-IDF or a simple count vectorizer. Both are available in sklearn; see https://scikit-learn.org/stable/modules/feature_extraction.html, section 6.2.3, "Text feature extraction". There are more sophisticated ways of transforming texts into numeric vectors, but that should be a good start for you to understand what has to happen conceptually.
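As a rough illustration of what these vectorizers produce, the following sketch runs `CountVectorizer` and `TfidfVectorizer` with default settings on a few hypothetical reviews (the sample texts are assumptions, not data from the question).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy reviews standing in for data['Reviews'].
reviews = ["great product", "terrible product", "great value"]

# Bag of words: one column per distinct word, values are word counts.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(reviews)

# Recover the column order from the learned word -> column-index mapping.
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(vocab)                      # ['great', 'product', 'terrible', 'value']
print(counts.toarray().tolist())  # [[1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 0, 1]]

# TF-IDF: same columns, but counts are reweighted so that words
# appearing in many documents contribute less.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(reviews)
print(tfidf.shape)                # (3, 4)
```

Unlike the label-encoded column, reviews that share words now share non-zero columns, which gives the trees something meaningful to split on.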

A last important note: fit those vectorizers only on the training set and not on the full dataset. Otherwise, you might leak information from training to evaluation/testing, because the vectorizer would already have seen the test reviews. A good way of doing this is to build a sklearn pipeline that consists of a feature transformation step and the classifier.
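One possible shape for such a pipeline is sketched below. The toy reviews and ratings, the step names `"tfidf"` and `"clf"`, and the `random_state` values are assumptions for illustration; in the question, the data would come from `data['Reviews']` and `data['Ratings']`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for data['Reviews'] and data['Ratings'].
reviews = [
    "great product, works well", "terrible, broke after a day",
    "really great value", "awful quality, terrible support",
    "works great, very happy", "broke quickly, awful",
    "happy with this product", "terrible value, not happy",
]
ratings = [5, 1, 5, 1, 5, 1, 5, 1]

# Split the raw text first, so the vectorizer never sees the test reviews.
x_train, x_test, y_train, y_test = train_test_split(
    reviews, ratings, test_size=0.25, random_state=0
)

# The pipeline fits the vectorizer and the classifier on the training data only.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(x_train, y_train)

# predict() applies the already-fitted vectorizer to the test text.
y_pred = model.predict(x_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

With only eight samples the printed accuracy is not meaningful; the point is that `fit` and `predict` take raw strings and the split happens before any feature extraction.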

Simon Hawe
  • For the last note, what do you mean by leak information from training to evaluation? Do you mean if preprocessing on the whole dataset of x is done before train_test_split? – Hally Jan 13 '22 at 08:03
  • Yes exactly. So you would include information in your training set that you don't have in reality. – Simon Hawe Jan 13 '22 at 08:05
  • However, if I fit the count vectorizer only on the training set, how should I test the trained model? I thought I should predict the y label from the vectorized test dataset. Do I need to process the training set and the test set separately? – Hally Jan 13 '22 at 09:18
  • You are fitting the vectorizer on the training set and using it on the test set. It is the same as for the classifier. That is why I suggested you have a look at building a pipeline, which exposes the fit and predict APIs; there it might be more obvious. Look at this example https://stackoverflow.com/questions/33091376/what-is-exactly-sklearn-pipeline-pipeline/33094099#33094099 – Simon Hawe Jan 13 '22 at 09:31
  • Thank you! I have solved this problem with vectorization and TF-IDF transformation. The accuracy has improved as well! – Hally Jan 16 '22 at 07:12
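The fit-on-train, use-on-test pattern from the comments can be sketched without a pipeline as follows. The two tiny review lists are hypothetical; `fit_transform` and `transform` are the standard vectorizer methods.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy split; in practice these come from train_test_split.
train_reviews = ["great product", "terrible product"]
test_reviews = ["great value"]

vec = CountVectorizer()

# fit_transform learns the vocabulary from the training reviews only.
x_train_vec = vec.fit_transform(train_reviews)

# transform reuses that vocabulary; it does not refit on the test reviews.
x_test_vec = vec.transform(test_reviews)

print(sorted(vec.vocabulary_))        # ['great', 'product', 'terrible']
print(x_test_vec.toarray().tolist())  # [[1, 0, 0]]
```

The word "value" appears only in the test review, so it has no column and is ignored. Both matrices have the same columns, which is what lets the classifier trained on `x_train_vec` predict on `x_test_vec`.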