
I have a set of 48 feature columns and one binary classification target. After transforming the categorical columns to numeric (with one-hot encoding or similar), I can fit all the usual algorithms: linear and logistic regression, KNN, random forest, and boosting classifiers. But when I run algorithms like random forest and decision tree on the raw data, without any categorical-to-numerical transformation, I get the error "ValueError: could not convert string to float ...".

I am trying to build a base model without any changes to the data; please guide me.

print(type(X))  # <class 'pandas.core.frame.DataFrame'>
print(type(y))  # <class 'pandas.core.series.Series'>



from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(X, y, random_state=0)

randomforest = RandomForestClassifier()
randomforest.fit(X_train_rf, y_train_rf)
y_train_pred_rf = randomforest.predict(X_train_rf)
y_pred_rf = randomforest.predict(X_test_rf)

print('training accuracy', accuracy_score(y_train_rf, y_train_pred_rf))
print('test accuracy', accuracy_score(y_test_rf, y_pred_rf))

# The output obtained is:

ValueError: could not convert string to float: 'Delhi'  # 'Delhi' is a value in one of the feature columns
    Possible duplicate of [Passing categorical data to Sklearn Decision Tree](https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree) – PV8 Aug 19 '19 at 10:40
  • Yes, having read the above, is there any way other than one-hot encoding to pass categorical variables into random forest or decision tree algorithms? So, is one-hot encoding a must? – Ayyasamy Aug 19 '19 at 10:46
  • https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html – PV8 Aug 19 '19 at 10:49
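Following up on the LabelEncoder link above: for tree-based models, an ordinal (integer) encoding is a common alternative to one-hot encoding, since trees only split on thresholds and do not treat the integer codes as meaningful magnitudes. A minimal sketch on toy data (the `city` and `amount` columns and their values are made up for illustration, standing in for the real 48-column frame):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Toy frame: one string column plus one numeric column (hypothetical names).
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai"] * 25,
    "amount": range(100),
})
y = pd.Series([0, 1] * 50)

# OrdinalEncoder maps each category to an integer, column-wise.
enc = OrdinalEncoder()
X = df.copy()
X[["city"]] = enc.fit_transform(df[["city"]])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
```

Note that unseen categories at prediction time will make `transform` raise unless you configure `handle_unknown`, so the encoder must be fitted on data that covers all expected categories.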

1 Answer

0

You can use the python-weka-wrapper, which handles nominal (categorical) attributes natively, so you will not need one-hot encoding. Example:

import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier

def get_weka_prob(inst):
    # Probability that classifier c assigns to the 'DONE' class for this instance.
    dist = c.distribution_for_instance(inst)
    p = dist[next((i for i, x in enumerate(inst.class_attribute.values) if x == 'DONE'), -1)]
    return p

jvm.start()

# CSVLoader turns string columns into nominal attributes automatically.
loader = Loader(classname="weka.core.converters.CSVLoader")
data = loader.load_file(r'.\recs_csv\df.csv')
data.class_is_last()  # the last column is the class (target) attribute

datatst = loader.load_file(r'.\recs_csv\dftst.csv')
datatst.class_is_last()

# J48 is Weka's C4.5 decision tree; -C sets the pruning confidence factor.
c = Classifier("weka.classifiers.trees.J48", options=["-C", "0.1"])

c.build_classifier(data)
print(c)
probstst = [get_weka_prob(inst) for inst in datatst]

jvm.stop()
  • Thanks for the response, I will check that. Can this be applied to all columns in the data frame and then used with the sklearn algorithms? – Ayyasamy Aug 20 '19 at 03:48
  • Weka models are different models that use a java bridge to python - the methods are java methods that can be called using this bridge. To use the dataframe in sklearn - you would have to manipulate it with one-hot encoding. Note that the nominal attributes in weka cannot have any special character in them. so use df = df.replace([',', '"', "'", "%", ";"], '', regex=True) in any nominal attribute before saving it to csv. – Roee Anuar Aug 20 '19 at 05:08
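As the last comment says, staying within sklearn means encoding the string columns yourself. A minimal one-hot sketch with `pd.get_dummies` (the `city`, `amount`, and `target` columns are hypothetical stand-ins for the real data):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical frame mixing a string column with numeric ones.
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Chennai", "Delhi"],
    "amount": [10, 20, 30, 40],
    "target": [0, 1, 0, 1],
})

# get_dummies replaces each string column with one indicator column per category.
X = pd.get_dummies(df.drop(columns="target"))
y = df["target"]

clf = RandomForestClassifier(random_state=0).fit(X, y)
```

One caveat with `get_dummies`: train and test sets encoded separately can end up with different columns if a category is missing from one of them, so it is safer to encode once and split afterwards, or use sklearn's `OneHotEncoder` inside a pipeline.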