
I have a set of 48 feature columns and one binary classification target. After transforming the categorical columns to numeric (with one-hot encoding or similar), I can fit all the usual algorithms: linear and logistic regression, KNN, random forest, and boosting classifiers. But when I run algorithms like random forest and decision tree on the raw data, without any categorical-to-numerical transformation, I get the error "ValueError: could not convert string to float ...".

I am trying to build a base model without any changes to the data; please guide me.

print(type(X))  # <class 'pandas.core.frame.DataFrame'>
print(type(y))  # <class 'pandas.core.series.Series'>



from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(X, y, random_state=0)

randomforest = RandomForestClassifier()
randomforest.fit(X_train_rf, y_train_rf)
y_train_pred_rf = randomforest.predict(X_train_rf)
y_pred_rf = randomforest.predict(X_test_rf)

print('training accuracy', accuracy_score(y_train_rf, y_train_pred_rf))
print('test accuracy', accuracy_score(y_test_rf, y_pred_rf))

# The output obtained is:

ValueError: could not convert string to float: 'Delhi'  # 'Delhi' is a value in one of the feature columns
    Possible duplicate of [Passing categorical data to Sklearn Decision Tree](https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree) – PV8 Aug 19 '19 at 10:40
  • Yes, having read the above, is there any way other than one-hot encoding to pass categorical variables into random forest or decision tree algorithms? So, is one-hot encoding a must? – Ayyasamy Aug 19 '19 at 10:46
  • https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html – PV8 Aug 19 '19 at 10:49
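Following up on the LabelEncoder link above: for tree-based models, an ordinal (integer) encoding is a common alternative to one-hot encoding, since trees only split on thresholds and do not treat the integer codes as meaningful magnitudes. A minimal sketch on toy data (the `city` and `amount` columns and their values are made up for illustration, standing in for the real 48-column frame):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Toy frame: one string column plus one numeric column (hypothetical names).
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai"] * 25,
    "amount": range(100),
})
y = pd.Series([0, 1] * 50)

# OrdinalEncoder maps each category to an integer, column-wise.
enc = OrdinalEncoder()
X = df.copy()
X[["city"]] = enc.fit_transform(df[["city"]])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
```

Note that unseen categories at prediction time will make `transform` raise unless you configure `handle_unknown`, so the encoder must be fitted on data that covers all expected categories.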

1 Answer

0

You can use the python-weka-wrapper, which handles nominal (categorical) attributes natively, so you will not need one-hot encoding. Example:

import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier

def get_weka_prob(inst):
    # Probability that classifier c assigns to the 'DONE' class for this instance.
    dist = c.distribution_for_instance(inst)
    p = dist[next((i for i, x in enumerate(inst.class_attribute.values) if x == 'DONE'), -1)]
    return p

jvm.start()

# CSVLoader turns string columns into nominal attributes automatically.
loader = Loader(classname="weka.core.converters.CSVLoader")
data = loader.load_file(r'.\recs_csv\df.csv')
data.class_is_last()  # the last column is the class (target) attribute

datatst = loader.load_file(r'.\recs_csv\dftst.csv')
datatst.class_is_last()

# J48 is Weka's C4.5 decision tree; -C sets the pruning confidence factor.
c = Classifier("weka.classifiers.trees.J48", options=["-C", "0.1"])

c.build_classifier(data)
print(c)
probstst = [get_weka_prob(inst) for inst in datatst]

jvm.stop()
  • Thanks for the response, I will check that. Can this be applied to all columns in the data frame and then used with the sklearn algorithms? – Ayyasamy Aug 20 '19 at 03:48
  • Weka models are different models that use a java bridge to python - the methods are java methods that can be called using this bridge. To use the dataframe in sklearn - you would have to manipulate it with one-hot encoding. Note that the nominal attributes in weka cannot have any special character in them. so use df = df.replace([',', '"', "'", "%", ";"], '', regex=True) in any nominal attribute before saving it to csv. – Roee Anuar Aug 20 '19 at 05:08
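As the last comment says, staying within sklearn means encoding the string columns yourself. A minimal one-hot sketch with `pd.get_dummies` (the `city`, `amount`, and `target` columns are hypothetical stand-ins for the real data):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical frame mixing a string column with numeric ones.
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Chennai", "Delhi"],
    "amount": [10, 20, 30, 40],
    "target": [0, 1, 0, 1],
})

# get_dummies replaces each string column with one indicator column per category.
X = pd.get_dummies(df.drop(columns="target"))
y = df["target"]

clf = RandomForestClassifier(random_state=0).fit(X, y)
```

One caveat with `get_dummies`: train and test sets encoded separately can end up with different columns if a category is missing from one of them, so it is safer to encode once and split afterwards, or use sklearn's `OneHotEncoder` inside a pipeline.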