I am trying to run a speed comparison between Python and R, and I am struggling with one issue: fitting LinearRegression in sklearn on data that contains categorical variables.

Code R:

# Start the clock!
ptm <- proc.time()
ptm

test_data = read.csv("clean_hold.out.csv")

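# Regression Model
# Use the bare column name as the response: with test_data$HH_F ~ . the dot
# would still expand to include HH_F itself among the predictors.
model_linear = lm(HH_F ~ ., data = test_data)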

# Stop the clock
new_ptm <- proc.time() - ptm 

Code Python:

import pandas as pd
import time

from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer

start = time.time()

test_data = pd.read_csv("./clean_hold.out.csv")

x_train = [col for col in test_data.columns[1:] if col != 'HH_F']
y_train = ['HH_F']

model_linear = LinearRegression(normalize=False)
model_linear.fit(test_data[x_train], test_data[y_train])

but it does not work for me:

return X.astype(np.float32 if X.dtype == np.int32 else np.float64)
ValueError: could not convert string to float: Bee True

I tried another approach:

test_data = pd.read_csv("./clean_hold.out.csv").to_dict()
v = DictVectorizer(sparse=False)
X = v.fit_transform(test_data)

However, I got another error:

File "C:\Anaconda32\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 258, in transform
    Xa[i, vocab[f]] = dtype(v)
TypeError: float() argument must be a string or a number

I don't understand how to resolve this issue in Python.

Example of data: http://screencast.com/t/hYyyu7nU9hQm

SpanishBoy
  • Without your source data, we are going to have a hard time debugging anything. You might also consider whether CrossValidated is a better fit. – TARehman Oct 05 '15 at 14:00
  • I added a very small part of the data. Please advise. – SpanishBoy Oct 05 '15 at 17:04
  • I think you might benefit from http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example - a screenshot of data is not super-useful in our case. – TARehman Oct 05 '15 at 17:35
  • I am interested in whether only R can handle all variable types, or whether Python/.NET can also process both numeric and categorical variables. As far as I can see, `sklearn` can't process categorical variables directly: once I pass only float/int variables, everything is OK. – SpanishBoy Oct 06 '15 at 07:43

1 Answer

I had to encode the categorical columns before calling fit.

Several classes can be used:

LabelEncoder: maps each distinct string to an integer code (0, 1, 2, ...)
OneHotEncoder: uses one-of-K encoding to turn each category into its own 0/1 indicator column
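To make the difference between the two encoders concrete, here is a minimal sketch. The `animals` array is made-up sample data, not taken from the CSV in the question.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical column (not the real data from the question)
animals = np.array(["Bee", "Cat", "Bee", "Dog"])

# LabelEncoder assigns one integer per distinct string (classes are sorted)
label_codes = LabelEncoder().fit_transform(animals)

# OneHotEncoder expects a 2-D input and returns a SciPy sparse matrix by
# default; .toarray() converts it to a dense array just for display
one_hot = OneHotEncoder().fit_transform(animals.reshape(-1, 1)).toarray()

print(label_codes)  # one integer code per row
print(one_hot)      # one 0/1 column per distinct category
```

Note that the integer codes imply an ordering (Bee < Cat < Dog) that a linear model will treat as meaningful, which is one reason to prefer the indicator columns for regression.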

I wanted a scalable solution but didn't get any answers. I chose OneHotEncoder, which binarizes all the strings. It is quite effective, but if a column has many distinct strings, the matrix grows very quickly and needs a lot of memory.
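A rough sketch of how the one-hot approach can be wired into the regression, assuming a recent scikit-learn that provides `ColumnTransformer`. The small DataFrame and the column names `species` and `size` are hypothetical stand-ins for the real file; only `HH_F` comes from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for pd.read_csv("./clean_hold.out.csv")
test_data = pd.DataFrame({
    "HH_F": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "species": ["Bee", "Cat", "Bee", "Dog", "Cat", "Dog"],
    "size": [0.5, 1.5, 0.7, 3.0, 1.4, 3.2],
})

X = test_data.drop(columns=["HH_F"])
y = test_data["HH_F"]

# Columns holding strings; assumed here, adjust to the real data
categorical = ["species"]

# One-hot encode the string columns and pass numeric columns through unchanged
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

model = make_pipeline(encode, LinearRegression())
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

On the memory concern: `OneHotEncoder` produces a SciPy sparse matrix by default, and `LinearRegression` accepts sparse input, so a column with many distinct strings does not necessarily have to be expanded into a dense array.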

SpanishBoy