I have around 400 million rows of data, and some of the features are categorical. I apply pandas.get_dummies
to one-hot encode them, and I have to pass the sparse=True
option because the data is fairly large (otherwise an exception is raised).
# drop unused columns, then one-hot encode the categorical ids as sparse columns
result = result.drop(columns=["time", "Ds"])
result_encoded = pd.get_dummies(result, columns=["id1", "id2", "id3", "id4"], sparse=True)
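For reference, here is a toy version of the same call (made-up values), showing that the dummy columns come back as pandas SparseArray-backed columns while everything else stays dense:

import pandas as pd

toy = pd.DataFrame({"id1": ["a", "b", "a"], "count": [1, 2, 3]})
toy_encoded = pd.get_dummies(toy, columns=["id1"], sparse=True)
# dummy columns are Sparse[uint8] (Sparse[bool] on newer pandas); 'count' stays dense
print(toy_encoded.dtypes)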
This gives me a sparse DataFrame (result_encoded) with 9000 features. After that, I want to run a ridge regression on the data. At first, I tried to feed dataframe.values
into sklearn:
train_data = result_encoded.drop(columns=['count']).values
but this raised the error "array is too big". Then I fed the sparse DataFrame to sklearn directly, and a similar error message appeared again:
train_data = result_encoded.drop(columns=['count'])
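To get a sense of scale, a quick back-of-envelope calculation shows why the dense path cannot work (assuming float64 cells):

# 400 million rows x 9000 one-hot features, 8 bytes per float64 cell
n_rows, n_cols = 400_000_000, 9_000
print(n_rows * n_cols * 8 / 1e12)  # ~28.8 TB, far beyond any single machine's RAM

So I understand why calling .values blows up; what I don't understand is why the sparse DataFrame hits a similar limit.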
Do I need to consider a different method or preparation of the data so it can be used by sklearn directly?
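For example, would something along these lines be a better direction? This is only a sketch: it replaces get_dummies with scikit-learn's OneHotEncoder, which produces a scipy CSR matrix directly (assuming a scikit-learn version whose OneHotEncoder accepts string categories), and Ridge can fit on sparse input:

from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder

# encode the categorical ids straight to a scipy sparse matrix,
# never materialising a dense 400M x 9000 array
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(result[["id1", "id2", "id3", "id4"]])
y = result["count"].values

model = Ridge(alpha=1.0)
model.fit(X, y)  # Ridge accepts scipy sparse matrices

(Any remaining numeric feature columns would presumably need to be attached with scipy.sparse.hstack.)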