Converting a pandas DataFrame with mixed column types -- numerical, ordinal, and categorical -- to a SciPy sparse matrix is a very common task in machine learning.
Now, if my DataFrame contains only numerical data, I can simply do the following to convert it to a sparse CSR matrix:
scipy.sparse.csr_matrix(df.values)
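(For reference, here is a fully self-contained toy version of that step, with made-up data:)

import pandas as pd
import scipy.sparse

# toy all-numeric DataFrame (made-up data)
df = pd.DataFrame({'a': [1.0, 0.0, 2.0], 'b': [0.0, 3.0, 0.0]})

X = scipy.sparse.csr_matrix(df.values)
print(X.nnz)  # 3 non-zero entries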
And if my DataFrame has ordinal columns, I can handle them with a per-column LabelEncoder:
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder

# one LabelEncoder per column, keyed by column name
d = defaultdict(LabelEncoder)
fit = df.apply(lambda x: d[x.name].fit_transform(x))
Then I can again convert the result to a sparse CSR matrix, and the problem is solved:
scipy.sparse.csr_matrix(fit.values)
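A nice side effect of keeping the fitted encoders in d is that the same mapping can later be reused on unseen data (sketch; df_test is a hypothetical frame with the same columns):

# reuse the fitted encoders on new data with the same columns
test_encoded = df_test.apply(lambda x: d[x.name].transform(x))
X_test = scipy.sparse.csr_matrix(test_encoded.values)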
Categorical variables with a low number of distinct values are also not a concern: they can easily be handled with pd.get_dummies, or with scikit-learn's OneHotEncoder.
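For example, something like this seems to work on a toy low-cardinality column (my understanding is that the DataFrame-level .sparse accessor requires every column to be sparse, which holds here because all columns are dummies):

import pandas as pd

# toy low-cardinality example (made-up data)
df = pd.DataFrame({'color': ['red', 'green', 'red', 'blue']})
dummies = pd.get_dummies(df, columns=['color'], sparse=True)

# every column is a sparse dummy column, so .sparse.to_coo() applies
X = dummies.sparse.to_coo().tocsr()
print(X.shape)  # (4, 3)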
My main concern is categorical variables with a large number of distinct values.
The main problem: how to handle these high-cardinality categorical variables efficiently?
pd.get_dummies(train_set, columns=categorical_columns_with_large_number_of_values, sparse=True)
takes a lot of time.
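For reference, one direction I have been experimenting with is skipping get_dummies entirely and building the CSR matrix directly from pd.factorize codes. A rough sketch (it assumes no missing values, since factorize codes NaN as -1; high_card_cols is a hypothetical list of column names):

import numpy as np
import pandas as pd
import scipy.sparse

def onehot_csr(col):
    # map each distinct value in the column to an integer code
    codes, uniques = pd.factorize(col)
    n = len(col)
    # exactly one non-zero (a 1) per row, at position (row, code)
    return scipy.sparse.csr_matrix(
        (np.ones(n), (np.arange(n), codes)),
        shape=(n, len(uniques)))

# one-hot encode each high-cardinality column, then stack side by side
X = scipy.sparse.hstack(
    [onehot_csr(train_set[c]) for c in high_card_cols],
    format='csr')

This avoids materializing dense dummies, but I am not sure whether it is the idiomatic approach, and it does not keep column names.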
This question seems to give some interesting directions, but it is not clear whether it handles all the data types efficiently.
Let me know if you know of an efficient way. Thanks.