
Converting a pandas data frame with mixed column types (numerical, ordinal, and categorical) to SciPy sparse arrays is a central problem in machine learning.

If my pandas data frame consists of only numerical data, I can convert it to a sparse CSR matrix as follows:

scipy.sparse.csr_matrix(df.values)

If my data frame consists of ordinal data types, I can handle them using LabelEncoder:

from collections import defaultdict
from sklearn.preprocessing import LabelEncoder

# one LabelEncoder per column, created on first access
d = defaultdict(LabelEncoder)
fit = df.apply(lambda x: d[x.name].fit_transform(x))

Then I can convert the encoded frame the same way, and the problem is solved:

scipy.sparse.csr_matrix(fit.values)
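
One caveat with the ordinal step: LabelEncoder assigns codes in sorted label order, which may not match the intended ranking. The following is a minimal sketch of an alternative that spells the order out with pandas categoricals. The column names, the sample values, and the `orders` mapping are hypothetical and only illustrate the idea.

```python
import pandas as pd
import scipy.sparse

# Hypothetical ordinal data; column names and values are made up.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "grade": ["B", "A", "C", "A"],
})

# Assumed ranking for each column, lowest to highest.
orders = {
    "size": ["small", "medium", "large"],
    "grade": ["C", "B", "A"],
}

# .codes gives the position of each value in its category list.
encoded = pd.DataFrame({
    col: pd.Categorical(df[col], categories=cats, ordered=True).codes
    for col, cats in orders.items()
})

X_ordinal = scipy.sparse.csr_matrix(encoded.values)
```

Values that are missing from the category list get the code -1, so it is worth checking for negative entries before training.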

Categorical variables with a low number of values are also not a concern. They can easily be handled with one-hot encoding, using either pd.get_dummies in pandas or OneHotEncoder in scikit-learn.

My main concern is for categorical variables with a large number of values.

The main problem: How to handle categorical variables with a large number of values?

pd.get_dummies(train_set, columns=[categorical_columns_with_large_number_of_values], sparse=True)

takes a lot of time.

This question seems to give interesting directions, but it is not clear whether it handles all the data types efficiently.

Let me know if you know an efficient way. Thanks.

learner

1 Answer

You can convert any single column to a sparse COO array very easily with factorize. This will be MUCH faster than building a giant dense dataframe.

import numpy as np
import pandas as pd
import scipy.sparse

data = pd.DataFrame({"A": ["1", "2", "A", "C", "A"]})

c, u = pd.factorize(data['A'])
n, m = data.shape[0], u.shape[0]

one_hot = scipy.sparse.coo_matrix((np.ones(n, dtype=np.int16), (np.arange(n), c)), shape=(n,m))

You'll get something that looks like this:

>>> one_hot.toarray()
array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]], dtype=int16)

>>> u
Index(['1', '2', 'A', 'C'], dtype='object')

Here the rows are your dataframe rows and the columns are the factors of your column (u holds the labels for those columns, in order).
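
As a hedged sketch, the same idea can be wrapped in a small helper. The function name `one_hot_sparse` is hypothetical. It assumes the default behaviour of pd.factorize, which marks missing values with the code -1; those rows are left as all zeros here because a negative column index is not valid in a sparse matrix.

```python
import numpy as np
import pandas as pd
import scipy.sparse


def one_hot_sparse(series):
    """One-hot encode a single column as a CSR matrix (hypothetical helper)."""
    codes, uniques = pd.factorize(series)
    # factorize marks missing values with -1; skip those rows.
    keep = codes >= 0
    rows = np.flatnonzero(keep)
    cols = codes[keep]
    data = np.ones(rows.shape[0], dtype=np.int8)
    mat = scipy.sparse.csr_matrix(
        (data, (rows, cols)), shape=(len(series), len(uniques))
    )
    return mat, uniques


mat, labels = one_hot_sparse(pd.Series(["x", None, "y", "x"]))
```

The matrix stores at most one entry per row, so memory grows with the number of rows rather than with rows times unique values, as long as it is never converted to a dense array.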

CJR
  • How to combine multiple columns in one sparse matrix, then? ML models take in one sparse matrix. – learner May 19 '20 at 23:06
  • This solution does not work. For my single column with `14350959` unique values and `133267714` rows, it says: `MemoryError: Unable to allocate 3.40 PiB for an array with shape (133267714, 14350959) and data type int16`. Handling multiple columns is another problem. – learner May 19 '20 at 23:41
  • You could `vstack` your encoded arrays but you should very carefully consider what the encoding means and how it works with your ML method. You do not have the memory to make a dense representation of this array and it will fail if you call `array.A`. – CJR May 20 '20 at 15:40
  • yes, I see your point. But, I think you meant `hstack` (`scipy.sparse.hstack`), no? – learner May 21 '20 at 02:32
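
To illustrate the `scipy.sparse.hstack` suggestion from the comments, here is a minimal sketch that places a numeric column and two factorized columns side by side in one sparse matrix. The data frame, its column names, and the variable `blocks` are hypothetical, and the example assumes the categorical columns have no missing values.

```python
import numpy as np
import pandas as pd
import scipy.sparse

# Hypothetical mixed-type data.
df = pd.DataFrame({
    "price": [1.5, 0.0, 3.0],
    "city": ["Oslo", "Rome", "Oslo"],
    "color": ["red", "red", "blue"],
})
n = len(df)

# Numeric columns go in as they are.
blocks = [scipy.sparse.csr_matrix(df[["price"]].values)]

# Each categorical column becomes its own sparse one-hot block.
for col in ["city", "color"]:
    codes, uniques = pd.factorize(df[col])
    block = scipy.sparse.coo_matrix(
        (np.ones(n), (np.arange(n), codes)), shape=(n, len(uniques))
    )
    blocks.append(block)

# Stack the blocks column-wise without ever building a dense array.
X = scipy.sparse.hstack(blocks, format="csr")
```

The result has one row per data frame row and one column per numeric feature or category level, so the order of `blocks` determines the column layout.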