
I have a dataset like:

   profile     category  target
0        1      [5, 10]       1
1        2          [1]       0
2        3   [23, 5000]       1
3        4  [700, 4500]       0

How do I handle the category feature? This table may have other additional features too. One-hot encoding consumes too much space, because the number of rows is around 10 million. Any suggestion would be helpful.

GIRISH kuniyal

2 Answers


My idea would be to split this array into new columns:
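A minimal sketch of that split, assuming the data sits in a pandas DataFrame (the toy frame and names below mirror the question, not any real code):

```python
import pandas as pd

# Toy frame mirroring the question's data (names are assumptions)
df = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "category": [[5, 10], [1], [23, 5000], [700, 4500]],
    "target": [1, 0, 1, 0],
})

# One column per list position; shorter lists are padded with NaN
split = pd.DataFrame(df["category"].tolist(), index=df.index)
df = pd.concat([df[["profile"]], split, df["target"]], axis=1)
```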

this would lead to the following dataframe:

   profile    0       1  target
0        1    5    10.0       1
1        2    1     NaN       0
2        3   23  5000.0       1
3        4  700  4500.0       0

In the next step you can adjust it so that the categories become features (filled with 1 if the profile has that category). Based on this, you get the following dataframe:

   profile     1  ...  5  ... 10 ... 23  target
0        1     0       1       1      0       1
1        2     1       0       0      0       0
2        3     0       0       0      1       1
3        4     0       0       0      0       0
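The indicator step above can be sketched with pandas' `explode` and `get_dummies` (again a sketch on the toy data from the question):

```python
import pandas as pd

# Same toy frame as in the question (names are assumptions)
df = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "category": [[5, 10], [1], [23, 5000], [700, 4500]],
    "target": [1, 0, 1, 0],
})

# One row per (profile, category) pair, then one indicator column per category
exploded = df["category"].explode().astype(int)
indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
result = df[["profile"]].join(indicators).join(df["target"])
```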

You will have every category as a feature, which can help you (it is then similar to text classification problems). Then you can use some dimensionality reduction technique such as PCA.

With this approach you respect the category behavior and can reduce the dimensionality later on with some mathematical techniques.
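Since the indicator matrix is mostly zeros, it can be kept sparse, and scikit-learn's TruncatedSVD accepts sparse input directly, which makes it a common stand-in for PCA here. A sketch on made-up data (the sizes roughly match the question's 5,000 categories):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Made-up sparse indicator matrix: 10,000 rows x 5,000 category columns
X = sparse_random(10_000, 5_000, density=0.001, format="csr", random_state=0)

# TruncatedSVD works on sparse matrices, so the one-hot matrix
# never has to be densified
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)  # dense array of shape (10_000, 50)
```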

PV8
  • you are doing a kind of "one-hot encoding", where a lot of memory is required. – GIRISH kuniyal Nov 20 '19 at 08:54
  • I am doing that, but the memory will be saved if you do PCA, and you could remove all the categories where you only have 0... or always 1 – PV8 Nov 20 '19 at 08:56
  • but what if RAM is not enough to hold those one-hot encoded features? The number of rows is 10 million, and 5,000 features are added for each row. – GIRISH kuniyal Nov 20 '19 at 08:58
  • the first two steps you could easily split by the number of rows and run in batches; the PCA you could also fit on only some rows and then transform the other rows with it (you will lose some accuracy). With this approach you can split everything into batches – PV8 Nov 20 '19 at 09:02
  • Interpretability is heavily impacted and accuracy too. – GIRISH kuniyal Nov 20 '19 at 10:44
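The batch-fitting idea from the comments can be sketched with scikit-learn's IncrementalPCA, which updates the decomposition chunk by chunk instead of needing all rows in memory at once (the data and batch size below are made up):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.random((1_000, 200))  # made-up feature matrix

# Fit the PCA incrementally, 100 rows at a time, then transform everything
ipca = IncrementalPCA(n_components=10)
for start in range(0, X.shape[0], 100):
    ipca.partial_fit(X[start:start + 100])
X_reduced = ipca.transform(X)
```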

MultiLabelBinarizer is a solution for this kind of problem: it gives sparse output that is low in memory. You can convert the other features to a sparse matrix as well and then combine all features to feed into a machine learning model.
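A sketch of that on the question's toy data (the frame and column names are assumptions; `sparse_output=True` is what keeps memory low):

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import MultiLabelBinarizer

# Toy frame mirroring the question (names are assumptions)
df = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "category": [[5, 10], [1], [23, 5000], [700, 4500]],
    "target": [1, 0, 1, 0],
})

# sparse_output=True keeps the binarized categories as a sparse matrix
mlb = MultiLabelBinarizer(sparse_output=True)
X_cat = mlb.fit_transform(df["category"])

# Convert the remaining features to sparse and stack everything together
X_other = csr_matrix(df[["profile"]].to_numpy())
X = hstack([X_other, X_cat])
```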

source

GIRISH kuniyal