
I have a dataset like:

   profile     category  target
0        1      [5, 10]       1
1        2          [1]       0
2        3   [23, 5000]       1
3        4  [700, 4500]       0

How do I handle the category feature? This table may have other additional features too. One-hot encoding consumes too much space, because the number of rows is around 10 million. Any suggestion would be helpful.

GIRISH kuniyal

2 Answers


My idea would be to split this array into new columns:
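A minimal sketch of that split, assuming the data sits in a pandas DataFrame (the toy frame and names below mirror the question, not any real code):

```python
import pandas as pd

# Toy frame mirroring the question's data (names are assumptions)
df = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "category": [[5, 10], [1], [23, 5000], [700, 4500]],
    "target": [1, 0, 1, 0],
})

# One column per list position; shorter lists are padded with NaN
split = pd.DataFrame(df["category"].tolist(), index=df.index)
df = pd.concat([df[["profile"]], split, df["target"]], axis=1)
```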

this would lead to the following dataframe:

   profile    0       1  target
0        1    5    10.0       1
1        2    1     NaN       0
2        3   23  5000.0       1
3        4  700  4500.0       0

In the next step you can adjust it so that the categories become features (filled with 1 if the profile has that category). Based on this, you get the following dataframe:

   profile     1  ...  5  ... 10 ... 23  target
0        1     0       1       1      0       1
1        2     1       0       0      0       0
2        3     0       0       0      1       1
3        4     0       0       0      0       0
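The indicator step above can be sketched with pandas' `explode` and `get_dummies` (again a sketch on the toy data from the question):

```python
import pandas as pd

# Same toy frame as in the question (names are assumptions)
df = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "category": [[5, 10], [1], [23, 5000], [700, 4500]],
    "target": [1, 0, 1, 0],
})

# One row per (profile, category) pair, then one indicator column per category
exploded = df["category"].explode().astype(int)
indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
result = df[["profile"]].join(indicators).join(df["target"])
```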

You will have every category as a feature, which can help you (it is then similar to text classification problems). Then you can use some dimensionality reduction technique such as PCA.

With this approach you respect the category behavior and can reduce the dimensionality later on with some mathematical techniques.
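Since the indicator matrix is mostly zeros, it can be kept sparse, and scikit-learn's TruncatedSVD accepts sparse input directly, which makes it a common stand-in for PCA here. A sketch on made-up data (the sizes roughly match the question's 5,000 categories):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Made-up sparse indicator matrix: 10,000 rows x 5,000 category columns
X = sparse_random(10_000, 5_000, density=0.001, format="csr", random_state=0)

# TruncatedSVD works on sparse matrices, so the one-hot matrix
# never has to be densified
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)  # dense array of shape (10_000, 50)
```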

PV8
  • you are doing a kind of "one-hot encoding", where a lot of memory is required. – GIRISH kuniyal Nov 20 '19 at 08:54
  • I am doing that, but the memory will be saved if you do PCA, and you could remove all the categories where you only have 0... or always 1 – PV8 Nov 20 '19 at 08:56
  • but what if RAM is not enough to hold those one-hot encoded features? The number of rows is 10 million, and 5,000 features are added for each row. – GIRISH kuniyal Nov 20 '19 at 08:58
  • the first two steps you could easily split by the number of rows and run in batches; the PCA you could also fit on only some rows and then transform the other rows with it (you will lose some accuracy). With this approach you can split everything into batches – PV8 Nov 20 '19 at 09:02
  • Interpretability is heavily impacted and accuracy too. – GIRISH kuniyal Nov 20 '19 at 10:44
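The batch-fitting idea from the comments can be sketched with scikit-learn's IncrementalPCA, which updates the decomposition chunk by chunk instead of needing all rows in memory at once (the data and batch size below are made up):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.random((1_000, 200))  # made-up feature matrix

# Fit the PCA incrementally, 100 rows at a time, then transform everything
ipca = IncrementalPCA(n_components=10)
for start in range(0, X.shape[0], 100):
    ipca.partial_fit(X[start:start + 100])
X_reduced = ipca.transform(X)
```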

MultiLabelBinarizer is a solution for this kind of problem: it gives sparse output that is low in memory. You can convert the other features to a sparse matrix as well and then combine all features to feed into a machine learning model.
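A sketch of that on the question's toy data (the frame and column names are assumptions; `sparse_output=True` is what keeps memory low):

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import MultiLabelBinarizer

# Toy frame mirroring the question (names are assumptions)
df = pd.DataFrame({
    "profile": [1, 2, 3, 4],
    "category": [[5, 10], [1], [23, 5000], [700, 4500]],
    "target": [1, 0, 1, 0],
})

# sparse_output=True keeps the binarized categories as a sparse matrix
mlb = MultiLabelBinarizer(sparse_output=True)
X_cat = mlb.fit_transform(df["category"])

# Convert the remaining features to sparse and stack everything together
X_other = csr_matrix(df[["profile"]].to_numpy())
X = hstack([X_other, X_cat])
```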

source

GIRISH kuniyal