Dummify variables based on list of categories

Question

For example, I have dataframe as per below.

>>> df
  CATX CATY CATZ
0    A    G  AAA
1    B    H  BBB
2    C    I  AAA
3    B    J  CCC
4    A    G  BBB
5    B    H  DDD
6    D    K  EEE
7    E    L  FFF

I want to add new columns on the df based on a values provided in a list. For example, for CATZ, i have a list ['AAA', 'BBB'] that I need to consider to indicate that the obervation is 1 or 0 e.g.

>>> df
  CATX CATY CATZ  AAA  BBB
0    A    G  AAA    1    0
1    B    H  BBB    0    1
2    A    I  AAA    1    0
3    B    J  CCC    0    0
4    A    H  BBB    0    1
5    B    H  DDD    0    0
6    D    K  EEE    0    0
7    E    L  FFF    0    0

This is a bit different from pd.get_dummies as get_dummies considers all the possible values (or k-1 values) on your whole dataframe/column. Currently, what I am doing is to loop through the list and execute apply for every row.

for catz_item in catz_list:
    df[catz_item] = df.apply(lambda x: 1 if x.CATZ == catz_item else 0, axis=1)

Is there any other way to do this aside from iterating through the list (as this loop is a bit slow). To make it more complicated, I am also doing this using combination of CATX and CATY based on a certain list as well, for example [['A', 'G'], ['A', 'H'], ['B', 'H']].

--- edit ---

output with combination of CATX / CATY

>>> df
  CATX CATY CATZ  AAA  BBB  AG  AH  BH
0    A    G  AAA    1    0   1   0   0
1    B    H  BBB    0    1   0   0   1
2    C    I  AAA    1    0   0   0   0
3    B    J  CCC    0    0   0   0   0
4    A    G  BBB    0    1   1   0   0
5    B    H  DDD    0    0   0   0   1
6    D    K  EEE    0    0   0   0   0
7    E    L  FFF    0    0   0   0   0

Code that I am using right now is as per below

catxy_list = [['A', 'G'], ['A', 'H'], ['B', 'H']]   
for catxy_item in catxy_list:
    df[catxy_item[0] + catxy_item[1]] = df.apply(lambda x: 1 if x.CATX == catxy_item[0] and x.CATY == catxy_item[1] else 0, axis=1)

Can you elaborate on the "to make it more complicated" part? What is your expected output there? — cs95, Jan 09 '18 at 16:28
@cᴏʟᴅsᴘᴇᴇᴅ, see my latest edit for the "complicated" part. Thanks. — oim, Jan 09 '18 at 16:35
Thanks, that was helpful. I've added that to my answer if you want it. — cs95, Jan 09 '18 at 16:38
Thanks for the answers, I will do some test and accept the answer that suits my needs as I need to do this for ~10M observations. — oim, Jan 09 '18 at 16:45
Hmm... you might suffer from a memory blowout when running my code for the complex query... thanks for that. I'll try to see if I can come up with a slower, but more memory efficient technique. — cs95, Jan 09 '18 at 16:47

cs95 · Accepted Answer · 2018-01-09T17:01:30.650

For the first bit involving CATZ, you can use where/mask + get_dummies -

v = df.CATZ.where(df.CATZ.isin(['AAA', 'BBB'])).str.get_dummies()

Or,

v = df.CATZ.mask(~df.CATZ.isin(['AAA', 'BBB'])).str.get_dummies()

Next, for your "complicated query", make use of numpy broadcasting with an astype conversion at the end -

# initial values to compare
i = np.array([['A', 'G'], ['A', 'H'], ['B', 'H']])
# perform broadcasted comparison with `i` and convert the result to OHEs
j = (df.iloc[:, :-1].values[:, None] == i).all(2).astype(int)

# load the result into a dataframe  with the appropriate column names 
j = pd.DataFrame(j, columns=list(map(''.join, i)))

Another slower, but more memory efficient option for your "complicated query" would be looping over each category and finding the OHE:

ohe = []
for x, y in [['A', 'G'], ['A', 'H'], ['B', 'H']]:
    # generate OHEs for each pair of elements per category
    s = df.CATX.eq(x) & df.CATY.eq(y)  # s = df[['CATX', 'CATY']].isin([x, y]).all(1)
    s.name = ''.join([x, y])
    ohe.append(s)

# concatenate the intermediate results
j = pd.concat(ohe, 1).astype(int)

Finally, you can load df, v, and j into a new dataframe using concat.

pd.concat([df, v, j], 1)

  CATX CATY CATZ  AAA  BBB  AG  AH  BH
0    A    G  AAA    1    0   1   0   0
1    B    H  BBB    0    1   0   0   1
2    C    I  AAA    1    0   0   0   0
3    B    J  CCC    0    0   0   0   0
4    A    G  BBB    0    1   1   0   0
5    B    H  DDD    0    0   0   0   1
6    D    K  EEE    0    0   0   0   0
7    E    L  FFF    0    0   0   0   0

score 2 · Answer 2 · answered Jan 09 '18 at 16:25

In [403]: df.join(df.CATZ.str.get_dummies())
Out[403]:
  CATX CATY CATZ  AAA  BBB  CCC  DDD  EEE  FFF
0    A    G  AAA    1    0    0    0    0    0
1    B    H  BBB    0    1    0    0    0    0
2    C    I  AAA    1    0    0    0    0    0
3    B    J  CCC    0    0    1    0    0    0
4    A    G  BBB    0    1    0    0    0    0
5    B    H  DDD    0    0    0    1    0    0
6    D    K  EEE    0    0    0    0    1    0
7    E    L  FFF    0    0    0    0    0    1

or:

In [410]: df.join(df.CATZ[df.CATZ.isin(['AAA','BBB'])].str.get_dummies(), how='left').fillna(0)
Out[410]:
  CATX CATY CATZ  AAA  BBB
0    A    G  AAA  1.0  0.0
1    B    H  BBB  0.0  1.0
2    C    I  AAA  1.0  0.0
3    B    J  CCC  0.0  0.0
4    A    G  BBB  0.0  1.0
5    B    H  DDD  0.0  0.0
6    D    K  EEE  0.0  0.0
7    E    L  FFF  0.0  0.0

I can't see how the first part of the answer is any helpful. — deadcode, Jan 09 '18 at 16:34
@deadcode It doesn't hurt to keep it there for those having some similar sort of issue and needing that solution instead. — cs95, Jan 09 '18 at 16:35
Agreed. I just wanted to know if it's doing anything other than just get_dummies. It's a one step way to concat I guess. — deadcode, Jan 09 '18 at 16:38

score 2 · Answer 3 · answered Jan 09 '18 at 16:32

2

pd.crosstab(df.index,df.CATZ)[['AAA','BBB']]
Out[66]: 
CATZ   AAA  BBB
row_0          
0        1    0
1        0    1
2        1    0
3        0    0
4        0    1
5        0    0
6        0    0
7        0    0

pd.concat([df,pd.crosstab(df.index,df.CATZ)[['AAA','BBB']]],1)
Out[68]: 
      CATX CATY CATZ  AAA  BBB
row_0                         
0        A    G  AAA    1    0
1        B    H  BBB    0    1
2        C    I  AAA    1    0
3        B    J  CCC    0    0
4        A    G  BBB    0    1
5        B    H  DDD    0    0
6        D    K  EEE    0    0
7        E    L  FFF    0    0

answered Jan 09 '18 at 16:32

BENY

317,841
20
164
234

Learn a new method! – Tai Jan 09 '18 at 16:38
@Tai good material for you, https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe/47152692#47152692 – BENY Jan 09 '18 at 16:39

Tai · Answer 4 · 2018-01-09T16:43:14.567

1

Think you need two np.where.

df["AAA"] = np.where(df["CATZ"] == "AAA", 1, 0)
df["BBB"] = np.where(df["CATZ"] == "BBB", 1, 0)

This can be slow when labels are a lot!

edited Jan 09 '18 at 16:43

answered Jan 09 '18 at 16:26

Tai

7,684
3
29
49

And for a 100 list items, he will need 100 np.where ;-) – cs95 Jan 09 '18 at 16:34
@Wen Thanks, Wen :P – Tai Jan 09 '18 at 16:37
@cᴏʟᴅsᴘᴇᴇᴅ He can use a for loop :P when there is 100 items :P – Tai Jan 09 '18 at 16:38
Aww, loop? :-( That's totally sidestepping and shoehorning the vectorised capabilities of np.where... might as well use a pure python function :D – cs95 Jan 09 '18 at 16:39
for loop which i have been doing now and i find it slow. :) – oim Jan 09 '18 at 16:40
@cᴏʟᴅsᴘᴇᴇᴅ Thanks for the comments!! Editted. – Tai Jan 09 '18 at 16:47
@user8505495 Think you might adopt others' great answer :P – Tai Jan 09 '18 at 16:48

Dummify variables based on list of categories

4 Answers4