2

For example, I have dataframe as per below.

>>> df
  CATX CATY CATZ
0    A    G  AAA
1    B    H  BBB
2    C    I  AAA
3    B    J  CCC
4    A    G  BBB
5    B    H  DDD
6    D    K  EEE
7    E    L  FFF

I want to add new columns on the df based on a values provided in a list. For example, for CATZ, i have a list ['AAA', 'BBB'] that I need to consider to indicate that the obervation is 1 or 0 e.g.

>>> df
  CATX CATY CATZ  AAA  BBB
0    A    G  AAA    1    0
1    B    H  BBB    0    1
2    A    I  AAA    1    0
3    B    J  CCC    0    0
4    A    H  BBB    0    1
5    B    H  DDD    0    0
6    D    K  EEE    0    0
7    E    L  FFF    0    0

This is a bit different from pd.get_dummies as get_dummies considers all the possible values (or k-1 values) on your whole dataframe/column. Currently, what I am doing is to loop through the list and execute apply for every row.

for catz_item in catz_list:
    df[catz_item] = df.apply(lambda x: 1 if x.CATZ == catz_item else 0, axis=1)

Is there any other way to do this aside from iterating through the list (as this loop is a bit slow). To make it more complicated, I am also doing this using combination of CATX and CATY based on a certain list as well, for example [['A', 'G'], ['A', 'H'], ['B', 'H']].

--- edit ---

output with combination of CATX / CATY

>>> df
  CATX CATY CATZ  AAA  BBB  AG  AH  BH
0    A    G  AAA    1    0   1   0   0
1    B    H  BBB    0    1   0   0   1
2    C    I  AAA    1    0   0   0   0
3    B    J  CCC    0    0   0   0   0
4    A    G  BBB    0    1   1   0   0
5    B    H  DDD    0    0   0   0   1
6    D    K  EEE    0    0   0   0   0
7    E    L  FFF    0    0   0   0   0

Code that I am using right now is as per below

catxy_list = [['A', 'G'], ['A', 'H'], ['B', 'H']]   
for catxy_item in catxy_list:
    df[catxy_item[0] + catxy_item[1]] = df.apply(lambda x: 1 if x.CATX == catxy_item[0] and x.CATY == catxy_item[1] else 0, axis=1)
oim
  • 1,141
  • 10
  • 14
  • Can you elaborate on the "to make it more complicated" part? What is your expected output there? – cs95 Jan 09 '18 at 16:28
  • @cᴏʟᴅsᴘᴇᴇᴅ, see my latest edit for the "complicated" part. Thanks. – oim Jan 09 '18 at 16:35
  • Thanks, that was helpful. I've added that to my answer if you want it. – cs95 Jan 09 '18 at 16:38
  • Thanks for the answers, I will do some test and accept the answer that suits my needs as I need to do this for ~10M observations. – oim Jan 09 '18 at 16:45
  • Hmm... you might suffer from a memory blowout when running my code for the complex query... thanks for that. I'll try to see if I can come up with a slower, but more memory efficient technique. – cs95 Jan 09 '18 at 16:47

4 Answers4

3

For the first bit involving CATZ, you can use where/mask + get_dummies -

v = df.CATZ.where(df.CATZ.isin(['AAA', 'BBB'])).str.get_dummies()

Or,

v = df.CATZ.mask(~df.CATZ.isin(['AAA', 'BBB'])).str.get_dummies()

Next, for your "complicated query", make use of numpy broadcasting with an astype conversion at the end -

# initial values to compare
i = np.array([['A', 'G'], ['A', 'H'], ['B', 'H']])
# perform broadcasted comparison with `i` and convert the result to OHEs
j = (df.iloc[:, :-1].values[:, None] == i).all(2).astype(int)

# load the result into a dataframe  with the appropriate column names 
j = pd.DataFrame(j, columns=list(map(''.join, i)))

Another slower, but more memory efficient option for your "complicated query" would be looping over each category and finding the OHE:

ohe = []
for x, y in [['A', 'G'], ['A', 'H'], ['B', 'H']]:
    # generate OHEs for each pair of elements per category
    s = df.CATX.eq(x) & df.CATY.eq(y)  # s = df[['CATX', 'CATY']].isin([x, y]).all(1)
    s.name = ''.join([x, y])
    ohe.append(s)

# concatenate the intermediate results
j = pd.concat(ohe, 1).astype(int)

Finally, you can load df, v, and j into a new dataframe using concat.

pd.concat([df, v, j], 1)

  CATX CATY CATZ  AAA  BBB  AG  AH  BH
0    A    G  AAA    1    0   1   0   0
1    B    H  BBB    0    1   0   0   1
2    C    I  AAA    1    0   0   0   0
3    B    J  CCC    0    0   0   0   0
4    A    G  BBB    0    1   1   0   0
5    B    H  DDD    0    0   0   0   1
6    D    K  EEE    0    0   0   0   0
7    E    L  FFF    0    0   0   0   0
cs95
  • 379,657
  • 97
  • 704
  • 746
2
In [403]: df.join(df.CATZ.str.get_dummies())
Out[403]:
  CATX CATY CATZ  AAA  BBB  CCC  DDD  EEE  FFF
0    A    G  AAA    1    0    0    0    0    0
1    B    H  BBB    0    1    0    0    0    0
2    C    I  AAA    1    0    0    0    0    0
3    B    J  CCC    0    0    1    0    0    0
4    A    G  BBB    0    1    0    0    0    0
5    B    H  DDD    0    0    0    1    0    0
6    D    K  EEE    0    0    0    0    1    0
7    E    L  FFF    0    0    0    0    0    1

or:

In [410]: df.join(df.CATZ[df.CATZ.isin(['AAA','BBB'])].str.get_dummies(), how='left').fillna(0)
Out[410]:
  CATX CATY CATZ  AAA  BBB
0    A    G  AAA  1.0  0.0
1    B    H  BBB  0.0  1.0
2    C    I  AAA  1.0  0.0
3    B    J  CCC  0.0  0.0
4    A    G  BBB  0.0  1.0
5    B    H  DDD  0.0  0.0
6    D    K  EEE  0.0  0.0
7    E    L  FFF  0.0  0.0
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • I can't see how the first part of the answer is any helpful. – deadcode Jan 09 '18 at 16:34
  • @deadcode It doesn't hurt to keep it there for those having some similar sort of issue and needing that solution instead. – cs95 Jan 09 '18 at 16:35
  • Agreed. I just wanted to know if it's doing anything other than just get_dummies. It's a one step way to concat I guess. – deadcode Jan 09 '18 at 16:38
2
pd.crosstab(df.index,df.CATZ)[['AAA','BBB']]
Out[66]: 
CATZ   AAA  BBB
row_0          
0        1    0
1        0    1
2        1    0
3        0    0
4        0    1
5        0    0
6        0    0
7        0    0

pd.concat([df,pd.crosstab(df.index,df.CATZ)[['AAA','BBB']]],1)
Out[68]: 
      CATX CATY CATZ  AAA  BBB
row_0                         
0        A    G  AAA    1    0
1        B    H  BBB    0    1
2        C    I  AAA    1    0
3        B    J  CCC    0    0
4        A    G  BBB    0    1
5        B    H  DDD    0    0
6        D    K  EEE    0    0
7        E    L  FFF    0    0
BENY
  • 317,841
  • 20
  • 164
  • 234
1

Think you need two np.where.

df["AAA"] = np.where(df["CATZ"] == "AAA", 1, 0)
df["BBB"] = np.where(df["CATZ"] == "BBB", 1, 0)

This can be slow when labels are a lot!

Tai
  • 7,684
  • 3
  • 29
  • 49