5

I have a dataframe with binary values post performing get_dummies using pandas

df= 
Values  A1  A2  B1  B2  B3  B4  C1  C2  C3
10      1   0   1   0   0   0   1   0   0
12      0   1   0   0   1   0   0   1   0
3       0   1   0   1   0   0   0   0   1
5       1   0   0   0   0   1   1   0   0

I want a new column which has combination of all columns which has 1's in it

Expected output:

Values  A1  A2  B1  B2  B3  B4  C1  C2  C3  Combination
10      1   0   1   0   0   0   1   0   0   A1~~B1~~C1
12      0   1   0   0   1   0   0   1   0   A2~~B3~~C2
3       0   1   0   1   0   0   0   0   1   A2~~B2~~C3
5       1   0   0   0   0   1   1   0   0   A1~~B4~~C3

Actual matrix can be 25000+ rows*1000+columns

There is a similar solution found in R but i need it in Python bcoz all other dependencies are in python and R is new to me.

Extract column names with value 1 in binary matrix

Codes in R below, & need similar one or any other code in python which can help me to arrive at my expected output
Solution 1 : 
as.matrix(apply(m==1,1,function(a) paste0(colnames(m)[a], collapse = "")))

Solution 2: 
t <- which(m==1, arr.ind = TRUE)
as.matrix(aggregate(col~row, cbind(row=rownames(t), col=t[,2]), function(x) 
                                                    paste0(colnames(m)[x], collapse = "")))

how could something similar or arrive at my expected output in Python?

3 Answers3

5

You can try apply with str.join:

df["Combination"] = df.drop("Values", axis=1).apply(lambda x: "~~".join(x[x != 0].index), axis=1)

print(df)
#    Values  A1  A2  B1  B2  B3  B4  C1  C2  C3 Combination
# 0      10   1   0   1   0   0   0   1   0   0  A1~~B1~~C1
# 1      12   0   1   0   0   1   0   0   1   0  A2~~B3~~C2
# 2       3   0   1   0   1   0   0   0   0   1  A2~~B2~~C3
# 3       5   1   0   0   0   0   1   1   0   0  A1~~B4~~C1

Explanations:

  1. For calculating the Combination, ignore Values column. Several approach are possible (see this topic). Here I use drop: df.drop("Values", axis=1).
  2. Apply a custom function over each row using apply and axis=1
  3. In the function, filter values different from 0 using x[x != 0]
  4. Select column name (here there are index of the Serie) using .index
  5. Use str.join to match desired output: "~~".join(x[x != 0].index)

Full illustration:

# Step 1
print(df.drop("Values", axis=1))
#    A1  A2  B1  B2  B3  B4  C1  C2  C3
# 0   1   0   1   0   0   0   1   0   0
# 1   0   1   0   0   1   0   0   1   0
# 2   0   1   0   1   0   0   0   0   1
# 3   1   0   0   0   0   1   1   0   0

# Step 3
print(df.drop("Values", axis=1).apply(lambda x: x[x != 0], axis=1))
#     A1   A2   B1   B2   B3   B4   C1   C2   C3
# 0  1.0  NaN  1.0  NaN  NaN  NaN  1.0  NaN  NaN
# 1  NaN  1.0  NaN  NaN  1.0  NaN  NaN  1.0  NaN
# 2  NaN  1.0  NaN  1.0  NaN  NaN  NaN  NaN  1.0
# 3  1.0  NaN  NaN  NaN  NaN  1.0  1.0  NaN  NaN

# Step 4
print(df.drop("Values", axis=1).apply(lambda x: x[x != 0].index, axis=1))
# 0    Index(['A1', 'B1', 'C1'], dtype='object')
# 1    Index(['A2', 'B3', 'C2'], dtype='object')
# 2    Index(['A2', 'B2', 'C3'], dtype='object')
# 3    Index(['A1', 'B4', 'C1'], dtype='object')

# Step 5
df["Combination"] = df.drop("Values", axis=1).apply(lambda x: "~~".join(x[x != 0].index), axis=1)
print(df)
#    Values  A1  A2  B1  B2  B3  B4  C1  C2  C3 Combination
# 0      10   1   0   1   0   0   0   1   0   0  A1~~B1~~C1
# 1      12   0   1   0   0   1   0   0   1   0  A2~~B3~~C2
# 2       3   0   1   0   1   0   0   0   0   1  A2~~B2~~C3
# 3       5   1   0   0   0   0   1   1   0   0  A1~~B4~~C1
Alexandre B.
  • 5,387
  • 2
  • 17
  • 40
5
df["Combination"] = df.iloc[:, 1:].dot(df.add_suffix("~~").columns[1:]).str[:-2]

We select columns except for Values with iloc and then form a dot product where the second operand is respective columns of df with ~~ added to the end. Result gives ~~ for the very end as well, so we chop it with .str[:-2]

to get

   Values  A1  A2  B1  B2  B3  B4  C1  C2  C3 Combination
0      10   1   0   1   0   0   0   1   0   0  A1~~B1~~C1
1      12   0   1   0   0   1   0   0   1   0  A2~~B3~~C2
2       3   0   1   0   1   0   0   0   0   1  A2~~B2~~C3
3       5   1   0   0   0   0   1   1   0   0  A1~~B4~~C1
Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38
  • imho, this is an ideal solution since it avoids using `apply`. However, since it needs a 2nd string-stripping step, it would have to be timed against the `apply solution`. This would probably still be faster. – aneroid May 13 '21 at 09:03
  • 1
    Btw, you can skip the `.values` part before `.dot` so that it produces a pandas Series. And then you can put the string trimming in the same step: `df.iloc[:, 1:].dot(df.add_suffix("~~").columns[1:]).str[:-2]` – aneroid May 13 '21 at 09:10
3

You can use the .dot function of pandas after dropping your values column to get the column names you want and then process the output with list comprehensions to obtain the right A1~~B1~~C1 format.

Here's what it looks looks like:

tmp = [list(i) for i in list(df.dot(df.columns))]
df['combination'] = ['~~'.join(i) for i in tmp]

print(df)

    A   B   C   D   combination
0   1   1   0   0   A~~B
1   0   1   1   0   B~~C
2   0   0   1   0   C
zanga
  • 612
  • 4
  • 20
  • 1
    Spoke too soon. This solution has an issue if the columns have more than one character in the name. Gives outputs like `A~~1~~B~~1~~C~~1` instead of `A1~~B1~~C1`. – aneroid May 13 '21 at 08:59