3

I have a dataframe that looks like this:

df = pd.DataFrame({'a':[1,0,1],'b':[0,1,0],'b1':[1,0,0],'c':[0,1,1]})
df.columns = ['a','b','b','c']

>>> df
   a  b  b  c
0  1  0  1  0
1  0  1  0  1
2  1  0  0  1

I want to merge those two different b columns together, like this:

   a  b  c
0  1  1  0
1  0  1  1
2  1  0  1

I understand that I could use | (OR) in a bitwise context to combine them, e.g. with a and c:

>>> df['a'] | df['c']
0    1
1    1
2    1
dtype: int64

But I'm having trouble selecting the two individual b columns, because of this:

>>> df['b']
   b  b
0  0  1
1  1  0
2  0  0

>>> df['b']['b']
   b  b
0  0  1
1  1  0
2  0  0

>>> df['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']
   b  b
0  0  1
1  1  0
2  0  0
  • 1
    You should specify some rules for the "join" (it is actually not a join). In this case, I interpret from the data that your columns only contain 0's or 1's, and that a 1 overwrites a 0. Is that all? – Ric S Nov 17 '21 at 16:21
  • Yes @Ric. I want `b` to end up as `1,1,0` because the two `b`'s are `0,1,0` and `1,0,0`. –  Nov 17 '21 at 16:22
  • 1
    `df["b"].sum(axis=1).clip(0, 1)`? Or `df["b"].apply(lambda x: x[0]|x[1], axis=1)` – not_speshal Nov 17 '21 at 16:23
  • `df.T.drop_duplicates().T` if you wanna drop the extra B column – Chris Nov 17 '21 at 16:25
  • Nice @Chris - that's useful too. I do want drop the extra one. –  Nov 17 '21 at 16:25
  • Just curious as why you have duplicate column names in the first place? – Quang Hoang Nov 17 '21 at 16:34
  • @Quang - I wrote [this answer](https://stackoverflow.com/a/69994937/17242583) to a different question, and I discovered later that there were duplicate columns because it's using one-hot-encoding. I asked this question because I searched quite a bit and couldn't figure how to do it. :) –  Nov 17 '21 at 16:36

3 Answers3

2

Try with sum and clip:

df["b"] = df["b"].sum(axis=1).clip(0, 1)

#remove duplicate column
df = df.loc[:, ~df.columns.duplicated()]
not_speshal
  • 22,093
  • 2
  • 15
  • 30
  • The call to `clip` doesn't even seem necessary. –  Nov 17 '21 at 16:28
  • 1
    You will get `2` instead of `1` if both "b" columns have `1` – not_speshal Nov 17 '21 at 16:28
  • @user17242583 To keep the other columns use `df.groupby(level=0, axis=1).sum().clip(0, 1)` (group the columns by their labels and apply the logic to each group) – Rodalm Nov 17 '21 at 16:29
  • 1
    @HarryPlotter - That would make sense if there are multiple duplicate columns. Otherwise my way is much faster (28.5 ns ± 1.04 ns per loop vs 2.55 ms ± 112 µs per loop). My code also becomes much more efficient with larger DataFrames (with 1000 times larger: 30.3 ns ± 0.809 ns per loop vs 3.41 ms ± 206 µs per loop). – not_speshal Nov 17 '21 at 16:31
  • 1
    Yes, naturally it would be slower. But in the case OP wants to generalize the logic for different groups of columns. – Rodalm Nov 17 '21 at 16:33
  • 1
    I think I'll use yours @HarryPlotter actually, because, though I love this fast little solution by not_speshal, the duplicate columns will be arbitrary, so yours will work better for my particular case. Will you please put it in an answer? –  Nov 17 '21 at 16:37
1

Beside the answer suggested by not_speshal, you could also access the columns by index as follows:

df.iloc[:, 1] | df.iloc[:, 2]
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
mailach
  • 71
  • 5
1

Assuming that you have multiple groups of repeated columns, you can apply the same logic of not_speshal's solution to each group using DataFrame.groupby.

# group the columns (axis=1) by their labels (level=0) and apply the logic to each group
df = df.groupby(level=0, axis=1).sum().clip(0, 1) 
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Rodalm
  • 5,169
  • 5
  • 21