Merging two identically-named columns in a dataframe

Question

I have a dataframe that looks like this:

df = pd.DataFrame({'a':[1,0,1],'b':[0,1,0],'b1':[1,0,0],'c':[0,1,1]})
df.columns = ['a','b','b','c']

>>> df
   a  b  b  c
0  1  0  1  0
1  0  1  0  1
2  1  0  0  1

I want to merge those two different b columns together, like this:

I understand that I could use | (OR) in a bitwise context to combine them, e.g. with a and c:

>>> df['a'] | df['c']
0    1
1    1
2    1
dtype: int64

But I'm having trouble selecting the two individual b columns, because of this:

>>> df['b']
   b  b
0  0  1
1  1  0
2  0  0

>>> df['b']['b']
   b  b
0  0  1
1  1  0
2  0  0

>>> df['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']['b']
   b  b
0  0  1
1  1  0
2  0  0

You should specify some rules for the "join" (it is actually not a join). In this case, I interpret from the data that your columns only contain 0's or 1's, and that a 1 overwrites a 0. Is that all? — Ric S, Nov 17 '21 at 16:21
Yes @Ric. I want `b` to end up as `1,1,0` because the two `b`'s are `0,1,0` and `1,0,0`. — , Nov 17 '21 at 16:22
`df["b"].sum(axis=1).clip(0, 1)`? Or `df["b"].apply(lambda x: x[0]|x[1], axis=1)` — not_speshal, Nov 17 '21 at 16:23
`df.T.drop_duplicates().T` if you wanna drop the extra B column — Chris, Nov 17 '21 at 16:25
Nice @Chris - that's useful too. I do want drop the extra one. — , Nov 17 '21 at 16:25
Just curious as why you have duplicate column names in the first place? — Quang Hoang, Nov 17 '21 at 16:34
@Quang - I wrote [this answer](https://stackoverflow.com/a/69994937/17242583) to a different question, and I discovered later that there were duplicate columns because it's using one-hot-encoding. I asked this question because I searched quite a bit and couldn't figure how to do it. :) — , Nov 17 '21 at 16:36

score 2 · Answer 1 · answered Nov 17 '21 at 16:25

2

Try with sum and clip:

df["b"] = df["b"].sum(axis=1).clip(0, 1)

#remove duplicate column
df = df.loc[:, ~df.columns.duplicated()]

answered Nov 17 '21 at 16:25

not_speshal

22,093
2
15
30

The call to `clip` doesn't even seem necessary. – Nov 17 '21 at 16:28
1

You will get `2` instead of `1` if both "b" columns have `1` – not_speshal Nov 17 '21 at 16:28
@user17242583 To keep the other columns use `df.groupby(level=0, axis=1).sum().clip(0, 1)` (group the columns by their labels and apply the logic to each group) – Rodalm Nov 17 '21 at 16:29
1

@HarryPlotter - That would make sense if there are multiple duplicate columns. Otherwise my way is much faster (28.5 ns ± 1.04 ns per loop vs 2.55 ms ± 112 µs per loop). My code also becomes much more efficient with larger DataFrames (with 1000 times larger: 30.3 ns ± 0.809 ns per loop vs 3.41 ms ± 206 µs per loop). – not_speshal Nov 17 '21 at 16:31
1

Yes, naturally it would be slower. But in the case OP wants to generalize the logic for different groups of columns. – Rodalm Nov 17 '21 at 16:33
1

I think I'll use yours @HarryPlotter actually, because, though I love this fast little solution by not_speshal, the duplicate columns will be arbitrary, so yours will work better for my particular case. Will you please put it in an answer? – Nov 17 '21 at 16:37

score 1 · Answer 2 · edited Nov 23 '21 at 23:17

1

Beside the answer suggested by not_speshal, you could also access the columns by index as follows:

df.iloc[:, 1] | df.iloc[:, 2]

edited Nov 23 '21 at 23:17

Peter Mortensen

30,738
21
105
131

answered Nov 17 '21 at 16:29

mailach

71
5

score 1 · Accepted Answer · edited Nov 23 '21 at 23:18

1

Assuming that you have multiple groups of repeated columns, you can apply the same logic of not_speshal's solution to each group using DataFrame.groupby.

# group the columns (axis=1) by their labels (level=0) and apply the logic to each group
df = df.groupby(level=0, axis=1).sum().clip(0, 1)

edited Nov 23 '21 at 23:18

Peter Mortensen

30,738
21
105
131

answered Nov 17 '21 at 16:44

Rodalm

5,169
5
21

Merging two identically-named columns in a dataframe

3 Answers3