How to merge rows by same value in different columns using Python (Pandas)

Question

I have a data frame, something like this:

Id  Col1  Col2  Paired_Id
1   a           A
2   c           B
A         b     1
B         d     2

I would like to merge each row with its paired row and delete the paired row afterwards, to get output like this:

Id  Col1  Col2  Paired_Id
1   a     b     A
2   c     d     B

Any hint?

In short: merge each row (by Id) with the row referenced by its Paired_Id. Is this possible with Pandas?

Please reformat your tables. It isn't obvious looking at the underlying markdown which cells belong to which columns. If you're asking how to merge, see below. — ifly6, Feb 08 '23 at 21:10
Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) — ifly6, Feb 08 '23 at 21:10

mozway · Accepted Answer · 2023-02-08T21:19:20.473

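Assuming NaNs in the empty cells, I would use a groupby.first with a frozenset of the two IDs as grouper. A frozenset ignores order, so the rows (1, A) and (A, 1) get the same key, and first then picks the first non-NaN value in each column: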

group = df[['Id', 'Paired_Id']].apply(frozenset, axis=1)

out = df.groupby(group, as_index=False).first()

Output:

  Id Col1 Col2 Paired_Id
0  1    a    b         A
1  2    c    d         B
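
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The sample frame is rebuilt from the question's table, assuming mixed int/str IDs and NaN in the empty cells. `sort=False` and `reset_index(drop=True)` are choices made for this sketch, not part of the two lines above: they keep the original row order and drop the frozenset key from the result.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumption: empty cells are NaN,
# IDs are a mix of ints and strings).
df = pd.DataFrame({
    "Id": [1, 2, "A", "B"],
    "Col1": ["a", "c", np.nan, np.nan],
    "Col2": [np.nan, np.nan, "b", "d"],
    "Paired_Id": ["A", "B", 1, 2],
})

# Order-insensitive key: rows (1, "A") and ("A", 1) map to the same frozenset.
group = df[["Id", "Paired_Id"]].apply(frozenset, axis=1)

# first() takes the first non-NaN value per column within each group.
# sort=False keeps groups in order of first appearance; reset_index(drop=True)
# discards the frozenset keys so only the original columns remain.
out = df.groupby(group, sort=False).first().reset_index(drop=True)

print(out)
```

Note that this only pairs rows whose IDs compare equal, so an `Id` of `1` and a `Paired_Id` of `"1"` would not be matched.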

edited Feb 08 '23 at 21:19

answered Feb 08 '23 at 21:16

It's been a long time since I've seen frozenset :-) – Corralien Feb 08 '23 at 21:19
@Corralien actually I think `np.sort` might be more efficient – mozway Feb 08 '23 at 21:19
2 times faster for np.sort. – Corralien Feb 08 '23 at 21:22
@Corralien which exact code did you use? – mozway Feb 08 '23 at 21:26
Just a timeit; the groupby doesn't work as is :-) – Corralien Feb 08 '23 at 21:27
Yes that's what I thought, there is an additional cost to convert the 2D array to grouper – mozway Feb 08 '23 at 21:29
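
The `np.sort` idea from these comments could look roughly like the sketch below. It is not the code that was timed. The cast to `str` is an assumption added here so that ints and strings can be sorted against each other, and splitting the sorted 2D array into two 1D key arrays is just one way to turn it into a grouper.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumption: empty cells are NaN).
df = pd.DataFrame({
    "Id": [1, 2, "A", "B"],
    "Col1": ["a", "c", np.nan, np.nan],
    "Col2": [np.nan, np.nan, "b", "d"],
    "Paired_Id": ["A", "B", 1, 2],
})

# Sort each (Id, Paired_Id) pair within its row so both rows of a pair
# end up with the same ordered key, e.g. ["1", "A"].
# Assumption: casting to str is acceptable for matching the IDs.
pairs = np.sort(df[["Id", "Paired_Id"]].astype(str).to_numpy(), axis=1)

# Group on the two sorted key columns, then drop the keys from the result.
out_sorted = (
    df.groupby([pairs[:, 0], pairs[:, 1]], sort=False)
      .first()
      .reset_index(drop=True)
)

print(out_sorted)
```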

Sasha-Mercedes Fischer · Answer 2 · 2023-02-08T21:36:49.730

I don't have much information about the structure of your dataframe, so I will assume a few things. Please correct me if I'm wrong:

- A line with an entry in Col1 will never have an entry in Col2.
- Corresponding lines appear in the same sequence (lines 1, 2, 3, ... then the corresponding lines 1, 2, 3, ...).
- Every line has a corresponding second line later on in the dataframe.

If all those assumptions are correct, you could split your data into two dataframes of equal length: df_upperhalf holding the rows with Col1 values and df_lowerhalf holding the rows with Col2 values.

half = len(df.index) // 2  # number of rows in each half
df_upperhalf = df.iloc[:half]
df_lowerhalf = df.iloc[half:]

Then you can easily combine those values:

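df_combined = df_upperhalf.copy()  # copy so the original frame is not modified
df_combined['Col2'] = df_lowerhalf['Col2'].to_numpy()  # assign by position, not by index label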

If some of my assumptions are incorrect, this will of course not produce the results you want.
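
Put together on the question's sample data, the split-and-combine idea can be sketched as follows. The frame is an assumed reconstruction with NaN in the empty cells, and an even number of rows is assumed. The Col2 values are assigned as a plain array because the two halves carry different index labels, so a label-aligned assignment would fill the column with NaN.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumption: empty cells are NaN).
df = pd.DataFrame({
    "Id": [1, 2, "A", "B"],
    "Col1": ["a", "c", np.nan, np.nan],
    "Col2": [np.nan, np.nan, "b", "d"],
    "Paired_Id": ["A", "B", 1, 2],
})

# Split by position into the rows holding Col1 and the rows holding Col2.
half = len(df) // 2
df_upperhalf = df.iloc[:half]
df_lowerhalf = df.iloc[half:]

# Copy the upper half and fill Col2 positionally from the lower half.
df_combined = df_upperhalf.copy()
df_combined["Col2"] = df_lowerhalf["Col2"].to_numpy()

print(df_combined)
```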

There are also quite a few ways to do it in fewer lines of code, but I think this way you end up with nicer dataframes and the code should be easily readable.

Edit:

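The split can also be written with head and tail, which I think reads more simply (I would not expect a large speed difference, since both select rows by position):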

df_upperhalf = df.head(len(df.index) // 2)
df_lowerhalf = df.tail(len(df.index) // 2)
