I have a data frame, something like this:

Id  Col1  Col2  Paired_Id
1   a           A
2   c           B
A         b     1
B         d     2

I would like to merge each row with its paired row to get output like the following, and delete the paired row after merging:

Id  Col1  Col2  Paired_Id
1   a     b     A
2   c     d     B

Any hint?

In short: I want to merge each row (by Id) with the row named in its Paired_Id. Is this possible with pandas?

ifly6
Bharath
  • Please reformat your tables. It isn't obvious looking at the underlying markdown which cells belong to which columns. If you're asking how to merge, see below. – ifly6 Feb 08 '23 at 21:10
  • Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – ifly6 Feb 08 '23 at 21:10
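
For the frame in the question, a merge along the lines the comment points to might look like the sketch below. It assumes the empty cells are NaN and that all IDs are stored as strings; neither is stated in the question. The names `left` and `right` are only illustrative.

```python
import numpy as np
import pandas as pd

# Example frame rebuilt from the question (assumption: empty cells are NaN,
# and every ID is a string).
df = pd.DataFrame({
    "Id": ["1", "2", "A", "B"],
    "Col1": ["a", "c", np.nan, np.nan],
    "Col2": [np.nan, np.nan, "b", "d"],
    "Paired_Id": ["A", "B", "1", "2"],
})

# Rows that carry Col1 are kept as the "main" rows.
left = df.loc[df["Col1"].notna(), ["Id", "Col1", "Paired_Id"]]

# Rows that carry Col2 are the paired rows; their Id is what the
# main rows refer to in Paired_Id, so rename it for the join.
right = (
    df.loc[df["Col2"].notna(), ["Id", "Col2"]]
    .rename(columns={"Id": "Paired_Id"})
)

# Join each main row to its paired row and restore the column order.
out = left.merge(right, on="Paired_Id", how="left")
out = out[["Id", "Col1", "Col2", "Paired_Id"]]
print(out)
```

The paired rows drop out on their own here, because only the rows with a value in Col1 are used as the left side of the merge.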

2 Answers

Assuming NaNs in the empty cells, I would use a groupby.first with a frozenset of the two IDs as grouper:

group = df[['Id', 'Paired_Id']].apply(frozenset, axis=1)

out = df.groupby(group, as_index=False).first()

Output:

  Id Col1 Col2 Paired_Id
0  1    a    b         A
1  2    c    d         B
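
For anyone who wants to try this end to end, here is a self-contained sketch of the same frozenset idea. The example frame is rebuilt from the question under the assumption that every ID is a string and the empty cells are NaN. `sort=False` and `reset_index(drop=True)` are my additions to keep the original row order and a plain integer index; they are not part of the two lines above.

```python
import numpy as np
import pandas as pd

# Example frame rebuilt from the question (assumption: IDs are strings,
# empty cells are NaN).
df = pd.DataFrame({
    "Id": ["1", "2", "A", "B"],
    "Col1": ["a", "c", np.nan, np.nan],
    "Col2": [np.nan, np.nan, "b", "d"],
    "Paired_Id": ["A", "B", "1", "2"],
})

# A frozenset ignores order, so the row (1, A) and the row (A, 1)
# produce the same key and land in the same group.
group = df[["Id", "Paired_Id"]].apply(frozenset, axis=1)

# first() takes the first non-null value per column within each group,
# which fills Col2 from the paired row.
out = df.groupby(group, sort=False).first().reset_index(drop=True)
print(out)
```

Note that this only pairs rows up if both IDs have the same type in both rows: an integer `1` in Id and a string `"1"` in Paired_Id would give two different keys.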
mozway

I don't have much information about the structure of your dataframe, so I will assume a few things; please correct me if I'm wrong:

  • A line with an entry in Col1 will never have an entry in Col2.
  • Corresponding lines appear in the same sequence (lines 1,2,3... then corresponding lines 1,2,3...)
  • Every line has a corresponding second line later on in the dataframe

If all those assumptions are correct, you could split your data into two dataframes: df_upperhalf containing the rows with Col1, and df_lowerhalf containing the rows with Col2.

half = len(df.index) // 2  # number of rows in each half
df_upperhalf = df.iloc[:half]
df_lowerhalf = df.iloc[half:]

Then you can easily combine those values:

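df_combined = df_upperhalf.copy()  # copy so the original slice is not modified
df_combined['Col2'] = df_lowerhalf['Col2'].to_numpy()  # assign by position; the halves have different index labels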

If some of my assumptions are incorrect, this will of course not produce the results you want.
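
Put together, the split-and-combine steps could look like the runnable sketch below. The example frame is rebuilt from the question, assuming the IDs are strings and the empty cells are NaN, and the approach only works under the three assumptions listed above (equal halves, in matching order).

```python
import numpy as np
import pandas as pd

# Example frame rebuilt from the question (assumption: IDs are strings,
# empty cells are NaN).
df = pd.DataFrame({
    "Id": ["1", "2", "A", "B"],
    "Col1": ["a", "c", np.nan, np.nan],
    "Col2": [np.nan, np.nan, "b", "d"],
    "Paired_Id": ["A", "B", "1", "2"],
})

# Split the frame into the rows holding Col1 and the rows holding Col2.
half = len(df) // 2
df_upperhalf = df.iloc[:half]
df_lowerhalf = df.iloc[half:]

# Copy the upper half, then fill Col2 by position. to_numpy() drops the
# index labels, which differ between the two halves and would otherwise
# make the assignment produce NaN.
df_combined = df_upperhalf.copy()
df_combined["Col2"] = df_lowerhalf["Col2"].to_numpy()
print(df_combined)
```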

There are also quite a few ways to do it in fewer lines of code, but I think this way you end up with nicer dataframes and the code should be easily readable.

Edit:

I think this would be quite a bit faster:

df_upperhalf = df.head(len(df.index) // 2)
df_lowerhalf = df.tail(len(df.index) // 2)