3

I have a pandas DataFrame with the next columns - "A", "B", "C", "D". I want to merge the rows of the DataFrame that has the following condition -

if my DataFrame is called df:

(df.at[i,"A"] == df.at[j, "B"]) and (df.at[j,"A"] == df.at[i,"B"])

For example -

df = pd.DataFrame([[1,2,10,0.55],[3,4,5,0.3],[2,1,2,0.7]], columns=["A","B","C","D"]) 

Which gives -

In [93]: df                                                                                                                                     
Out[93]: 
   A  B   C     D
0  1  2  10  0.55
1  3  4   5  0.30
2  2  1   2  0.70

In the example above rows 0 and 2 has the condition. I know for sure that there can be at most 2 rows that correspond to this condition. For the rows that has this condition I would like to sum the "C" values, Average the "D" and remove the redundant row. In the example above I would like to get -

In [95]: result                                                                                                                                     
Out[95]: 
   A  B   C      D
0  1  2  12  0.625
1  3  4   5  0.300

Or

In [95]: result                                                                                                                                     
Out[95]: 
   A  B   C      D
0  2  1  12  0.625
1  3  4   5  0.300

I tried the following code that was very slow:

def remove_dups(path_to_df: str):
    df = pd.read_csv(path_to_df)
    for i in range(len(df)):
        a = df.at[i, "A"]
        b = df.at[i, "B"]
        same_row = df[(df["A"] == b) & (df["B"] == a)]
        if same_row.empty:
            continue
        c = df.at[i, "C"]
        d = df.at[i, "D"]
        df.drop(i, inplace=True)
        new_ind = same_row.index[0]
        df.at[new_ind, "C"] += c
        df.at[new_ind, "D"] = (df.at[new_ind, "D"] + distance) / 2
    return df

Is there a way to accomplish this using only built-in Pandas functions?

Guy
  • 63
  • 1
  • 7
  • 1
    Hi. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Apr 29 '19 at 07:56
  • 1
    Hello, thanks for the comment. I've edited the question and added an example that can be copy-pasted to an interpreter. Is the question clearer now? If not, please tell me – Guy Apr 29 '19 at 08:23

1 Answers1

2

Use numpy.sort first and then GroupBy.agg:

df[['A','B']] = np.sort(df[['A','B']], axis=1)

df = df.groupby(['A','B'], as_index=False).agg({'C':'sum', 'D':'mean'})
print (df)
   A  B   C      D
0  1  2  12  0.625
1  3  4   5  0.300

If original values cannot be changed:

arr = np.sort(df[['A','B']], axis=1)

df = (df.groupby([arr[:, 0],arr[:, 1]])
       .agg({'C':'sum', 'D':'mean'})
       .rename_axis(('A','B'))
       .reset_index())
print (df)
   A  B   C      D
0  1  2  12  0.625
1  3  4   5  0.300
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252