I have a pandas DataFrame with the next columns - "A", "B", "C", "D". I want to merge the rows of the DataFrame that has the following condition -
if my DataFrame is called df:
(df.at[i,"A"] == df.at[j, "B"]) and (df.at[j,"A"] == df.at[i,"B"])
For example -
df = pd.DataFrame([[1,2,10,0.55],[3,4,5,0.3],[2,1,2,0.7]], columns=["A","B","C","D"])
Which gives -
In [93]: df
Out[93]:
A B C D
0 1 2 10 0.55
1 3 4 5 0.30
2 2 1 2 0.70
In the example above rows 0 and 2 has the condition. I know for sure that there can be at most 2 rows that correspond to this condition. For the rows that has this condition I would like to sum the "C" values, Average the "D" and remove the redundant row. In the example above I would like to get -
In [95]: result
Out[95]:
A B C D
0 1 2 12 0.625
1 3 4 5 0.300
Or
In [95]: result
Out[95]:
A B C D
0 2 1 12 0.625
1 3 4 5 0.300
I tried the following code that was very slow:
def remove_dups(path_to_df: str):
df = pd.read_csv(path_to_df)
for i in range(len(df)):
a = df.at[i, "A"]
b = df.at[i, "B"]
same_row = df[(df["A"] == b) & (df["B"] == a)]
if same_row.empty:
continue
c = df.at[i, "C"]
d = df.at[i, "D"]
df.drop(i, inplace=True)
new_ind = same_row.index[0]
df.at[new_ind, "C"] += c
df.at[new_ind, "D"] = (df.at[new_ind, "D"] + distance) / 2
return df
Is there a way to accomplish this using only built-in Pandas functions?