-1

I have a Pandas DataFrame of the form

df = pd.DataFrame({'1':['a','b','c'], '2':['b','a','d'], '3':['0.7','0.6','0.1']}).

I'd like to add a column to this DataFrame which contains the number of times a specific row is present, without considering the order (since the first two columns are the nodes of an undirected graph). Moreover, I'd like to merge those rows that differ only for the order of the first two columns, and take the mean of the numbers in the third one. In this case, it should be

df = pd.DataFrame({'1':['a','c'], '2':['b','d'], '3':['0.65','0.1'], '4':['2','1']}).

Consider also that the DataFrame contains more than 100.000 rows.

wrong_path
  • 376
  • 1
  • 6
  • 18
  • 1
    Sort your first two columns, then [aggregate for `["mean", "count"]`](https://stackoverflow.com/a/39931909/8881141)? – Mr. T Dec 01 '18 at 08:59

1 Answers1

0

Use -

a=df[['1','2']].values
a.sort(axis=1)
df[['1','2']] = a
df.groupby(['1','2'])['3'].agg(['count','mean']).reset_index()

Output

    1   2   count   mean
0   a   b   2   0.65
1   c   d   1   0.10

or

df[['1','2']] = df[['1','2']].sort_values(1,axis=1)
df.groupby(['1','2'])['3'].agg(['count','mean']).reset_index()
Vivek Kalyanarangan
  • 8,951
  • 1
  • 23
  • 42