How to retrieve rows from DataFrame based on their first appearance

Question

I generated a dataset that shows the similarity between users in a graph based on their neighbors. Starting from a dataset that shows the trust relations between users in a social network, I'm aiming to build a new dataset that contains the users most similar to my "trustor" user (e.g. the 3 most similar ones) by using a similarity evaluation algorithm. I have sorted the users in descending order so that the first time a new "trustor" appears, his/her most similar users appear first.

new_trust.sort_values(['truster','value'],ascending=False)

So basically I need to keep only the first 3 appearances of each user in my dataframe. I tried a loop along the lines of for i in range(len(new_trust)): but couldn't get it to work.

Duplicate - https://stackoverflow.com/questions/20069009/pandas-get-topmost-n-records-within-each-group — jezrael, May 13 '21 at 10:25

Deusdeorum · Accepted Answer · 2021-05-13T10:28:23.340


If the user is stored in your truster column, you can group by that column and take the first 3 rows of each group.

import pandas as pd

arr = {'truster':{0:1642,1:1642,2:1642,3:1642,4:1642,5:2,6:2,7:2,8:2,9:2},'trustee':{0:1570,1:524,2:1039,3:1545,4:1360,5:1388,6:658,7:1078,8:1336,9:1157},'value':{0:'0,08',1:'0,0533333',2:'0,04',3:'0,04',4:'0,022857',5:'0,001175',6:'0,001169',7:'0,001169',8:'0,001169',9:'0,000902'}}
df_ = pd.DataFrame.from_dict(arr)
# keep the first 3 rows of each truster group, in their current order
df = df_.groupby(['truster']).head(3)

   truster  trustee      value
0     1642     1570       0,08
1     1642      524  0,0533333
2     1642     1039       0,04
5        2     1388   0,001175
6        2      658   0,001169
7        2     1078   0,001169

Another solution would be to use cumcount, which numbers the rows within each group starting from 0:

df_['tmp_seq'] = df_.groupby(['truster']).cumcount()
df = df_.loc[df_['tmp_seq'] < 3]
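
Both approaches keep rows in whatever order the frame already has, so they only return the most similar users if the frame is sorted first. A minimal sketch of the whole pipeline follows, assuming the value column holds decimal-comma strings as in the sample above (strings would otherwise sort lexicographically, not numerically). The rows are taken from the sample data and shuffled on purpose; the name top3 is just illustrative.

```python
import pandas as pd

# Rows from the sample above, deliberately shuffled.
df = pd.DataFrame({
    'truster': [2, 1642, 2, 1642, 2, 1642, 1642, 2],
    'trustee': [1157, 1039, 1388, 1570, 658, 1360, 524, 1078],
    'value': ['0,000902', '0,04', '0,001175', '0,08',
              '0,001169', '0,022857', '0,0533333', '0,001169'],
})

# Assumption: 'value' uses a decimal comma, so convert it to float
# before sorting.
df['value'] = df['value'].str.replace(',', '.', regex=False).astype(float)

# Sort so the highest similarity comes first within each truster,
# then keep the first 3 rows of every group.
top3 = (
    df.sort_values(['truster', 'value'], ascending=False)
      .groupby('truster', sort=False)
      .head(3)
)
print(top3)
```

Note that sort_values returns a new frame, so its result has to be assigned or chained as here; calling it on its own line leaves the original frame unsorted.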

edited May 13 '21 at 10:28

answered May 13 '21 at 10:20


I interpret it as if the similarity algorithm was already done, and that OP only wants to keep the first 3 rows after the sort "`So basically I need to keep only the 3 first appearances of each user in my dataframe`" – Deusdeorum May 13 '21 at 10:23
Ya, so then it is dupe - https://stackoverflow.com/questions/20069009/pandas-get-topmost-n-records-within-each-group – jezrael May 13 '21 at 10:24
The nlargest approach seems to return a Series, while the one from the answer returns a DataFrame. I can now easily concat() the two dataframes to add the n most similar users as trustees. – George_Bast May 13 '21 at 10:32
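
To illustrate the difference mentioned in the last comment, here is a small sketch under the assumption that value has already been converted to floats. Calling nlargest on a single grouped column gives a Series indexed by (truster, original row label), whereas groupby(...).head(3) gives a DataFrame with all columns. The variable names are illustrative only.

```python
import pandas as pd

# Subset of the sample rows, with 'value' assumed to be numeric already.
df = pd.DataFrame({
    'truster': [1642, 1642, 1642, 1642, 2, 2, 2, 2],
    'trustee': [1570, 524, 1039, 1360, 1388, 658, 1078, 1157],
    'value': [0.08, 0.0533333, 0.04, 0.022857,
              0.001175, 0.001169, 0.001169, 0.000902],
})

# Series with a MultiIndex: (truster, original row label)
largest = df.groupby('truster')['value'].nlargest(3)

# DataFrame with every column, first 3 rows per group
first3 = df.groupby('truster').head(3)

# The last index level of the Series holds the original row labels,
# so the full rows can be recovered from the frame.
rows = df.loc[largest.index.get_level_values(-1)]

print(type(largest).__name__, type(first3).__name__)
print(rows)
```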
