
I have a dataframe df

        user_id     o_date      month

2          3      2017-05-15      4
3          3      2017-05-15      4
6          1      2017-05-25      4
22         7      2017-05-27      4
25         1      2017-05-23      4
26         3      2017-05-12      4
29         3      2017-05-13      4
39         7      2017-05-08      4
70         1      2017-05-25      4

I want to sort the rows by the frequency of 'user_id', so that the resulting dataframe is in descending order of frequency and the first rows contain the most frequently occurring user_id, just like the ordering produced by Series.value_counts().

I want the output like this:

       user_id     o_date      month

2          3      2017-05-15      4
3          3      2017-05-15      4
26         3      2017-05-12      4
29         3      2017-05-13      4
6          1      2017-05-25      4
25         1      2017-05-23      4
70         1      2017-05-25      4
22         7      2017-05-27      4
39         7      2017-05-08      4

How can I get this output? Thanks!

Edit: I got the output. Now I want to remove duplicated user_id rows according to o_date (for rows with the same user_id, I want to keep the o_date that occurs most frequently), like this final result:

        user_id     o_date      month
2          3      2017-05-15      4
6          1      2017-05-25      4
22         7      2017-05-27      4

I'm new to dataframes, thanks again!

th000
2 Answers


Use:

df = df.iloc[(-df['user_id'].map(df['user_id'].value_counts())).argsort()]
print (df)
    user_id      o_date  month
2         3  2017-05-15      4
3         3  2017-05-15      4
26        3  2017-05-12      4
29        3  2017-05-13      4
6         1  2017-05-25      4
25        1  2017-05-23      4
70        1  2017-05-25      4
22        7  2017-05-27      4
39        7  2017-05-08      4

Explanation:

1. First get the counts with value_counts:

print (df['user_id'].value_counts())
3    4
1    3
7    2
Name: user_id, dtype: int64

2. Map them onto column user_id:

print (df['user_id'].map(df['user_id'].value_counts()))
2     4
3     4
26    4
29    4
6     3
25    3
70    3
22    2
39    2
Name: user_id, dtype: int64

3. Negate and call argsort to get positions in descending order of count:

print ((-df['user_id'].map(df['user_id'].value_counts())).argsort())
2     0
3     1
26    2
29    3
6     4
25    5
70    6
22    7
39    8
Name: user_id, dtype: int64

4. Finally, select rows by position with iloc to get the new ordering.
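Putting the steps together, here is a self-contained sketch using the sample data from the question (note that the tie order within equal counts is not guaranteed to be stable with argsort):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame(
    {"user_id": [3, 3, 1, 7, 1, 3, 3, 7, 1],
     "o_date": ["2017-05-15", "2017-05-15", "2017-05-25", "2017-05-27",
                "2017-05-23", "2017-05-12", "2017-05-13", "2017-05-08",
                "2017-05-25"],
     "month": [4] * 9},
    index=[2, 3, 6, 22, 25, 26, 29, 39, 70],
)

counts = df["user_id"].map(df["user_id"].value_counts())  # per-row frequency of user_id
df_sorted = df.iloc[(-counts).argsort()]                  # positions in descending frequency
print(df_sorted)
```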

EDIT: To remove duplicates by column, use drop_duplicates:

df = df.drop_duplicates('user_id')
print (df)
    user_id      o_date  month
2         3  2017-05-15      4
6         1  2017-05-25      4
22        7  2017-05-27      4
jezrael
    you help me a lot. Thank you – th000 Apr 26 '18 at 12:03
  • I changed the original `df` (`o_date`), so maybe `drop_duplicates` can't reach the goal. I want to remove the duplicated `user_id` according to the `o_date` (for the same `user_id`, I choose the `o_date` which occurs most frequently) – th000 Apr 26 '18 at 12:23
  • @user9673692 I am now offline, on phone only. Can you create new question? Thank you. – jezrael Apr 26 '18 at 12:36
  • I'm sorry, but I'm new to StackOverflow, so I am limited to one question per week; maybe you can help me when you're online. Thanks! – th000 Apr 26 '18 at 12:44
  • @user9673692 Not sure I understand. Why do you think drop_duplicates does not work? Don't you need the first duplicated row per 'user_id'? Or do you need to sort the date column first, like `df = df.sort_values('o_date').drop_duplicates('user_id')` or `df = df.sort_values('o_date', ascending=False).drop_duplicates('user_id')`? – jezrael Apr 26 '18 at 13:47
  • @jezrael `drop_duplicates` will drop duplicates except for the first occurrence by default. But I want to choose the `o_date` which occurs most often among rows with the same `user_id`. For example, we choose `2017-05-15` (`user_id=3`) because it occurs 2 times and the others occur 1 time, not just because `2017-05-15` appears in the first line. – th000 Apr 27 '18 at 01:11
  • @th000 - I think I understand now. So first get all dupes like `df = df[df.duplicated(['user_id','o_date'], keep=False)].groupby('user_id')['o_date'].apply(lambda x: x.value_counts().index[0])` and `groupby` + `value_counts` is if more dupes per `user` then get only date with most occurence. (in real data maybe not necessary, not sure). – jezrael Apr 27 '18 at 06:03
  • @th000 Maybe I dont understand what you need :( – jezrael Apr 27 '18 at 07:20
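Following up on the comment thread, here is a hedged sketch (my own illustration, not from the original answer) of keeping one row per user_id whose o_date is that user's most frequent date; when a user has no repeated date, the tie is broken arbitrarily by value_counts:

```python
import pandas as pd

df = pd.DataFrame(
    {"user_id": [3, 3, 1, 7, 1, 3, 3, 7, 1],
     "o_date": ["2017-05-15", "2017-05-15", "2017-05-25", "2017-05-27",
                "2017-05-23", "2017-05-12", "2017-05-13", "2017-05-08",
                "2017-05-25"],
     "month": [4] * 9},
    index=[2, 3, 6, 22, 25, 26, 29, 39, 70],
)

# most frequent o_date per user_id (ties broken arbitrarily by value_counts order)
top_date = df.groupby("user_id")["o_date"].agg(lambda s: s.value_counts().idxmax())

# keep only rows whose o_date matches their user's top date, then one row per user
result = df[df["o_date"] == df["user_id"].map(top_date)].drop_duplicates("user_id")
print(result)
```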

jezrael's answer is a really cool one-liner, but you can also add a supporting "count" column in case you need to monitor the frequency:

df['count'] = df.groupby('user_id')['user_id'].transform(pd.Series.value_counts)
df.sort_values('count', ascending=False)

Output:

        month  user_id  count
2       4        3      4
3       4        3      4
26      4        3      4
29      4        3      4
6       4        1      3
25      4        1      3
70      4        1      3
22      4        7      2
39      4        7      2
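An equivalent way to build the helper column (an assumption on my part, not from the answer above) is transform('count'), combined with a stable sort so rows with equal counts keep their original order:

```python
import pandas as pd

df = pd.DataFrame(
    {"user_id": [3, 3, 1, 7, 1, 3, 3, 7, 1],
     "o_date": ["2017-05-15", "2017-05-15", "2017-05-25", "2017-05-27",
                "2017-05-23", "2017-05-12", "2017-05-13", "2017-05-08",
                "2017-05-25"],
     "month": [4] * 9},
    index=[2, 3, 6, 22, 25, 26, 29, 39, 70],
)

df["count"] = df.groupby("user_id")["user_id"].transform("count")
# kind="mergesort" is pandas' stable sort: ties keep their input order
df = df.sort_values("count", ascending=False, kind="mergesort")
print(df)
```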