1

I have been working with a dataframe in Pandas that contains duplicate entries along with non-duplicates in a column. The dataframe looks something like this:

    country_name  values  category
0   country_1     10       a
1   country_2     20       b
2   country_1     50       a
3   country_2     10       b
4   country_3     100      c
5   country_4     10       d

I want to write something that converts(replaces) duplicates with their mean values in my dataframe. An ideal output would be something similar to the following:

    country_name  values    category
0   country_1      30        a
1   country_2      15        b
2   country_3      100       c
3   country_4      10        d

I have been struggling with this for a while so I would appreciate any help. I have forgotten to add category column. The problem with groupby() method as you now when you call mean() it does not return category column back. My solution was to take numerical columns and the column that has duplicates together apply groupby().mean() then concatenate back to the categorical columns. So I am looking for a solution shorten than what I have done.
My method get tedious when you are dealing with many categorical columns.

Ozkan Serttas
  • 947
  • 13
  • 14

1 Answers1

5

You can use df.groupby():

df.groupby('country_name').mean().reset_index()
cors
  • 527
  • 4
  • 11
  • 1
    Thank you for your reply. I fixed my question. So the real issue is the categorical columns when you apply `groupby()` method for removing duplicates. Sorry for the confusion. – Ozkan Serttas Jan 11 '19 at 23:04
  • @cyber-math in this case, the code below works: `df.groupby('country_name').agg({'values': 'mean', 'category': lambda x: pd.unique(x)}).reset_index()` – cors Jan 11 '19 at 23:19
  • ah, but it will not work when you have more complicated categorical column – cors Jan 11 '19 at 23:21
  • one more attempt: `df[~df.duplicated(subset=['country_name', 'category'])].groupby('country_name').transform('mean').join(df[['country_name', 'category']][~df.duplicated(subset=['country_name', 'category'])])` – cors Jan 11 '19 at 23:36