
I have a Pandas DataFrame with about 30,000 records and would like to find all the values in a specific column that occur fewer than 10 times. The DataFrame contains clinical trial data, and the column I need to filter and update holds the diseases for each trial. Some diseases appear in numerous clinical trials, so I need to first find all the diseases that appear fewer than 10 times and then replace the text of those diseases with the string 'other'. The result needs to be written back to that same column.

This is the code that I have come up with but JupyterLab seems to freeze when I try to run it.

df_diseases = df.groupby(['Diseases']).filter(lambda x: x['Diseases'].count() < 10).apply(lambda x: x.replace(x,'other')) 
V Chau

2 Answers


You can use groupby().transform():

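# Size of each row's disease group, aligned with df's index
s = df.groupby('Diseases')['Diseases'].transform('count')
# Overwrite rare diseases in the same column
df.loc[s < 10, 'Diseases'] = 'other'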
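
As a minimal sketch of the transform approach on toy data (the `Trial` column, the sample values, and the threshold of 3 instead of 10 are made up for illustration):

```python
import pandas as pd

# Toy data; the values and the small threshold are assumptions for illustration.
df = pd.DataFrame({
    "Trial": [1, 2, 3, 4, 5, 6],
    "Diseases": ["asthma", "asthma", "asthma", "gout", "flu", "flu"],
})
threshold = 3  # the question uses 10

# transform('count') returns a Series aligned with df's index,
# holding the size of the group each row belongs to.
counts = df.groupby("Diseases")["Diseases"].transform("count")

# Overwrite the rare diseases in place, in the same column.
df.loc[counts < threshold, "Diseases"] = "other"

print(df["Diseases"].tolist())
```

Because `counts` has one entry per row, the boolean mask `counts < threshold` can be passed straight to `loc`, so no per-group Python lambda is needed.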

Or you can use value_counts and map:

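import numpy as np

# Number of rows per disease
s = df['Diseases'].value_counts()

# Keep diseases with at least 10 rows, replace the rest with 'other'
df['Diseases'] = np.where(df['Diseases'].map(s) >= 10, df['Diseases'], 'other')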
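
The `value_counts` and `map` variant can be sketched the same way. The sample values and the threshold of 3 are again assumptions chosen to keep the example small:

```python
import numpy as np
import pandas as pd

# Toy data; values and threshold are assumptions for illustration.
df = pd.DataFrame({"Diseases": ["asthma"] * 3 + ["gout"] + ["flu"] * 2})
threshold = 3  # the question uses 10

# One entry per distinct disease: disease -> number of rows.
freq = df["Diseases"].value_counts()

# Look up each row's disease in freq to get a per-row count.
row_counts = df["Diseases"].map(freq)

# Keep frequent diseases, replace everything else with 'other'.
df["Diseases"] = np.where(row_counts >= threshold, df["Diseases"], "other")

print(df["Diseases"].tolist())
```

Note the comparison is `>=` so that a disease with exactly `threshold` rows is kept, matching the "fewer than 10" wording of the question.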
Quang Hoang
  • I am not sure how this code allows me to replace the records with count less than 10 with the string of 'other' back to the original dataframe column. I don't have any problems getting the groupby count values, I am having issues getting the string 'other' to be printed back to that same column to replace the values currently in those rows. – V Chau Oct 25 '19 at 13:37
  • Once you locate those lines, you can either use `loc` or `np.where`. See update. – Quang Hoang Oct 25 '19 at 13:40
  • Awesome! Works exactly like I need it to. Thank you. – V Chau Oct 25 '19 at 14:29

The answer to your question may be found here (look for Pedro M Duarte's answer): Get statistics for each group (such as count, mean, etc) using pandas GroupBy?

Jānis Š.
  • I've read through the post, but I'm not sure how it answers my question about replacing a specific string where a value occurs fewer than 10 times and then writing it back to the same column with the updated string of 'other'. – V Chau Oct 25 '19 at 13:11