I am trying to sort a DataFrame by popularity, that is, by how often each value in the Name column occurs.

Right now, I'm doing this:

df['Count'] = df.apply(lambda x: len(df[df['Name'] == x['Name']]), axis=1)
df[df['Count'] > 50][['Name', 'Description', 'Count']].drop_duplicates('Name').sort_values('Count', ascending=False).head(100)

However, this query is very slow; it takes hours to run.

What would be a more efficient way to do this?

if __name__ is None

3 Answers

The solution I have been looking for is:

df['Count'] = df.groupby('Name')['Name'].transform('count')

Big thanks to @Lynob for providing a link with an answer.
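
For reference, here is a minimal sketch of how that line could slot into the rest of the query from the question. The sample data and the min_count variable are assumptions made up for illustration; the question uses a threshold of 50.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Name": ["a", "b", "a", "c", "a", "b"],
    "Description": ["first a", "first b", "second a", "only c", "third a", "second b"],
})

# For every row, store how many rows share that row's Name.
# groupby + transform computes this per group instead of rescanning df per row.
df["Count"] = df.groupby("Name")["Name"].transform("count")

min_count = 1  # assumption: lowered from the question's 50 to suit the tiny sample

result = (
    df[df["Count"] > min_count][["Name", "Description", "Count"]]
    .drop_duplicates("Name")                # keep the first row seen for each name
    .sort_values("Count", ascending=False)  # most popular names first
    .head(100)
)
print(result)
```

The filtering, de-duplication and sorting steps are unchanged from the question; only the way Count is computed differs.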

if __name__ is None

You can use Series.value_counts.

import pandas as pd

df = pd.DataFrame([[0, 1], [1, 0], [1, 1]], columns=['a', 'b'])
print(df['b'].value_counts())

outputs

1    2
0    1
Name: b, dtype: int64
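
To get a per-row Count column like the one in the question, one possible approach is to map these counts back onto the original column. This is only a sketch: the sample data is invented, and the Name and Description columns are assumed to match the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Name": ["x", "y", "x", "z", "y", "x"],
    "Description": ["d1", "d2", "d3", "d4", "d5", "d6"],
})

# value_counts returns a Series indexed by name, sorted by count descending.
counts = df["Name"].value_counts()

# Look up each row's name in that Series to attach its count to the row.
df["Count"] = df["Name"].map(counts)
print(df)
```

The counting happens once for the whole column, and each row then only needs a lookup.
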
Alex

Try this:

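import pandas as pd

a = ["jim"] * 5 + ["jane"] * 10 + ["john"] * 15
n = pd.Series(a)

# Count once, keep names seen more than 5 times, then sort the names alphabetically
counts = n.value_counts()
sorted(counts[counts > 5].index)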

['jane', 'john']
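
Note that sorted() orders the names alphabetically. If the goal is popularity order, as in the question, a small variation is to skip sorted() and rely on value_counts already returning counts in descending order. The variable names below are made up for this sketch.

```python
import pandas as pd

names = pd.Series(["jim"] * 5 + ["jane"] * 10 + ["john"] * 15)

# value_counts sorts by count, largest first, so the order is already by popularity.
counts = names.value_counts()

# Keep only names that occur more than 5 times; the descending order is preserved.
popular = counts[counts > 5]

top_names = popular.index.tolist()
print(top_names)
```

Here john (15 occurrences) comes before jane (10), and jim (5) is dropped by the threshold.
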
Merlin