0

I have a dataframe with multiple scores and multiple dates. My goal is to bin each day into equal sized buckets (let's say 5 buckets) based on whatever score I choose. The problem is that some scores have an abundance of ties and therefore I need to first compute rank to introduce a tie-breaker criteria and then the qcut can be applied.

The simple solution is to create a field for the rank and then do groupby('date')['rank'].transform(pd.qcut). However, since efficiency is key, this implies doing two expensive groupbys and I was wondering if it is possible to "chain" the two operations into one sweep.

This is the closest I got; my goal is to create 5 buckets but the qcut seems to be wrong since it is asking me to provide hundreds of labels

 df_main.groupby('date')['score'].\
         apply(lambda x: pd.qcut(x.rank(method='first'),
                                 5,
                                 duplicates='drop',
                                 labels=lbls)
               )  

Thanks

Yuca
  • 6,010
  • 3
  • 22
  • 42
  • Apply() is quite slow - it may be quicker to do groupby and eat it that way,,,https://stackoverflow.com/questions/41472327/slow-performance-of-pandas-groupby-apply – Paul Brennan Nov 19 '20 at 20:43
  • thank you for the feedback. Recall that my baseline scenario is to do 2 groupbys and my goal is doing just one – Yuca Nov 19 '20 at 20:52
  • if you order your dataframe according to score, then take the 20% within each group? or am i missing something important? – StupidWolf Nov 19 '20 at 20:58
  • the ordering might be expensive as well, no? it's a good idea but sadly it won't work for what I need. I oversimplified for clarity sake, but I need to keep all the 5 groups and compare them, so I can't just keep the top n rows – Yuca Nov 19 '20 at 21:01

0 Answers0