
I came across the post "Weighted percentile using numpy" while looking for a way to compute weighted percentiles. The solutions given there all output slightly different results. I'm still relatively new to this community and wasn't sure whether a new post is the best way to ask about this, so please let me know if this is not the right place.

I don't think I fully understand the math behind the formulas and the solutions that were provided. I was wondering if someone knows the math behind the percentile formulas.

  • Welcome to SO! :-) I would suggest you decouple this question a little bit more from the other post you've linked. What is your exact problem, and why do none of the answers to the linked question solve it? To me, the difference between your question and the linked question is not entirely clear. – André Aug 01 '22 at 08:10
  • Thanks for the feedback. I'll edit the post; let me know if this clears up any confusion. – stock_exchange Aug 01 '22 at 08:12
  • The reason I decided to create a new post is that the link I included did not help me find the information I was looking for. That post lists several different ways of calculating percentiles with weights, and they all yield different results. I was hoping to find a more efficient solution that exactly matches the result I expect. – stock_exchange Aug 01 '22 at 08:18
  • What is the output? 88 is the output for i=90. – Corralien Aug 01 '22 at 08:41
  • Yes and that would be the correct output. Looking for alternative solutions that will also output 88 for the 90th percentile without having to use the repeat() function to create weights dataframe. – stock_exchange Aug 01 '22 at 08:48
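
For reference, here is a hedged sketch of the math behind the `repeat()`-based result. It assumes integer counts and NumPy's default "linear" interpolation; the symbols x_i (distinct values sorted ascending), w_i (their counts), and y_j (the j-th element, 0-based, of the expanded sorted sample) are notation introduced here, not taken from the linked post.

```latex
\begin{aligned}
N &= \sum_{i=1}^{m} w_i, \qquad C_i = \sum_{j \le i} w_j, \\
y_j &= x_{i^*(j)}, \qquad i^*(j) = \min\{\, i : C_i > j \,\}, \qquad j = 0, \dots, N-1, \\
h &= q\,(N-1), \qquad k = \lfloor h \rfloor, \\
Q(q) &= y_k + (h - k)\,(y_{k+1} - y_k).
\end{aligned}
```

At q = 1 the index k + 1 falls outside the sample and y_{k+1} is taken to be y_k. Other weighted-percentile recipes may place each value at a different position along the cumulative weight (for instance at the midpoint of its weight rather than at its end), which is one plausible reason the answers in the linked post disagree slightly.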

1 Answer


Try:

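import numpy as np
import pandas as pd

# df is assumed to hold the values in 'num' and their counts in 'obs'
q = np.arange(0, 1, 0.1)
out = pd.DataFrame({'qtl': q * 100,
                    'num': np.nanquantile(np.repeat(df['num'], df['obs']), q)})
print(out)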

# Output
    qtl   num
0   0.0   1.0
1  10.0   4.0
2  20.0  11.0
3  30.0  11.0
4  40.0  14.0
5  50.0  45.0
6  60.0  67.0
7  70.0  67.0
8  80.0  88.0
9  90.0  88.0
Corralien
  • In my post I mentioned that the piece of code I shared is computationally very expensive because of the repeat() function, and that I was looking for more efficient solutions. The solution you provided can also be problematic if the data is large. – stock_exchange Aug 01 '22 at 08:52
  • Are you sure the repeat is the bottleneck? – Corralien Aug 01 '22 at 08:55
  • Unfortunately yes. If my data has, for example, 5 billion rows and there are multiple repeated observations, this is going to create a giant array, which is a CPU overload. – stock_exchange Aug 01 '22 at 08:56
  • I think I just don't really understand the math behind the formulas so I edited my question. I think that will help me better understand what's going on conceptually so then I can apply it to my problem. – stock_exchange Aug 01 '22 at 23:26
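
One possible way to avoid `repeat()` is sketched below, assuming the counts are non-negative integers and that the goal is to reproduce NumPy's default linear interpolation on the expanded sample. The function name `repeat_free_quantile` is hypothetical and not from the original answer. The idea is to sort the distinct rows, take the cumulative sum of the counts, and look up the two neighbouring positions of each quantile in that cumulative sum instead of building the expanded array.

```python
import numpy as np

def repeat_free_quantile(values, counts, q):
    """Quantiles of np.repeat(values, counts) without building that array.

    Hypothetical helper: assumes integer counts and linear interpolation.
    """
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=np.int64)
    q = np.asarray(q, dtype=float)

    # Drop NaN values (as nanquantile would) and rows that never occur.
    keep = ~np.isnan(values) & (counts > 0)
    values, counts = values[keep], counts[keep]

    # Sort the rows by value and accumulate the counts.
    order = np.argsort(values, kind="stable")
    values, counts = values[order], counts[order]
    cum = np.cumsum(counts)
    n = cum[-1]                      # size of the expanded sample

    # Fractional position of each quantile in the expanded sample.
    h = q * (n - 1)
    lo = np.floor(h).astype(np.int64)
    hi = np.minimum(lo + 1, n - 1)
    frac = h - lo

    # Expanded index k belongs to the first row whose cumulative count exceeds k.
    v_lo = values[np.searchsorted(cum, lo, side="right")]
    v_hi = values[np.searchsorted(cum, hi, side="right")]
    return v_lo + frac * (v_hi - v_lo)
```

With the column names used in the answer, the call would be `repeat_free_quantile(df['num'], df['obs'], q)`. Memory use grows with the number of rows rather than with the sum of the counts, although the sort over the rows can still be costly for very large inputs.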