1

Assuming that I have a dataframe of pastries

       Pastry      Flavor      Qty
0     Cupcake      Cheese      3
1   Cakeslice   Chocolate      2
2        Tart       Honey      2
3   Croissant   Raspberry      1

And I get the value count of a specific flavor per pastry

df[df['Flavor'] == 'Cheese']['Pastry'].value_counts()
Cupcake    4
Tart       4
Cakeslice  3
Turnover   3
Creampie   2
Danish     2
Bear Claw  2

Then to get the percentile of that flavor qty, I could do this

df[df['Flavor'] == 'Cheese']['Pastry'].value_counts().describe(percentiles=[.75, .85, .95]) 

And I'd get something like this (from full dataframe)

count    35.00000
mean     1.485714
std      0.853072
min      1.000000
50%      1.000000
75%      2.000000
85%      2.000000
95%      3.300000
max      4.000000

Where the total different pastries that are cheese flavored is 35, so the total cheese qty is distributed amongst those 35 pastries. The mean of qty is 1.48, max qty is 4 (cupcake and tart) etc, etc.

What I want to do is bring that 95th percentile down by counting all other values which are not 'Cheese' in the flavor column, however value_counts() is only counting the ones that are 'Cheese' because I filtered the dataframe. How can I also count the non Cheese rows, so that my percentiles will go down and will represent the distribution of Cheese total in the entire dataframe?

This is an example output:

Cupcake    4
Tart       4
Cakeslice  3
Turnover   3
Creampie   2
Danish     2
Bear Claw  2
Swiss Roll 1
Baklava    0
Cannoli    0

Where the non-cheese flavor pastries are being included with 0 as qty, from there I can just get the percentiles and they will be reduced since there are 0 values now diluting them.

Salchipapas
  • 118
  • 8
  • Are you looking for df.Qty.describe(percentiles=[.95]) – Chris May 30 '20 at 01:24
  • @Chris There is no Qty column on the original dataframe, this is an example dataframe for illustrative purposes. I get the quantities on the original dataframe from running df[df['Flavor] == 'Cheese'].value_counts() and it returns the number of 'Cheese' rows per pastry, but I want to basically include all that is not cheese as 0 so that I can dilute the percetiles. – Salchipapas May 30 '20 at 01:44
  • @Salchipapas input frame you have already shard if you can share sample of expected output would be easy to understand ,, also have look if this can take you in some direction https://stackoverflow.com/questions/62034736/pythomcompare-2-columns-and-write-data-to-excel-sheets/62038290?noredirect=1#comment109736116_62038290 – Hietsh Kumar May 30 '20 at 17:56

1 Answers1

0

I decided to go and try the long way to try and solve this question and my result gave me the same answer as this question

Here is the long way, in case anyone is curious.

pastries = {}

for p in df['Pastry'].unique():
  pastries[p] = df[(df['Flavor'] == 'Cheese') & (df['Pastry'] == p)]['Pastry'].count()

newdf = pd.DataFrame.from_dict(pastries.items())

newdf.describe(percentiles=[.75, .85, .95])
Salchipapas
  • 118
  • 8