4

I want to make a random sample selection in python from the following df such that at least 65% of the resulting sample should have color yellow and cumulative sum of the quantities selected to be less than or equals to 18.

Original Dataset:

Date        Id      color       qty
02-03-2018  A       red         5
03-03-2018  B       blue        2
03-03-2018  C       green       3
04-03-2018  D       yellow      4
04-03-2018  E       yellow      7
04-03-2018  G       yellow      6
04-03-2018  H       orange      8
05-03-2018  I       yellow      1
06-03-2018  J       yellow      5

I have got total qty. selected condition covered but stuck on how to move forward with integrating the % condition:

df2 = df1.sample(n=df1.shape[0])

df3= df2[df2.qty.cumsum() <= 18]

Required dataset:

Date        Id      color       qty
03-03-2018  B       blue        2
04-03-2018  D       yellow      4
04-03-2018  G       yellow      6
06-03-2018  J       yellow      5

Or something like this:

Date        Id      color       qty
02-03-2018  A       red         5
04-03-2018  D       yellow      4
04-03-2018  E       yellow      7
05-03-2018  I       yellow      1

Any help would be really appreciated!

Thanks in advance.

Analytics_TM
  • 493
  • 6
  • 28
  • There are ways to answer your question, but I wanted to pose a frame challenge. I don't think the question makes sense. Lets say I expanded your question to be "random sample of people at least 65% blue-eyed who have a combined age of 100". Without context, we must interpret "random" as "randomly chosen from all possible choices such that all are equally likely". There are about 1B blue eyed people, and the average size of a "random sample" will probably be around 2B (completely ignoring your age requirement, which presumably is there for a reason). What other constraints do you have? – Cireo Mar 12 '19 at 17:29

3 Answers3

3
  1. Filter rows with 'yellow' and select a random sample of at least 65% of your total sample size

    import random
    yellow_size = float(random.randint(65,100)) / 100
    df_yellow = df3[df3['color'] == 'yellow'].sample(yellow_size*sample_size)
    
  2. Filter rows with other colors and select a random sample for the remaining of your sample size.

    others_size = 1 - yellow_size
    df_others = df3[df3['color'] != 'yellow].sample(others_size*sample_size)
    
  3. Combine them both and shuffle the rows.

    df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
    

UPDATE:

If you want to check for both conditions simultaneously, this could be one way to do it:

import random

df_sample = df

while sum(df_sample['qty']) > 18:
    yellow_size = float(random.randint(65,100)) / 100
    df_yellow = df[df['color'] == 'yellow'].sample(yellow_size*sample_size)
    others_size = 1 - yellow_size
    df_others = df[df['color'] != 'yellow'].sample(others_size*sample_size)
    df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
panktijk
  • 1,574
  • 8
  • 10
  • Awesome, thanks @panktijk for the speedy response! The only thing is how do we make sure that the resulting dataset (df_sample) would also have the cumsum of quantities less than or equal to 18. – Analytics_TM Mar 11 '19 at 23:39
  • yeah, but we want both conditions to be true simultaneously at the time of sample selection. Does this make sense? – Analytics_TM Mar 11 '19 at 23:47
  • Ah okay! I thought you wanted to sample from the already filtered dataframe. I'm traveling rn. Give me some time and I'll look into how to incorporate both conditions together. – panktijk Mar 11 '19 at 23:53
  • So another way to do this would be to keep generating samples until your condition is satisfied. I'd still suggest you go ahead with the sequential approach i.e. calculate `df3` first (like you're currently doing) and then sample rows from that. – panktijk Mar 12 '19 at 01:24
  • The only caveat here would be- we might end up selecting random sample in which the cumulative sum condition would be true but we may not get enough color condition which should be checked while selecting the sample. – Analytics_TM Mar 12 '19 at 02:28
  • Here, I think we are assuming that `sample_size` is given, but it is not mentioned by the OP. – hacker315 Mar 12 '19 at 03:27
  • @hacker315 Since the question mentioned 65%, I assumed that the fraction was 65% out of something. Otherwise it doesn't really make sense to say 65% without knowing the total size. – panktijk Mar 12 '19 at 20:31
  • @AnalyticsTeam Updated the answer – panktijk Mar 13 '19 at 00:08
  • @panktijk Thanks for all the effort! This is really helpful and to answer your question re: sample size- the sample size would be dependent on the cumulative sum condition so the sample chosen would have the total sum of qty (selected rows) less than or equals to 18 and 65% of those selected rows should have color yellow. – Analytics_TM Mar 13 '19 at 20:40
  • @panktijk Thanks for the response.While the code works as required, however could you please confirm if the `sample_size=len(df)`, would the code still pull meaningful result. I have tried this on my end but was facing issues. – Analytics_TM Mar 19 '19 at 03:34
0

I would use this package to over sample your yellows into a new sample that has the balance you want:

https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html

From there just randomly select items and check sum until you have the set you want.

Something less time complex would be binary searching a range the length of your data frame, and using the binary search term as your sample size, until you get the cumsum you want. The assumes the feature is symmetrically distributed.

John R
  • 1,505
  • 10
  • 18
0

I think this example help you. I add columns df2['yellow_rate'] and calculate rate. You only check df2.iloc[df2.shape[0] - 1]['yellow_rate'] value.

df1=pd.DataFrame({'id':['A','B','C','D','E','G','H','I','J'],'color':['red','bule','green','yellow','yellow','yellow','orange','yellow','yellow'], 'qty':[5,2, 3, 4, 7, 6, 8, 1, 5]})
df2 = df1.sample(n=df1.shape[0])
df2['yellow_rate'] =  df2[df2.qty.cumsum() <= 18]['color'].apply( lambda x : 1 if x =='yellow' else 0)
df2 = df2.dropna().append(df2.sum(numeric_only=True)/ df2.count(numeric_only=True), ignore_index=True)
mtgarden
  • 77
  • 1
  • 5
  • Hi @mtgarden, thanks for this. Looks like the code is only making sure that the `cumsum of qty<=18` but doesn’t ensure that 65% of the resulting dataframe contains ‘Yellow’. We would require both conditions to be true simultaneously. Is there something else we can try. – Analytics_TM Mar 19 '19 at 03:22