python/pandas: random sampling does not work in for loop

Question

I am working on a pandas data frame and I am trying to subset my data frame such that the cumulative sum of the column is not greater than 18 and then the percentage of yellow color selected should not be less than 65% and then trying to run multiple iterations of the same. However sometimes loop goes into infinite loop and sometime it does produce the results but we get the same result in every iteration.

Everything after the while loop was taken from the below post Python random sample selection based on multiple conditions

df=pd.DataFrame({'id':['A','B','C','D','E','G','H','I','J','k','l','m','n','o'],'color':['red','red','orange','red','red','red','red','yellow','yellow','yellow','yellow','yellow','yellow','yellow'], 'qty':[5,2, 3, 4, 7, 6, 8, 1, 5,2, 3, 4, 7, 6]})

df_sample = df

for x in range(2):
    sample_s = df.sample(n=df.shape[0])
    sample_s= sample_s[(sample_s.qty.cumsum()<= 30)]
    sample_size=len(sample_s)
    while sum(df['qty']) > 18:
        yellow_size = 0.65
        df_yellow = df[df['color'] == 'yellow'].sample(int(yellow_size*sample_size))
        others_size = 1 - yellow_size
        df_others = df[df['color'] != 'yellow'].sample(int(others_size*sample_size))
        df = pd.concat([df_yellow, df_others]).sample(frac=1)
    print df

This is how I get the result when it works wherein both the results are same.

   color id  qty
     red  H    2
  yellow  n    3
  yellow  J    5
     red  G    2
  yellow  I    1
     red  D    4

    color id  qty
     red  H    2
  yellow  n    3
  yellow  J    5
     red  G    2
  yellow  I    1
     red  D    4

I am really hoping if someone could please help to resolve the issue.

Your question and the linked question are not really clear. Are you looking to find _all_ possible subsets that fit _both_ criteria? Or just one? for example: Is (Yellow, n, 3) an acceptable subset? It meets both criteria... Or is this an optimization where you are looking for the best subset for some criteria that is not mentioned? — AirSquid, Mar 20 '19 at 04:50
Hi @JeffH, sorry about the confusion.I want to find all possible subsets that fit both criteria simultaneously. Thanks — Analytics_TM, Mar 20 '19 at 04:54
Mmmmkay. So, you could/should do a little research on dynamic programming, which can be used to make subsets. By random sampling, you may/may not ever find one or all of them. There is no guarantee. I think you are making life tough on yourself by doing this in a DataFrame. Is there a limit on the size of the original set? — AirSquid, Mar 20 '19 at 05:00
@JeffH in terms of limit - are you asking about the no. of rows? the actual dataset will have around 90k to 100k rows. i am relatively new to python programming, if you could please guide or share some references to tackle this that would be really great — Analytics_TM, Mar 20 '19 at 05:15
Yowza. If the total number of rows was small (maybe 50) I was going to suggest using a brute force approach and just enumerating all of the subsets, which could be done with `itertools combinations`. But with 100K, enumeration blows up. You essentially have what is called a knapsack problem (google it). But you are trying to find all possible valid solutions. This is relevant [https://stackoverflow.com/questions/30007102/number-of-all-combinations-in-knapsack-task] — AirSquid, Mar 20 '19 at 05:25
I really think you are going to have problems tackling this for a dataset of this size if the data above is representative. Let's just say that 20% of your data is "Yellow", so 20K items and if around 4 of those would fit within the sum limit you have, then just looking at those 20K items along renders 100 trillion possible legal subsets (choose(20k, 4)). And we haven't even considered any of the other 80% of the data..... This problem as stated is intractable. If you just want to find _some_ of the legal subsets, that is a different story. — AirSquid, Mar 20 '19 at 05:45
@JeffH I understand the point you are making as finding all the possible subsets would blow up , however what if I try to find only 20 subset of my original data set which satisfies the above two conditions, how would that work out. — Analytics_TM, Mar 20 '19 at 07:51

python/pandas: random sampling does not work in for loop

0 Answers0