Choose r outcomes from n possibilities efficiently in Pandas

Question

I have 50 years of data. I need to choose a combination of 30 years such that the mean of the corresponding values reaches a particular threshold, but the number of possible combinations, 50C30, comes out to 47129212243960. How can I compute this efficiently?

          Prs_100      
  Yrs                                                 
  2012  425.189729  
  2013  256.382494  
  2014  363.309507  
  2015  578.728535  
  2016  309.311562  
  2017  476.388839  
  2018  441.479570  
  2019  342.267756  
  2020  388.133403  
  2021  405.007245  
  2022  316.108551  
  2023  392.193322  
  2024  296.545395  
  2025  467.388190  
  2026  644.588971  
  2027  301.086631  
  2028  478.492618  
  2029  435.868944  
  2030  467.464995  
  2031  323.465049  
  2032  391.201598  
  2033  548.911349  
  2034  381.252838  
  2035  451.175339  
  2036  281.921215  
  2037  403.840004  
  2038  460.514250  
  2039  409.134409  
  2040  312.182576 
  2041  320.246886  
  2042  290.163454  
  2043  381.432168  
  2044  259.228592  
  2045  393.841815  
  2046  342.999972  
  2047  337.491898  
  2048  486.139010  
  2049  318.278012  
  2050  385.919542  
  2051  309.472316  
  2052  307.756455  
  2053  338.596315  
  2054  322.508536  
  2055  385.428138  
  2056  339.379743  
  2057  420.428529  
  2058  417.143175 
  2059  361.643381  
  2060  459.861622  
  2061  374.359335

I only need one 30-year combination whose mean Prs_100 value reaches a certain threshold; once it is found, I can stop evaluating further combinations. Searching SO, I found an approach using the apriori algorithm, but I couldn't figure out what support values to use with it.

I have tried Python's itertools.combinations:

 list(combinations(dftest.index, 30))

but it did not work in this case.
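
One likely reason for the failure (an assumption, since no error message is given) is that `list(...)` tries to materialize every 30-year tuple in memory before any of them is examined. A minimal sketch of the alternative: count the combinations with `math.comb` (Python 3.8+) and consume the iterator lazily. The `years` range is a hypothetical stand-in for `dftest.index`.

```python
from itertools import combinations, islice
from math import comb

# Stand-in for dftest.index (assumption: 50 consecutive years, 2012-2061).
years = range(2012, 2062)

# Number of 30-element subsets of 50 items, computed without enumerating them.
total = comb(len(years), 30)
print(total)  # 47129212243960

# combinations() is lazy: tuples are produced one at a time, so taking the
# first few is cheap, whereas list(combinations(...)) would try to build
# all of them in memory.
first_three = list(islice(combinations(years, 30), 3))
print(len(first_three), len(first_three[0]))  # 3 30
```

Iterating lazily only removes the memory problem; visiting all 47 trillion tuples would still take far too long, so any practical search has to stop early or sample.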

Expected outcome: suppose I find a set of 30 years whose mean Prs_100 value is more than 460; I would then save those 30 years as the result. How can I do this?

I think it's a little unclear what your expected result is. Can you take the n largest? – Andy Hayden, Jan 30 '19 at 07:16
n is 50 in this case, which is the total number of years. Here is a question that should help you understand the problem more clearly - https://stackoverflow.com/questions/4941753/is-there-a-math-ncr-function-in-python – Bing, Jan 30 '19 at 07:18
and this too - https://www.geeksforgeeks.org/itertools-combinations-module-python-print-possible-combinations/ – Bing, Jan 30 '19 at 07:20
one way is to use np.random.choice a few times (until it's over the threshold... but it may not get there). Easiest solution is to use the 30 largest years, right? – Andy Hayden, Jan 30 '19 at 07:23
Actually the Prs_100 values don't depend on which years are taken, so the right combination might turn up in the first 5 combinations or in the last 5, and using np.random.choice a few times may or may not converge to a result – Bing, Jan 30 '19 at 07:26

score 1 · Answer 1 · answered Jan 30 '19 at 07:12


You can use numpy's random.choice:

In [11]: df.iloc[np.random.choice(np.arange(len(df)), 3)]
Out[11]:
         Prs_100
Yrs
2023  392.193322
2047  337.491898
2026  644.588971

answered Jan 30 '19 at 07:12

Andy Hayden


The issue with this method is that random choice could pick the same row twice, which would make the whole subset useless – Bing Jan 30 '19 at 11:04
@Bing you can use the replace=False kwarg – Andy Hayden Jan 30 '19 at 15:38
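
A minimal sketch of the `replace=False` suggestion combined with the "try until it fits" idea from the comments. The function name `sample_subset`, the `max_tries` limit, and the fixed seed are illustrative assumptions, not part of the original answer; the bounds 391 and 400 are the range the asker gives later in the thread.

```python
import numpy as np

prs_100 = np.array([
    425.189729, 256.382494, 363.309507, 578.728535, 309.311562,
    476.388839, 441.47957, 342.267756, 388.133403, 405.007245,
    316.108551, 392.193322, 296.545395, 467.38819, 644.588971,
    301.086631, 478.492618, 435.868944, 467.464995, 323.465049,
    391.201598, 548.911349, 381.252838, 451.175339, 281.921215,
    403.840004, 460.51425, 409.134409, 312.182576, 320.246886,
    290.163454, 381.432168, 259.228592, 393.841815, 342.999972,
    337.491898, 486.13901, 318.278012, 385.919542, 309.472316,
    307.756455, 338.596315, 322.508536, 385.428138, 339.379743,
    420.428529, 417.143175, 361.643381, 459.861622, 374.359335])
years = np.arange(2012, 2012 + len(prs_100))  # 2012 .. 2061


def sample_subset(values, r, low, high, max_tries=10_000, seed=0):
    """Draw random r-subsets (no repeats) until the mean lies in (low, high)."""
    rng = np.random.default_rng(seed)
    for _ in range(max_tries):
        # replace=False guarantees that no position is picked twice.
        idx = rng.choice(len(values), size=r, replace=False)
        if low < values[idx].mean() < high:
            return np.sort(idx)
    return None  # nothing found within max_tries


idx = sample_subset(prs_100, r=30, low=391, high=400)
if idx is not None:
    print(years[idx].tolist())
    print(prs_100[idx].mean())
```

Random sampling gives no guarantee of success, which is why the sketch returns `None` after `max_tries` attempts instead of looping forever.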

run-out · Accepted Answer · 2019-01-31T18:43:09.350


My previous answer was off base, so I'm going to try again. Re-reading your question, it looks like you want a single set of 30 years whose mean Prs_100 value is greater than 460.

The following code can do this, but when I ran it, I started having difficulty finding results once the target mean went above about 415.

After running it, you get a list of years, 'years_list', and a list of values, 'Prs_100_list', that meet the mean criterion (the example below uses the range 391 to 400 instead of mean > 460).

Here is my code; I hope it is in the area of what you are looking for.

import time
from itertools import combinations

import numpy as np
import pandas as pd

# start a timer
start = time.time()

# array of values to work with, corresponding to the years 2012 - 2061
prs_100 = np.array([
       425.189729, 256.382494, 363.309507, 578.728535, 309.311562,
       476.388839, 441.47957 , 342.267756, 388.133403, 405.007245,
       316.108551, 392.193322, 296.545395, 467.38819 , 644.588971,
       301.086631, 478.492618, 435.868944, 467.464995, 323.465049,
       391.201598, 548.911349, 381.252838, 451.175339, 281.921215,
       403.840004, 460.51425 , 409.134409, 312.182576, 320.246886,
       290.163454, 381.432168, 259.228592, 393.841815, 342.999972,
       337.491898, 486.13901 , 318.278012, 385.919542, 309.472316,
       307.756455, 338.596315, 322.508536, 385.428138, 339.379743,
       420.428529, 417.143175, 361.643381, 459.861622, 374.359335])

# build dataframe with prs_100 as index and years as values, so that years can be returned easily.
df = pd.DataFrame(list(range(2012, 2062)), index=prs_100, columns=['years'])

df.index.name = 'Prs_100'

# set combination parameters
r = 30
n = len(prs_100)

Prs_100_list = []
years_list = []
count = 0

for p in combinations(prs_100, r):
    mean = np.mean(p)  # compute the mean once per combination
    if 391 < mean < 400:
        Prs_100_list.append(p)
        # pass a list so that .loc unambiguously selects several labels
        years_list.append(df.loc[list(p), 'years'].values.tolist())
        # build in some exit
        count += 1
        if count > 100:
            break

# report how many combinations were collected and how long it took
print(f'found {count} combinations in {time.time() - start:.2f} s')

edited Jan 31 '19 at 18:43

answered Jan 30 '19 at 19:43

run-out


The mean of your highest 30 values is 435.88. You will not find 30 years with a mean value greater than 460. If you sort your numpy array in descending order you will get results quickly. -np.sort(-prs_100) – run-out Jan 30 '19 at 22:04
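
The bound mentioned in this comment can be checked before any search is started: the largest mean any 30-year subset can reach is the mean of the 30 largest values, and the smallest is the mean of the 30 smallest. A short sketch of that check; the helper name `mean_is_reachable` is an illustrative assumption.

```python
import numpy as np

prs_100 = np.array([
    425.189729, 256.382494, 363.309507, 578.728535, 309.311562,
    476.388839, 441.47957, 342.267756, 388.133403, 405.007245,
    316.108551, 392.193322, 296.545395, 467.38819, 644.588971,
    301.086631, 478.492618, 435.868944, 467.464995, 323.465049,
    391.201598, 548.911349, 381.252838, 451.175339, 281.921215,
    403.840004, 460.51425, 409.134409, 312.182576, 320.246886,
    290.163454, 381.432168, 259.228592, 393.841815, 342.999972,
    337.491898, 486.13901, 318.278012, 385.919542, 309.472316,
    307.756455, 338.596315, 322.508536, 385.428138, 339.379743,
    420.428529, 417.143175, 361.643381, 459.861622, 374.359335])
r = 30

ordered = np.sort(prs_100)      # ascending
min_mean = ordered[:r].mean()   # mean of the r smallest values
max_mean = ordered[-r:].mean()  # mean of the r largest values
print(round(max_mean, 2))       # 435.88, as stated in the comment


def mean_is_reachable(low, high):
    """True if some r-subset mean could fall inside (low, high)."""
    return low < max_mean and high > min_mean


print(mean_is_reachable(460, float('inf')))  # False: 460 is above max_mean
print(mean_is_reachable(391, 400))           # True
```

The check is only a necessary condition: it rules out impossible targets such as 460, but a target inside the bounds still has to be found by a search.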
okay, this worked but still it is taking a hell lot of time on gpu, is there any way which I could write it in optimised way so that it could run faster on gpu. – Bing Jan 31 '19 at 11:15
I think trying to get 460 mean is impossible for the dataset given. The highest mean is for 435. At 415 I could find values quickly if sorting the array in descending order. Over 418 I couldn't get results. I think that's just the way the data is. You are looking for statistical outliers in a 47 trillion combination set. – run-out Jan 31 '19 at 11:24
Actually, there was a problem in posting the data, I need the mean corresponding to the range 391-400 for the data given and also the coefficient of variation (either 15,25 and 35) – Bing Jan 31 '19 at 11:27
In that case you can get tons of list. Instead of breaking the loop, just build a list and then stop when you have enough items. – run-out Jan 31 '19 at 11:28
By the range 391-400 I mean any set of 30 values whose mean lies between 391 and 400 – Bing Jan 31 '19 at 11:29

for p in combinations(prs_100, r): if np.mean(p) > 391 and np.mean(p) < 400: Prs_100_list= p years_list = df.loc[p,'years'].values.tolist() break – run-out Jan 31 '19 at 11:30
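
The one-line loop in the comment above, together with the earlier suggestion to collect several results and then stop, can be written with a generator and `itertools.islice`. This is a sketch under two assumptions not in the original: it iterates over positions rather than over the float values (so no float-indexed DataFrame lookup is needed), and the generator name `matching_subsets` is made up for illustration.

```python
from itertools import combinations, islice

import numpy as np

prs_100 = np.array([
    425.189729, 256.382494, 363.309507, 578.728535, 309.311562,
    476.388839, 441.47957, 342.267756, 388.133403, 405.007245,
    316.108551, 392.193322, 296.545395, 467.38819, 644.588971,
    301.086631, 478.492618, 435.868944, 467.464995, 323.465049,
    391.201598, 548.911349, 381.252838, 451.175339, 281.921215,
    403.840004, 460.51425, 409.134409, 312.182576, 320.246886,
    290.163454, 381.432168, 259.228592, 393.841815, 342.999972,
    337.491898, 486.13901, 318.278012, 385.919542, 309.472316,
    307.756455, 338.596315, 322.508536, 385.428138, 339.379743,
    420.428529, 417.143175, 361.643381, 459.861622, 374.359335])
years = np.arange(2012, 2012 + len(prs_100))  # 2012 .. 2061
r = 30


def matching_subsets(values, r, low, high):
    """Lazily yield index tuples whose values have a mean inside (low, high)."""
    for idx in combinations(range(len(values)), r):
        if low < values[list(idx)].mean() < high:
            yield idx


# Take the first 5 matches and stop; later combinations are never generated.
found = list(islice(matching_subsets(prs_100, r, 391, 400), 5))
for idx in found:
    print(years[list(idx)].tolist(), round(prs_100[list(idx)].mean(), 2))
```

Because `combinations` walks the subsets in lexicographic order, the matches found this way share most of their years; random sampling gives more varied subsets.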
Actually , I know the condition but I am just asking Is there more optimised way to code it so that it could expedite the calculation on my nvidia 1080 ti – Bing Jan 31 '19 at 11:32
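
On the speed question, a common first step before reaching for a GPU is to replace the Python-level loop with array operations that evaluate many candidate subsets at once. The sketch below is not from the thread: the function name `batch_search`, the batch size, and the fixed seed are assumptions, and it samples random subsets instead of enumerating all combinations.

```python
import numpy as np

prs_100 = np.array([
    425.189729, 256.382494, 363.309507, 578.728535, 309.311562,
    476.388839, 441.47957, 342.267756, 388.133403, 405.007245,
    316.108551, 392.193322, 296.545395, 467.38819, 644.588971,
    301.086631, 478.492618, 435.868944, 467.464995, 323.465049,
    391.201598, 548.911349, 381.252838, 451.175339, 281.921215,
    403.840004, 460.51425, 409.134409, 312.182576, 320.246886,
    290.163454, 381.432168, 259.228592, 393.841815, 342.999972,
    337.491898, 486.13901, 318.278012, 385.919542, 309.472316,
    307.756455, 338.596315, 322.508536, 385.428138, 339.379743,
    420.428529, 417.143175, 361.643381, 459.861622, 374.359335])
years = np.arange(2012, 2012 + len(prs_100))  # 2012 .. 2061


def batch_search(values, r, low, high, batch=20_000, seed=0):
    """Evaluate `batch` random r-subsets at once; return those with mean in (low, high)."""
    rng = np.random.default_rng(seed)
    n = len(values)
    # Sorting a row of random keys yields a random permutation of 0..n-1,
    # so the first r columns of each row form an r-subset without repeats.
    idx = rng.random((batch, n)).argsort(axis=1)[:, :r]
    means = values[idx].mean(axis=1)  # one mean per candidate subset
    keep = (means > low) & (means < high)
    return np.sort(idx[keep], axis=1), means[keep]


subsets, means = batch_search(prs_100, r=30, low=391, high=400)
print(len(subsets))                         # number of qualifying subsets
print(years[subsets[0]].tolist(), means[0])
```

Further conditions, such as the coefficient of variation mentioned in the comments, could be added as another boolean mask computed from `values[idx]`. Whether the same array code runs faster on a GPU depends on the library used and is not tested here.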
