
I'm working with a dataset of around 700 observations and would like to calculate how many combinations of 4 observations have a mean value between a high and low value.

I've created code that can do this, but applying it to my larger dataset results in billions of combinations and takes several days to run. Surely there is a faster or more efficient way to do this?

Here is what I've tried:

import pandas as pd
import numpy as np
import itertools
from numpy import mean


df = pd.DataFrame(np.random.randint(0,100,size=(700, 2)), columns=['Value', 'Other'])

truehi = 25
truelow = 10

# counting by iterating touches all ~9.9 billion combinations for n=700
combs = sum(1 for e in itertools.combinations(df['Value'], 4))

meets = 0
for item in itertools.combinations(df['Value'], 4):
    avg = mean(item)
    if (avg <= truehi) & (avg >= truelow):
        meets = meets + 1

I found this question that looks like it should do what I need, but I'm having trouble adapting it to my specific case. If anyone could help that would be incredible: Efficiently count sets in a cartesian product that sum above a specific number

  • What did you find out when you profiled it? – superb rain Mar 09 '21 at 17:54
  • Are you sure about your numbers? Shouldn't just be "hundreds of millions of combinations" but "**thousands** of millions of combinations" and after some experimenting I estimate it to not take "several hours" but "several **days**". – superb rain Mar 09 '21 at 18:01
  • Citing [itertools.combinations](https://docs.python.org/3.7/library/itertools.html#itertools.combinations), "The number of items returned is `n! / r! / (n-r)!`", so in your case 700! / 4! / 696!, so 9,918,641,075 – crissal Mar 09 '21 at 18:03
  • Thank you both for your comments. I've updated the question with the number of combinations and runtime. – Ace_Puppies Mar 09 '21 at 18:56
  • Just to save a step... if it's always gonna be 4 items, you don't need to calculate the mean, just the total – Sam Cohen-Devries Mar 09 '21 at 19:02
  • Maybe also see if you can do some pre-filtering so you don't have to evaluate all combinations... for example, assuming there are no negative values, filter out anything with a value higher than `truehi * 4`. You can also label all values over `truehi`, and filter out any combination of only that set, and vice versa for `truelow` – Sam Cohen-Devries Mar 09 '21 at 19:07
  • And finally, since you're working with sets, maybe consider `join` operations, either in Pandas or by hooking this up to a database, rather than looping – Sam Cohen-Devries Mar 09 '21 at 19:11
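
A minimal sketch of the pre-filtering and totals-instead-of-means ideas from the comments above; it assumes the values are non-negative (as the pre-filter comment notes) and reuses `df`, `truehi`, and `truelow` from the question:

import itertools

# Pre-filter from the comments: with non-negative values, a value above
# 4 * truehi can never appear in a 4-item set whose mean is <= truehi.
candidates = [v for v in df['Value'] if v <= 4 * truehi]

meets = 0
for item in itertools.combinations(candidates, 4):
    # Totals instead of means: the mean lies in [truelow, truehi]
    # exactly when the sum of 4 items lies in [4*truelow, 4*truehi].
    if 4 * truelow <= sum(item) <= 4 * truehi:
        meets += 1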

1 Answer


One helpful idea here is that itertools.combinations returns combinations in lexicographic order. So, if the input sequence is sorted, each combination tuple is nondecreasing, and the tuples themselves come out in sorted order, too.
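
A quick illustration of that ordering on a toy input (not part of the original code):

import itertools

# Sorted input: each tuple is nondecreasing, and the tuples
# themselves come out in lexicographic (sorted) order.
print(list(itertools.combinations(sorted([7, 1, 4, 9]), 3)))
# [(1, 4, 7), (1, 4, 9), (1, 7, 9), (4, 7, 9)]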

This helps reduce the number of combinations to evaluate: with a sorted input, item[0] <= item[1] <= item[2] <= item[3], so as soon as item[0] > truehi, every element of that combination exceeds truehi, and, by the ordering, so does the first element of every later combination. None of them can meet the condition.

This allows us to stop the loop early. For some tests I ran on test data with a Gaussian distribution, it skipped the last 30% or so of combinations for size 100.
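
A minimal sketch of just this early exit, reusing df, truehi, and truelow from the question (not the full code below):

import itertools
from numpy import mean

meets = 0
for item in itertools.combinations(sorted(df['Value']), 4):
    if item[0] > truehi:
        break                  # every later combination fails too
    if truelow <= mean(item) <= truehi:
        meets += 1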

To extend this idea even further, I wrote a custom version of combinations that applies the same cutoff at levels 1 and 2 of the index tuple (the second and third elements). That shaves off another 40% or so.

Other ideas that improve runtime performance: calculate the number of combinations using logarithms of factorials for n choose 4, instead of iterating over all combinations (hinted at in @crissal's comment), and use np.sum instead of the mean (per @Sam Cohen-Devries's comment).
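
For instance, the count alone can be computed without iterating at all; math.comb (Python 3.8+) gives it exactly, and the log-factorial route used in the code below works on older versions too:

import math

n = 700
print(math.comb(n, 4))    # 9,918,641,075 combinations, computed exactly
# Equivalent via logarithms of factorials (works before Python 3.8):
log_combs = (math.log(math.factorial(n), 10)
             - math.log(math.factorial(n - 4), 10)
             - math.log(math.factorial(4), 10))
print(f"10**{log_combs:.4f}")    # ~10**9.9964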

For size 100, this reduces the runtime on my machine from 42s to 7.5s.

With size 200, I measure 217s runtime with the algorithm below, not evaluating 69.3% of the combinations. This is about 29 times longer, even though the number of overall combinations is only about 16.5 times bigger.

With size 300, runtime is about 514s, and in one sample, it skips 85% of the combinations.

With size 700, runtime is about 21,858s, and in one sample, it skips 79% of combinations.

import math
import pandas as pd
import numpy as np

setsz = 200  # tested with 100; 700 is the target, too many combinations
df = pd.DataFrame(np.random.randint(0, 100, size=(setsz, 2)),
                  columns=['Value', 'Other'])
truehi = 25
truelow = 10
# With combinations, this is a factorials game:
# combs has n! / ((n-4)! * 4!) combinations;
# about 10**10 for n=700
# n=100 is 3_921_225; lapse of 42.3s; sample 159_004 meets
# combs = sum(1 for e in itertools.combinations(df['Value'], 4))
log_fact_n = math.log(math.factorial(setsz), 10)
log_fact_n_4 = math.log(math.factorial(setsz-4), 10)
log_fact_4 = math.log(math.factorial(4), 10)
log_combs = log_fact_n - log_fact_n_4 - log_fact_4
meets = 0

def c_combinations(iterable, r, vmax):
    # Modified from itertools.combinations
    # combinations('ABCD', 2) --> AB AC AD BC BD CD
    # combinations(range(4), 3) --> 012 013 023 123
    pool = tuple(iterable)
    n = len(pool)
    if r > n or r < 4:
        # specialized for r >= 4: the pruning below reads indices[1] and indices[2]
        return
    indices = list(range(r))
    yield tuple(pool[i] for i in indices)
    while True:
        for i in reversed(range(r)):
            if indices[i] != i + n - r:
                break
        else:
            return
        indices[i] += 1
        for j in range(i+1, r):
            indices[j] = indices[j-1] + 1
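        # The pool is sorted and the output is lexicographic, so once the
        # smallest chosen element exceeds vmax nothing later can qualify;
        # for a too-large prefix, fast-forward the trailing indices so the
        # next iteration advances an earlier position instead.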
        if pool[indices[0]] > vmax:
            return
        sum_pool_01 = pool[indices[0]] + pool[indices[1]]
        if sum_pool_01 > 2*vmax:
            for i in reversed(range(1, r)):
                indices[i] = i + n - r
            continue
        if sum_pool_01 + pool[indices[2]] > 3*vmax:
            for i in reversed(range(2, r)):
                indices[i] = i + n - r
            continue
        yield tuple(pool[i] for i in indices)

first = None
for i,item in enumerate(c_combinations(sorted(df['Value']), 4, truehi)):
    total = np.sum(item)    # total of the 4 values; avoids shadowing built-in sum
    if 4*truelow <= total <= 4*truehi:
        if first is None:
            first = i
        meets = meets + 1
    if item[0] > truehi:
        break
print(f"{meets:,} found in 10**{log_combs:,.6} combinations")
print(f"First match {first:,}, last iteration {i:,}")
# 5,711,643 found in 10**7.8108 combinations
# First match 87, last iteration 19,889,389

One other thing to consider: with very large datasets, you could estimate the number of combinations that meet the conditions by sampling. Why run through billions of combinations when you could draw a random subset of millions of combinations and extrapolate?
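
A minimal sketch of such a sampling estimate (the sample size is illustrative; reuses df, truehi, and truelow from above and extrapolates the sample hit rate to the full count):

import math
import numpy as np

rng = np.random.default_rng(0)
vals = df['Value'].to_numpy()
n_samples = 1_000_000

hits = 0
for _ in range(n_samples):
    # draw a random 4-subset without replacement
    pick = vals[rng.choice(len(vals), size=4, replace=False)]
    if 4 * truelow <= pick.sum() <= 4 * truehi:
        hits += 1

# scale the observed rate up to all C(n, 4) combinations
estimate = hits / n_samples * math.comb(len(vals), 4)
print(f"estimated matches: {estimate:,.0f}")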

VirtualScooter