One helpful idea here is that itertools.combinations returns combinations in lexicographic order, so if the input sequence is sorted, each combination it yields is sorted too, and the first element only ever grows as iteration proceeds.
This cuts down the number of combinations that actually need to be evaluated: every tuple satisfies item[0] <= item[1] <= item[2] <= item[3], so once item[0] > truehi, every element of that combination, and of every later combination, exceeds truehi, and no further combination can meet the conditions.
This allows us to stop the loop early. In some tests I ran with size 100 and Gaussian-distributed test data, this skipped the last 30% or so of the combinations.
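As a minimal sketch of that early break with plain itertools.combinations (toy data and thresholds here, just to show the shape of the loop; the full listing below works on the DataFrame):

import itertools

# toy, sorted stand-in for df['Value']
values = sorted([3, 7, 12, 18, 22, 26, 28, 31, 40, 44])
truelow, truehi = 10, 25

meets = 0
for item in itertools.combinations(values, 4):
    if item[0] > truehi:        # sorted input: every later combination starts even higher,
        break                   # so nothing after this point can meet the conditions
    if 4 * truelow <= sum(item) <= 4 * truehi:
        meets += 1
print(meets)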
Extending this idea further, I wrote a custom version of the combinations generator that applies the same kind of pruning to the second and third positions (indices 1 and 2) as well. That shaves off another 40% or so.
Some other ideas that improve runtime are to compute the number of combinations from the logarithm of n choose 4 instead of iterating over all of them (hinted at in a comment by @crissal), and to use np.sum instead of np.average (comment by @Sam cd).
For size 100, this reduces the runtime on my machine from 42s to 7.5s.
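As a side note, on Python 3.8+ math.comb gives the exact count directly, and math.lgamma gives the same logarithm without ever building the huge factorials. A small sketch of both (the listing below sticks with math.factorial, which also works):

import math

n, k = 700, 4
exact = math.comb(n, k)    # exact count, Python 3.8+
log10_combs = (math.lgamma(n + 1) - math.lgamma(n - k + 1) - math.lgamma(k + 1)) / math.log(10)
print(f"C({n},{k}) = {exact:,}  (about 10**{log10_combs:.4f})")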
With size 200, I measure a runtime of 217s with the algorithm below, which skipped evaluation of 69.3% of the combinations. That is about 29 times longer than the 7.5s size-100 run, even though the total number of combinations is only about 16.5 times larger.
With size 300, runtime is about 514s, and in one sample, it skips
85% of the combinations.
With size 700, runtime is about 21,858s, and in one sample, it skips
79% of combinations.
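Those combination counts and the roughly 16.5x jump from n=100 to n=200 can be cross-checked directly with math.comb (Python 3.8+):

import math

for n in (100, 200, 300, 700):
    print(f"n={n}: {math.comb(n, 4):,} combinations")
print(f"ratio n=200 vs n=100: {math.comb(200, 4) / math.comb(100, 4):.1f}")

The complete version with the custom generator: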
import math

import numpy as np
import pandas as pd

setsz = 200  # 100; 700 is the target, but that is too many combinations
df = pd.DataFrame(np.random.randint(0, 100, size=(setsz, 2)),
                  columns=['Value', 'Other'])
truehi = 25
truelow = 10

# With combinations, this is a factorials game:
# combs has n! / ((n-4)! * 4!) combinations;
# about 10**10 for n=700
# n=100 is 3_921_225; lapse of 42.3s; sample 159_004 meets
# combs = sum(1 for e in itertools.combinations(df['Value'], 4))
log_fact_n = math.log(math.factorial(setsz), 10)
log_fact_n_4 = math.log(math.factorial(setsz - 4), 10)
log_fact_4 = math.log(math.factorial(4), 10)
log_combs = log_fact_n - log_fact_n_4 - log_fact_4

meets = 0

def c_combinations(iterable, r, vmax):
    # Modified from itertools.combinations
    # combinations('ABCD', 2) --> AB AC AD BC BD CD
    # combinations(range(4), 3) --> 012 013 023 123
    pool = tuple(iterable)
    n = len(pool)
    if r > n or r < 4:
        return
    indices = list(range(r))
    yield tuple(pool[i] for i in indices)
    while True:
        # Advance to the next combination in lexicographic order.
        for i in reversed(range(r)):
            if indices[i] != i + n - r:
                break
        else:
            return
        indices[i] += 1
        for j in range(i + 1, r):
            indices[j] = indices[j - 1] + 1
        # pool is sorted, so once the first element exceeds vmax,
        # no later combination can qualify at all.
        if pool[indices[0]] > vmax:
            return
        sum_pool_01 = pool[indices[0]] + pool[indices[1]]
        if sum_pool_01 > 2 * vmax:
            # Skip the rest of this indices[0] block: push indices 1..r-1
            # to their maximum so the next advance increments indices[0].
            for i in reversed(range(1, r)):
                indices[i] = i + n - r
            continue
        if sum_pool_01 + pool[indices[2]] > 3 * vmax:
            # Same idea one level deeper: force the next advance to
            # increment indices[1].
            for i in reversed(range(2, r)):
                indices[i] = i + n - r
            continue
        yield tuple(pool[i] for i in indices)

first = None
for i, item in enumerate(c_combinations(sorted(df['Value']), 4, truehi)):
    total = np.sum(item)   # sum, not average: 4*truelow <= total <= 4*truehi
    if 4 * truelow <= total <= 4 * truehi:
        if first is None:
            first = i
        meets = meets + 1
    if item[0] > truehi:
        break

print(f"{meets:,} found in 10**{log_combs:,.6} combinations")
print(f"First match {first:,}, last iteration {i:,}")
# 5,711,643 found in 10**7.8108 combinations
# First match 87, last iteration 19,889,389
One other thing to consider: with very large datasets, estimating the number
of combinations meeting certain conditions could also be done by using sampling
techniques. Why run through billions of combinations when you could first select
a random subset and run through millions of combinations?
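A rough sketch of that sampling idea, under the assumption that an estimate is good enough: draw random 4-element subsets, measure the hit rate, and scale it by the total number of combinations (sizes and thresholds here are illustrative, not taken from the code above):

import math
import random

import numpy as np

values = np.sort(np.random.randint(0, 100, size=700))   # stand-in for df['Value']
truelow, truehi = 10, 25
n_samples = 1_000_000                                    # millions instead of billions

hits = 0
idx = range(len(values))
for _ in range(n_samples):
    item = values[random.sample(idx, 4)]                 # one random combination
    if 4 * truelow <= item.sum() <= 4 * truehi:
        hits += 1

total_combs = math.comb(len(values), 4)
estimate = hits / n_samples * total_combs
print(f"estimated {estimate:,.0f} of {total_combs:,} combinations meet the conditions")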