We can set this up as a constraint optimization problem, which has four parts:
- Creating the variables: We represent our choice of rows with a vector of booleans. True in the k-th entry means row k is selected; False means it is not.
- Specifying the constraints: We need to ensure the sum of the target values for the selected rows meets or exceeds the threshold. This is done by computing the dot product of the target column and the row-selection vector.
- Specifying the objective function: Here the objective is the total cost of the selected rows, i.e. the dot product of the cost column and the row-selection vector.
- Solving: In this case, minimize the objective function subject to the constraints.
There are several Python operations research (OR) libraries for solving this type of problem. This solution uses Google OR-Tools, an "open-source software suite for optimization".
Solving with an optimization package is much faster than alternative solutions that perform an exhaustive search over all possible row selections. An exhaustive search has exponential computational complexity, O(2^nrows), and is thus only suitable for a small number of rows (e.g. fewer than ~30).
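To see why the exhaustive approach breaks down so quickly, here is a small standalone illustration (not part of the solution itself): the number of candidate row subsets doubles with every added row.

```python
# Number of subsets an exhaustive search must consider for n rows: 2**n.
for nrows in (10, 20, 30, 40):
    print(f'{nrows} rows -> {2 ** nrows:,} subsets')
```

At 30 rows there are already over a billion subsets to score, which is why the brute-force approach is limited to small inputs.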
Code
import numpy as np
import pandas as pd
# Google or-tools solver
from ortools.sat.python import cp_model
import timeit
def solve(df, threshold):
    '''
    Uses the OR-Tools CP-SAT solver to select rows.
    '''
    weights = df['Target']
    cost = df['Cost']
    names = df['Names']

    # Create the model.
    model = cp_model.CpModel()

    # Step 1: Create the variables
    # Array containing row selection flags i.e. True if row k is selected, False otherwise
    # Note: treated as 1/0 in arithmetic expressions
    row_selection = [model.NewBoolVar(f'{i}') for i in range(df.shape[0])]

    # Step 2: Define the constraints
    # The sum of the weights for the selected rows should be >= threshold
    model.Add(weights.dot(row_selection) >= threshold)

    # Step 3: Define the objective function
    # Minimize the total cost of the selected rows
    model.Minimize(cost.dot(row_selection))

    # Step 4: Create the solver and solve
    solver = cp_model.CpSolver()
    solver.Solve(model)

    # Get the rows selected
    rows = [row for row in range(df.shape[0]) if solver.Value(row_selection[row])]
    return names[rows]
# Setup
df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd', 'e', 'f'],
'Target': [35, 15, 12, 8, 7, 5],
'Cost': [15, 40, 30, 30, 25, 10]})
print(solve(df, 40))
# Output:
0 a
5 f
Name: Names, dtype: object
Performance
Current solution (based upon OR-Tools)
%timeit solve(df, 40)
3.13 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Compare this to an exhaustive-search algorithm such as Scott Boston's solution:
from itertools import combinations, chain
df = pd.DataFrame(
{
"Names": ["a", "b", "c", "d", "e", "f"],
"Target": [35, 15, 12, 8, 7, 5],
"Cost": [15, 40, 30, 30, 25, 10],
}
)
def powerset(iterable):
"powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
s = list(iterable)
return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
%timeit min( (df.loc[list(i)] for i in powerset(df.index) if df.loc[list(i), "Target"].sum() >= 40), key=lambda x: x["Cost"].sum(),)
64.4 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Thus, using OR-Tools was ~20x faster than an exhaustive search (3.13 ms vs. 64.4 ms).
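As a sanity check (a small standalone sketch, not part of either solution), brute-forcing the same six-row data with plain lists confirms the optimum that OR-Tools found: rows a and f, total cost 25.

```python
from itertools import chain, combinations

names  = ['a', 'b', 'c', 'd', 'e', 'f']
target = [35, 15, 12, 8, 7, 5]
cost   = [15, 40, 30, 30, 25, 10]

# Enumerate every subset of row indices, keep those meeting the threshold,
# then take the cheapest.
subsets = chain.from_iterable(combinations(range(6), r) for r in range(7))
feasible = [s for s in subsets if sum(target[i] for i in s) >= 40]
best = min(feasible, key=lambda s: sum(cost[i] for i in s))
print([names[i] for i in best], sum(cost[i] for i in best))  # ['a', 'f'] 25
```

Both approaches agree on the answer; they differ only in how much work they do to reach it.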