We can set this up as a constraint optimization problem, which has four parts:
- Creating the variables: We represent our choice of rows with a vector of booleans. True in the k-th entry means row k is selected; False means it is not.
- Specifying the constraints: We need to ensure the sum of the target values for the selected rows meets or exceeds the threshold. This is done by computing the dot product of the target column and the row-selection vector.
- Specifying the objective function: Here the objective is the total cost of the selected rows, i.e. the dot product of the cost column and the row-selection vector.
- Solving: In this case, minimize the objective function subject to the constraints.
There are several Python operations research (OR) libraries for solving this type of problem. This solution uses Google OR-Tools, an "open-source software suite for optimization".
Solving with an optimization package is much faster than alternative solutions that perform an exhaustive search over all possible row selections. An exhaustive search has exponential computational complexity, O(2^nrows), and is thus only suitable for a small number of rows (e.g. fewer than ~30).
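To see why the exhaustive approach breaks down so quickly, here is a small standalone illustration (not part of the solution itself): the number of candidate row subsets doubles with every added row.

```python
# Number of subsets an exhaustive search must consider for n rows: 2**n.
for nrows in (10, 20, 30, 40):
    print(f'{nrows} rows -> {2 ** nrows:,} subsets')
```

At 30 rows there are already over a billion subsets to score, which is why the brute-force approach is limited to small inputs.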
Code
import numpy as np
import pandas as pd
# Google or-tools solver
from ortools.sat.python import cp_model
import timeit
def solve(df, threshold):
    '''
    Uses the OR-Tools CP-SAT solver to select rows.
    '''
    weights = df['Target']
    cost = df['Cost']
    names = df['Names']

    # Create the model.
    model = cp_model.CpModel()

    # Step 1: Create the variables
    # Array containing row selection flags i.e. True if row k is selected, False otherwise
    # Note: treated as 1/0 in arithmetic expressions
    row_selection = [model.NewBoolVar(f'{i}') for i in range(df.shape[0])]

    # Step 2: Define the constraints
    # The sum of the weights for the selected rows should be >= threshold
    model.Add(weights.dot(row_selection) >= threshold)

    # Step 3: Define the objective function
    # Minimize the total cost of the selected rows
    model.Minimize(cost.dot(row_selection))

    # Step 4: Create the solver and solve
    solver = cp_model.CpSolver()
    solver.Solve(model)

    # Get the rows selected
    rows = [row for row in range(df.shape[0]) if solver.Value(row_selection[row])]
    return names[rows]
# Setup
df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd', 'e', 'f'],
'Target': [35, 15, 12, 8, 7, 5],
'Cost': [15, 40, 30, 30, 25, 10]})
print(solve(df, 40))
# Output:
0 a
5 f
Name: Names, dtype: object
Performance
Current solution (based upon OR-Tools)
%timeit solve(df, 40)
3.13 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Compare this to an exhaustive-search algorithm such as Scott Boston's solution:
from itertools import combinations, chain
df = pd.DataFrame(
{
"Names": ["a", "b", "c", "d", "e", "f"],
"Target": [35, 15, 12, 8, 7, 5],
"Cost": [15, 40, 30, 30, 25, 10],
}
)
def powerset(iterable):
"powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
s = list(iterable)
return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
%timeit min( (df.loc[list(i)] for i in powerset(df.index) if df.loc[list(i), "Target"].sum() >= 40), key=lambda x: x["Cost"].sum(),)
64.4 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Thus, using OR-Tools was ~20x faster than an exhaustive search (3.13 ms vs. 64.4 ms).
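As a sanity check (a small standalone sketch, not part of either solution), brute-forcing the same six-row data with plain lists confirms the optimum that OR-Tools found: rows a and f, total cost 25.

```python
from itertools import chain, combinations

names  = ['a', 'b', 'c', 'd', 'e', 'f']
target = [35, 15, 12, 8, 7, 5]
cost   = [15, 40, 30, 30, 25, 10]

# Enumerate every subset of row indices, keep those meeting the threshold,
# then take the cheapest.
subsets = chain.from_iterable(combinations(range(6), r) for r in range(7))
feasible = [s for s in subsets if sum(target[i] for i in s) >= 40]
best = min(feasible, key=lambda s: sum(cost[i] for i in s))
print([names[i] for i in best], sum(cost[i] for i in best))  # ['a', 'f'] 25
```

Both approaches agree on the answer; they differ only in how much work they do to reach it.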