
I have a problem which is similar to a subset sum but with a few modifications:

Subset Sum: Find all ways the following list can sum to 16:

[2,9,40,2,404,12....]

My Problem: Find all ways you can select n items (n is a discrete integer), each of which has 3 properties - length, width, and height - such that the combined total of each property is as close as possible to a specific target.


Example:

items = [(92,10,15),(8,18,34),(29,50,110) ...]
l,w,h = [150,200,180]
n = 3

We select 3 items such that the sum of the length, width, and height across all 3 items is as close as possible to 150, 200, and 180 respectively. This error can be formulated as a simple distance metric:

error = sum(abs(x-y) for x,y in zip(targets,actuals))
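
To make the metric concrete: if the selection happened to be the three example items listed above (just an illustration of the metric, not an optimal pick), the error would be:

items = [(92, 10, 15), (8, 18, 34), (29, 50, 110)]
targets = (150, 200, 180)
actuals = tuple(sum(dim) for dim in zip(*items))  # (129, 78, 159)
error = sum(abs(x - y) for x, y in zip(targets, actuals))  # 21 + 122 + 21 = 164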

This is a contrived example but the general problem I'm solving is the same, just with more variables and constraints.

One obvious solution is just an exhaustive graph search of some sort, but I expect there are superior approaches, perhaps in the constraint programming domain.

Note: The solution should execute in a few seconds. Finding an approximate solution quickly with low error is more important than finding the global optimum. It doesn't matter if we undershoot or overshoot the targets, which is why the error uses absolute distance.

What is a "low error"? I'm able to get within ~5% of the targets when selecting from 1M items, in < 3 seconds. I would prefer to get that error down if possible, but 1%-5% is acceptable.

Please take into consideration my prior attempts outlined below before recommending an approach.

Solaxun
  • See also https://stackoverflow.com/questions/1563271/3d-bin-packing-algorithm , https://stackoverflow.com/questions/2192087/3-dimensional-bin-packing-algorithms – BurnsBA Aug 14 '23 at 17:07
  • Thanks @BurnsBA - I think I know I need to use an approximation algorithm, the question is which one, and does it outperform my random search? But you are right - this does seem very close to 3D binpacking, for which some recommendations include Deep RL which is probably beyond my reach :) – Solaxun Aug 14 '23 at 18:08
  • Can you specify "as close as possible" please? E.g. say, that you are to choose between LWH of (145, 195, 180) and (140, 200, 180). You can then either argue that sum of differences is equal or you can look for something like least squares which is also common optimization method. Also, is undershooting equal to overshooting or does overshooting mean game over? Finally, n items is exact number, right? My assumption that this one is a hard constraint. – Shamis Aug 15 '23 at 08:54
  • Could you add to the question itself what "just with more variables and constraints" means exactly? From comments it looks like you are dealing with potentially millions of tuples with ~10 variables each? – tobias_k Aug 15 '23 at 09:29
  • @Shamis clarified the question - `n` is an integer so least squares is out since it's a non-convex problem. @tobias_k I was trying to provide the simplest version of the problem which captured its essence, otherwise this would be a very long question. Millions of tuples is an extreme upper bound, 1M or even 500K is more likely. There won't be many more variables per tuple, maybe 5/6 instead of 3. One constraint for example which I did not include would be an `alldiff` constraint, where each tuple has a type and the combined selection can only use one of each type. – Solaxun Aug 15 '23 at 23:26

2 Answers


Chosen Solution - Random Restart Hill Climbing

After a day of experimentation and research, the best approach I've found was also the simplest: repeatedly trying random configurations and keeping the best until I get an acceptable solution (i.e. "Random Restart Hill Climbing"). Simulated annealing works well too, but I haven't noticed a big enough improvement to justify the extra complexity compared to RRHC - though admittedly I did not spend any real effort tuning its parameters (temperature, schedule, etc.).

I should say that I'm fairly happy with this approach so far. It returns a "good enough" answer pretty quickly, is tunable between a more exhaustive search and a faster one (more vs. fewer trials), and is flexible in what I accept as "good enough", since I can just modify the cost function - in this case a simple distance between the target l,w,h and the current l,w,h. If, for example, I wanted to penalize being off on height more severely than length or width, I could simply multiply (htarget - hactual) by 2 to give more weight to that part of the cost.
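
For reference, the shape of the approach is roughly this (a minimal sketch, not my production code - the swap-based neighbour move, parameter values, and names are illustrative):

import random

def cost(selection, targets):
    # sum of absolute distances between targets and actual per-dimension totals
    actuals = [sum(item[d] for item in selection) for d in range(len(targets))]
    return sum(abs(t - a) for t, a in zip(targets, actuals))

def hill_climb(items, targets, n, iters=2000):
    # start from a random selection of n items and greedily accept improving swaps
    current = random.sample(items, n)
    best = cost(current, targets)
    for _ in range(iters):
        candidate = current[:]
        candidate[random.randrange(n)] = random.choice(items)  # may re-pick an item; fine for a sketch
        c = cost(candidate, targets)
        if c < best:
            current, best = candidate, c
    return current, best

def random_restart(items, targets, n, restarts=20):
    # keep the best local optimum found across independent restarts
    return min((hill_climb(items, targets, n) for _ in range(restarts)),
               key=lambda pair: pair[1])

# e.g. selection, err = random_restart(items, [150, 200, 180], 3)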

That said, for educational purposes I want to go through the other approaches I tried, and see if maybe someone can point out where I went wrong in my backtracking approach specifically. Or alternatively, if I didn't miss anything and maybe RRHC is just a better algorithm for this type of problem, that would be nice to know as well.

This is not the first time I've found RRHC to work well for problems where you need a "good enough" solution and finding the global optimum is "icing on the cake". In my experience it's a hammer for a lot of nails - surprisingly effective given its simplicity. One extension I might try later is running multiple search threads in parallel.

Attempted Solution - Backtracking Search

I spent quite a bit of time working on a backtracking search, but my implementation wastes time searching infeasible parts of the solution space. With enough effort I think I could come up with better heuristics to prune the search space, but for now I've given up.
The problem I ran into with the backtracking approach is the following: Let's say I want to select 4 pieces of furniture and my target l,w,h is 140,80,300.

I run the algorithm:

  • 1st selection: Furniture(30,20,120) - all constraints good
  • 2nd selection: Furniture(100,50,40) - all constraints good
  • 3rd ... etc.

"All constraints good" simply means I haven't exceeded the targets. Additionally, at each selection step, I filter the list of furniture so that any items which would exceed the remaining l,w,h are removed.

For example, after the 2nd selection we have assigned 30+100, 20+50, 120+40 = 130, 70, 160, leaving a remaining l,w,h to allocate of 140-130, 80-70, 300-160 = 10, 10, 140. Any item which has l>10, w>10, or h>140 would be filtered out before we continue the search.

The problem is that while this filtering will make sure to exclude any items that would be in conflict at the immediate next step, it does not consider that a certain state will eventually become a dead-end at later steps. If the only pieces of furniture left now all have a length of 10, they will pass this filtering step yet we know that the current arrangement cannot possibly lead to a solution because by the fourth step there is no budget left for length - we will have exhausted it on step 3 when we assign 10 leaving zero remaining length.

Hacky workaround?

I can do a form of "forward checking" at every step:

  • find the min and max for each l,w,h property in the remaining furniture
  • if we have 2 selections remaining and 140 h left to allocate, but the max h among the remaining items is 60, we know we can never arrive at a solution, since 2*60 = 120 < 140.

In general - at every step, with n selections remaining, we check (and prune the branch otherwise) that for each attribute:

  • the max for that attribute among the remaining items * n >= the amount of that attribute remaining to be allocated, and
  • the min for that attribute among the remaining items * n <= the amount of that attribute remaining to be allocated.

I think this would work, and terminate parts of the search space which will not be feasible much higher up in the search tree, but it feels hacky.
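
As a sketch, the check I have in mind looks something like this (names are mine; `remaining` is the per-attribute budget left and `steps_left` is the number of selections still to make):

def branch_feasible(candidates, remaining, steps_left):
    # prune this branch unless the remaining candidates could plausibly
    # use up the remaining budget within the selections left
    if not candidates:
        return steps_left == 0
    for d in range(len(remaining)):
        lo = min(item[d] for item in candidates)
        hi = max(item[d] for item in candidates)
        if hi * steps_left < remaining[d]:   # even max-value picks can't reach the budget
            return False
        if lo * steps_left > remaining[d]:   # even min-value picks overshoot the budget
            return False
    return True

This follows the two rules above literally; whether "can't reach the budget" should be a hard prune or a tolerance depends on how much undershoot is acceptable.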

Other Solutions Considered

Mixed Integer Linear Programming: Since I have to select n items there is a clear integer constraint, meaning pure linear solvers and, more generally, convex optimization are off the table. I thought MILP might be a possible approach, but after some research I'm not so sure. For one, I cannot think of a way to formulate this as a MILP: the only knob we have to turn is which pieces we select, and whatever l,w,h properties a piece has are a given. For that reason it seemed more of a search problem than a MILP, but that could be due to my inexperience with formulating problems in the proper form.

Constraint Satisfaction Algorithms: Backtracking search while checking constraints is actually in this family of algorithms. I think my proposed "look ahead" is vaguely reminiscent of backtracking search with the "forward checking" heuristic. There are other approaches involving arc consistency, picking the least/most constrained variables, etc., but similar to MILP it wasn't clear to me whether this problem could be formulated in a way that leverages those options.

Conclusion

So - is the proposed solution reasonable? Do you think I missed any approaches which would be a better fit (including some of the alternative approaches I tried unsuccessfully)?
Solaxun

This is a slight variation of the knapsack problem, which is very common in linear programming: picking things based on some value, with a limit on some other characteristic.

In this case, we want to use the total "miss distance" across the dimensions as the valuation (lower = better). Because we want to account for both undershooting and overshooting, we introduce a "miss" variable per dimension and constrain it to be at least the undershoot (target - sum) and at least the overshoot (sum - target). Then we just minimize the sum of the misses over all 3 dimensions.
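
Spelled out (my notation, mirroring the variable names in the code below), the model is:

minimize    sum(miss[dim] for dim in D)
subject to  miss[dim] >=  sum(boxes[box][dim] * pick[box] for box in B) - targets[dim]     for each dim
            miss[dim] >= -(sum(boxes[box][dim] * pick[box] for box in B) - targets[dim])   for each dim
            sum(pick[box] for box in B) == num_selections
            pick[box] in {0, 1}

At the optimum each miss[dim] settles on the larger of its two lower bounds, i.e. the absolute deviation for that dimension.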

Many variants are possible. You could switch to "Price is Right" scoring and not allow the dimensional totals to be greater than their limits by adding those constraints, etc.

Anyhow, this is a fairly straightforward MILP. It solves the commented-out "GO BIG" case (picking 1000 boxes out of 2000) almost instantly. (DO NOT try that with enumeration! Choose(2000, 1000) is a large number.) The problem becomes more complicated if you don't restrict the number of boxes selected, but that's another story.

Note: As stated in the comments, I need to point to a separate install of the cbc solver on my machine. If you replicate this by installing pulp, it will install a default copy of cbc and you should just need the one-liner shown below.

CODE:

# box selection

from random import randint
import pulp

num_boxes = 10
num_selections = 5
targets = {'l': 200, 'w': 600, 'h': 300}

# GO BIG
# num_boxes = 2000
# num_selections = 1000
# targets = {'l': 40_000, 'w': 120_000, 'h': 60_000}

# Make a list of boxes to pick from
boxes = [{'l': randint(20, 80), 'w': randint(100, 200), 'h': randint(50, 65)} for _ in range(num_boxes)]

# SET UP THE LP
prob = pulp.LpProblem('box_picker', sense=pulp.LpMinimize)

# SETS
D = ['l', 'w', 'h']
B = list(range(num_boxes))

# VARS
pick = pulp.LpVariable.dicts('pick', indices=B, cat=pulp.LpBinary)  # 1 if picked, else 0
miss = pulp.LpVariable.dicts('miss', indices=D)  # the amount that we missed the target by

# OBJ:  Minimize the total miss distance, all dimensions are equally weighted
prob += pulp.lpSum(miss)

# CONSTRAINTS:

# connect the "miss" variable to the selections * dimensions
# Because the "miss" variable represents an absolute values of the difference beteen target and total,
# we must constrain it for positive and negative misses
for dim in D:
    prob += miss[dim] >=   sum(boxes[box][dim] * pick[box] for box in B) - targets[dim]     # pos delta
    prob += miss[dim] >= -(sum(boxes[box][dim] * pick[box] for box in B) - targets[dim])    # neg delta

# force correct number of selections
prob += pulp.lpSum(pick) == num_selections
# check it
# print(prob)

# SOLVE IT
cbc_path = '/opt/homebrew/opt/cbc/bin/cbc'   # this is likely not necessary for a normal install
solver = pulp.COIN_CMD(path=cbc_path)
res = prob.solve(solver)

# prob.solve()    # <---this is normally OK instead of 3 lines above

# make sure we got optimal result
assert prob.status == pulp.LpStatusOptimal

tot_selections = sum(pick[b].varValue for b in B)
print(f'Selected {tot_selections} Boxes:')
for b in B:
    if pick[b].varValue > 0.1:  # screen out the zeros (or small numerical errors)
        print(boxes[b])

print('\nDimensional "Misses":')
for d in D:
    print(d, miss[d].varValue)

Output:

Result - Optimal solution found

Objective value:                152.00000000
Enumerated nodes:               0
Total iterations:               0
Time (CPU seconds):             0.00
Time (Wallclock seconds):       0.00

Option for printingOptions changed from normal to all
Total time (CPU seconds):       0.00   (Wallclock seconds):       0.00

Selected 5.0 Boxes:
{'l': 39, 'w': 136, 'h': 60}
{'l': 39, 'w': 140, 'h': 53}
{'l': 64, 'w': 125, 'h': 57}
{'l': 33, 'w': 161, 'h': 57}
{'l': 51, 'w': 156, 'h': 65}

Dimensional "Misses":
l 26.0
w 118.0
h 8.0
AirSquid
  • Thank you .. I had a strong feeling some form of LP would be a possible approach! I'll test this out shortly. For reference, `num_selections` should be 1-10 at most , but `num_boxes` could be 1-10MM. – Solaxun Aug 14 '23 at 22:05
  • What is 10MM? 10 M&M boxes or 10^12 or ?? ;) – AirSquid Aug 14 '23 at 22:39
  • 1
    Neither! 10MM == 10e6 ... sorry, years in the finance industry did this to me. – Solaxun Aug 14 '23 at 23:19
  • Roger. That still may be an issue. 10M (10 Million) binary variables is humongous for an LP, even though you have only a few constraints. You are essentially baking a 10M column matrix (with thankfully only a few rows) to hand to the solver. TBD if that works. Perhaps you will need to pre-process your "boxes" and remove duplicates or boxes that are essentially similar where the dimensions differ only slightly to reduce that significantly. – AirSquid Aug 14 '23 at 23:45
  • There are other frameworks where you can submit the `A` matrix (after you construct it to represent the problem...a different task) that may have a bit more traction. (You are executing the fundamental `Ax <= b` formulation.). `pulp` will have to build out all the equations with the sums of 10M things per row, which might not make it the best pkg to proceed, if you go down this route – AirSquid Aug 14 '23 at 23:48
  • The set of boxes is unique so can't prune there, however I could remove those that are similar enough as you suggest which will help a little bit. It's broadly distributed. Another catch is there will be different "types" of boxes in the real problem, and I may have constraints about picking only "n" types of one box. For example I need a configuration of box type a,b,c,d,e (all diff) or maybe (a,a,b,b,c). Most of the effort is me just learning how to shoehorn this into a LP solver with little experience in formulating problems in that structure. – Solaxun Aug 15 '23 at 00:12
  • The large search space and variety of constraints is part of the reason I went with repeated random hill climbing - it's very easy to implement, and pretty quick at finding a decent local optimum. – Solaxun Aug 15 '23 at 00:13
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254916/discussion-between-solaxun-and-airsquid). – Solaxun Aug 15 '23 at 03:02