How to speed up linear programming problem solution?

Question

I have the following problem. I have n (typically n = 1000) data points (integers from {1,2,3}, so there are a lot of repeting numbers) and a real number d. I have to choose k<n points (k is given) which minimize the distance between the mean of those k points and point d. This can be expressed as a MILP problem (please see here).

I tried to solve that in R using lpSolve and Rglpk packages but it takes a lot of time to solve it (I tried to solve it for n = 100 points and the code has been running for 40 minutes already). I guess the issue is that there are lots of binary variables (n) and there are also repeating numbers.

library(Rglpk)
set.seed(123)
sampsize <- sample(c(1,2,3),size=100,replace = TRUE)
k <- 50
d <- 86/47
lngth <- length(sampsize)
f.obj <- c(1,rep(0,lngth))
f.con <- matrix(c(0,rep(1,lngth),-1,sampsize/k,1,sampsize/k),nrow=3, byrow = TRUE)
f.dir <- c("==","<=",">=")  
f.rhs <- c(k,d,d)

f.types <- c("C",rep("B",lngth))

res <- Rglpk_solve_LP(obj=f.obj,mat=f.con,dir=f.dir,rhs=f.rhs,max=FALSE,types=f.types)

I will be satisfied with a sub-optimal solution. Is there a way to solve it quickly or re-express the problem in a certain way to speed up the algorithm?

I would appreciate any input on this.

See: https://stackoverflow.com/help/minimal-reproducible-example — Erwin Kalvelagen, May 31 '22 at 01:34
If your tolerance requirements for a sub-optimal solution are not too high and you are not limiting yourself to LP, maybe a monte-carlo like approach (https://pastebin.com/K6ATp3hZ) works for you. — benjamin, Jun 02 '22 at 15:17

score 4 · Answer 1 · answered Jun 02 '22 at 00:33

The below is written in python, but I think the concept conveys very easily and can be reformulated in r if desired.

Basically: Reformulate your problem. Instead of optimizing a long vector of binary "selection" variables, all you need is 3 variables to formulate this, specifically the (integer) number of 1's, 2's, and 3's to pick.

This solves almost instantaneously as an IP.

import pyomo.environ as pyo
from random import randint

n = 1000
k = 500

sample = [randint(1, 3) for t in range(n)]

avail = {t : len([val for val in sample if val==t]) for t in range(1, 4)}

target = 86/47

m = pyo.ConcreteModel()

m.vals = pyo.Set(initialize=[1,2,3])

m.pick = pyo.Var(m.vals, domain=pyo.NonNegativeIntegers)

m.delta = pyo.Var()

m.obj = pyo.Objective(expr=m.delta)

# constrain the delta as an absolute value of |sum(picks) - target|
m.C1 = pyo.Constraint(expr=m.delta >= sum(m.pick[v]*v for v in m.vals)-target*k)
m.C2 = pyo.Constraint(expr=m.delta >= -sum(m.pick[v]*v for v in m.vals)+target*k)

# don't use more than available for each value
def limit(m, v):
    return m.pick[v] <= avail[v]
m.C3 = pyo.Constraint(m.vals, rule=limit)

soln = pyo.SolverFactory('glpk').solve(m)
print(soln)
m.pick.display()

Yields:

Solver: 
- Status: ok
  Termination condition: optimal
  Statistics: 
    Branch and bound: 
      Number of bounded subproblems: 885
      Number of created subproblems: 885
  Error rc: 0
  Time: 0.3580749034881592
Solution: 
- number of solutions: 0
  number of solutions displayed: 0

pick : Size=3, Index=vals
    Key : Lower : Value : Upper : Fixed : Stale : Domain
      1 :     0 :   3.0 :  None : False : False : NonNegativeIntegers
      2 :     0 :   0.0 :  None : False : False : NonNegativeIntegers
      3 :     0 : 304.0 :  None : False : False : NonNegativeIntegers

Realize you can also attack this algorithmically quite efficiently and get a (pretty easy) near-optimal answer, or with some sweat-equity get the optimal answer as well. Below is a framework that I tinkered with. The key observation is that you can "add more 3's" to the solution up until the point where the amount to go (to get to k * target can be filled all with 1's. That's very close to as-good-as-it-gets, except for cases where you'd be better off substituting a couple of 2's near the end, I think, or backing up if you run out of 1's.

The below runs (in python) and is most of the way there for a good approximation.

### Code:

# average hitting

from random import randint
n = 1000
k = 50

sample = [randint(1, 3) for t in range(n)]

available = {t : len([val for val in sample if val==t]) for t in range(1, 4)}

target = 86/47

print(f'available at start: {available}')


sum_target = target * k
soln = []
selections_remaining = k
togo = sum_target - sum(soln)
for pick in range(k):
    if togo > k - pick and available[3] > 0:
        soln.append(3)
        available[3] -= 1
    elif togo > k - pick and available[2] > 0:
        soln.append(2)
        available[2] -= 1
    elif available[1] > 0:
        soln.append(1)
        available[1] -= 1
    else: # ran out of ones in home stretch... do a swap
        pass
        # some more logic...

    togo = sum_target - sum(soln)

print(f'solution: {soln}') 
print(f'generated: {sum(soln)/k} for target of {target}')

print(f'leftover: {available}')

Yields:

available at start: {1: 349, 2: 335, 3: 316}
solution: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
generated: 1.84 for target of 1.8297872340425532
leftover: {1: 291, 2: 335, 3: 274}
[Finished in 117ms]

Erwin Kalvelagen · Accepted Answer · 2022-06-02T04:02:51.660

CVXR is a much better tool for this:

#
# generate random data
#

set.seed(123)
N <- 100           # sample size
v <- c(1,2,3)      # sample from these unique values
M <- length(v)     # number of unique values

data <- sample(v, size=N, replace=TRUE)
tab <- table(data) # tabulate  

K <- 50            # number of points to choose
target <- 86/47    # target for mean 

#
#  CVXR model
#  see https://cvxr.rbind.io/
#
library(CVXR)

# select a number of values from each bin
select <- Variable(M,integer=T) 

# obj: sum of absolute deviations
objective = Minimize(abs(sum(v*select)/K-target))
# include nonnegativity constraints
constraints = list(sum(select)==K, select >= 0, select <= vec(tab))
problem <- Problem(objective, constraints)

sol <- solve(problem,verbose=T)

cat("\n")
cat("Status:",sol$status,"\n")
cat("Objective:",sol$value,"\n")
cat("Solution:",sol$getValue(select),"\n")

Output:

GLPK Simplex Optimizer, v4.65
9 rows, 4 columns, 17 non-zeros
      0: obj =   0.000000000e+00 inf =   5.183e+01 (2)
      3: obj =   5.702127660e-01 inf =   0.000e+00 (0)
*     4: obj =   1.065814104e-16 inf =   0.000e+00 (0)
OPTIMAL LP SOLUTION FOUND
GLPK Integer Optimizer, v4.65
9 rows, 4 columns, 17 non-zeros
3 integer variables, none of which are binary
Integer optimization begins...
Long-step dual simplex will be used
+     4: mip =     not found yet >=              -inf        (1; 0)
+    55: >>>>>   1.021276596e-02 >=   9.787234043e-03   4.2% (52; 0)
+    56: >>>>>   9.787234043e-03 >=   9.787234043e-03 < 0.1% (16; 36)
+    56: mip =   9.787234043e-03 >=     tree is empty   0.0% (0; 103)
INTEGER OPTIMAL SOLUTION FOUND

Status: optimal 
Objective: 0.009787234 
Solution: 26 7 17

How to speed up linear programming problem solution?

2 Answers2

Yields:

Yields: