14

I have individuals that belong to different categories, they are located in different zones, these populations are expected to grow from the population value below to the demand value.

population_and_demand_by_category_and_zone <- tibble::tribble(
  ~category, ~zone, ~population, ~demand,
        "A",     1,         115,     138,
        "A",     2,         121,     145,
        "A",     3,         112,     134,
        "A",     4,          76,      91,
        "B",     1,          70,      99,
        "B",     2,          59,      83,
        "B",     3,          86,     121,
        "B",     4,         139,     196,
        "C",     1,         142,     160,
        "C",     2,          72,      81,
        "C",     3,          29,      33,
        "C",     4,          58,      66,
        "D",     1,          22,      47,
        "D",     2,          23,      49,
        "D",     3,          16,      34,
        "D",     4,          45,      96
)

Zones have a given capacity, current population is below this threshold, but demand will exceed capacity in some zones.

demand_and_capacity_by_zone <- tibble::tribble(
  ~zone, ~demand, ~capacity, ~capacity_exceeded,
      1,     444,       465,              FALSE,
      2,     358,       393,              FALSE,
      3,     322,       500,              FALSE,
      4,     449,       331,               TRUE
)

So we will need to move those individuals to a new zone (we assume we have enough total capacity). Each individual that we need to move incurs a cost, which depends on its category and destination zone. These costs are given below.

costs <- tibble::tribble(
  ~category, ~zone, ~cost,
        "A",     1,   0.1,
        "A",     2,   0.1,
        "A",     3,   0.1,
        "A",     4,   1.3,
        "B",     1,  16.2,
        "B",     2,  38.1,
        "B",     3,   1.5,
        "B",     4,   0.1,
        "C",     1,   0.1,
        "C",     2,  12.7,
        "C",     3,  97.7,
        "C",     4,  46.3,
        "D",     1,  25.3,
        "D",     2,   7.7,
        "D",     3,  67.3,
        "D",     4,   0.1
)

I wish to find the distribution of individuals across zones and categories so that the total cost is minimized. So basically have a new column new_population in the table population_and_demand_by_category_and_zone described above.

If several solutions are possible, any will do, if the result is a non integer population, that's fine.

The real use case has about 20 categories and 30 zones, so bigger but not all that big.

It seems like a problem that would be common enough so I'm hoping that there is a convenient way to solve this in R.

moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
  • 1
    This feels related: https://stackoverflow.com/q/58434400/11374827 – teunbrand Sep 03 '21 at 14:50
  • 1
    This is a relatively straightforward Linear Program. Have you used any of the LP tools in R and do you have a solver installed? – AirSquid Sep 03 '21 at 19:46
  • I haven't and no :) – moodymudskipper Sep 04 '21 at 10:19
  • 1
    I think total capacity is not enough to handle all demand. You don't say what that implies. – Erwin Kalvelagen Sep 04 '21 at 18:06
  • You are right, it was a mistake, total capacity is assumed to be sufficient at all time, i edited the data – moodymudskipper Sep 04 '21 at 18:27
  • I read the motivation behind the bounty you have set on the question and I think that it would be very hard to do it without a linear solver, unless you find a way to rewrite a linear solver from scratch directly in R and apply it to your case. It more or less looks like a request asking to train a deep learning model without an external library. Is there a reason for that? – nicola Sep 07 '21 at 15:04
  • Cleary you might have solid reasons behind your requirement, but it's not obvious to me what would be the difference between installing an R package and a library. For instance, some packages embed the corresponding library (think for instance `RSQLite` which has `sqlite` embedded). What if, say, `Rglpk` had `glpk` inside itself? And, just for reference, `lpSolve` does not require an external library. – nicola Sep 07 '21 at 15:24
  • For instance who talked about Symphony? There are tons of solvers out there; some rely on more or less standard libraries, some others have the solvers inside the package. You stated that the answer you received requires Symphony, which is not true (there are a lot of alternatives). You said that you don't want to install extra software and I don't get why an R package is not extra software and a system library is. A simple google search regarding "linear programming in r" pops up `lpSolve` and many others in the top spots. You can then search what are needed for each to be installed. – nicola Sep 07 '21 at 16:18
  • To be concise: your bounty motivation reads as "I want a pure R solution without installing anything", since there is no logical difference between an R package and a system library. – nicola Sep 07 '21 at 16:21
  • @nicola I edited the question, I hope it makes things clear – moodymudskipper Sep 07 '21 at 17:01
  • You should really go deeper in the MILP world. Basically every solver has the same R interface (minor details aside) so, if you just follow a tutorial, you will be able to translate to code the answer you received. I guess that doing this job for you (especially since you are talking about a "client" and likely making money for it) is a bit too much. You already know that `lpSolve` can be installed without much trouble. The answer is more than enough for you and for whoever reads to solve the problem. – nicola Sep 07 '21 at 21:27
  • If you really need help on how to implement a model in `lpSolve` just ask another question showing where you are stuck and what does not behave as expected. – nicola Sep 07 '21 at 21:28
  • Thanks for the help this far. I believe the question is legit, and formulated so it can help others, and that what I'll do with the answer is irrelevant. A good SO answer is generally more helpful than a tutorial. If I find the answer myself I'll post it here too. – moodymudskipper Sep 07 '21 at 23:08
  • I also think that your question is legit. However I also think that it has been answered by 1) providing the framework to solve it (MILP) 2) modelling the problem (the system of equalities and inequalities for the objective function and the constraints). I have doubts on the bounty motivations, which make little sense, from the "external software" thing to Symphony "requirement" (?). Your question is not "I have this MILP problem, how to implement it in some library": this would be *another* legit question that I suggest you to ask at this point, provided that you show some effort. – nicola Sep 08 '21 at 07:37
  • Maybe use OMPR with GLPK. `install.packages("ROI.plugin.glpk")` installs the solver as dependency. The model is a small LP so heavy machinery like Symphony is not needed. – Erwin Kalvelagen Sep 08 '21 at 11:19
  • Thanks Erwin, unfortunately installation fails and installation of the library outside of R is needed. I could install and try lpSolve though, so trying to translate the solution with it – moodymudskipper Sep 08 '21 at 14:34
  • 1
    Not on my Windows machine. I assume you use a different architecture. – Erwin Kalvelagen Sep 08 '21 at 15:46
  • yes I'm on a Mac – moodymudskipper Sep 08 '21 at 15:54

2 Answers2

8

This can be modeled as a small LP (Linear Programming) model. We introduce non-negative variables move(c,z,z') indicating the number of persons of category c to be moved from zone z to zone z'. The mathematical model can look like:

enter image description here

This can be implemented using any LP solver. A solution can look like:

----     83 VARIABLE move.L  moves needed to meet capacity

                 zone1       zone2       zone3

catA.zone1                       6
catA.zone4                      29          62
catC.zone4          27


----     83 VARIABLE alloc.L  new allocation

           zone1       zone2       zone3       zone4

catA         132         180         196
catB          99          83         121         196
catC         187          81          33          39
catD          47          49          34          96


----     83 VARIABLE totcost.L             =       12.400  total cost

Notes:

  • Interestingly the solution shows that we move people out of zone 1 to make room for people from zone 4. So in some cases, making 2 moves to resettle one person is cheaper. Of course, that depends very much on the cost structure.
  • The main constraint says: allocation = demand + inflow - outflow
  • The constraint move(c,z,z)=0 makes sure we don't move from z to z itself. This constraint is not really needed (it is implicitly enforced by the cost). I have added it for clarity. Actually, I implemented this by setting the upper bound of move(c,z,z) to zero (i.e. without an explicit constraint). For very large models I would use another possibility: don't even generate the variables move(c,z,z). This model is small, so no need for that. You can leave it out completely if you want.
  • I don't use population in the model. I don't think it is needed, that is unless we look at the next bullet.
  • There are some subtleties to think about: can we only move new persons? (i.e. original people should be allowed to stay)
Erwin Kalvelagen
  • 15,677
  • 2
  • 14
  • 39
  • Thanks Erwin, I was indeed thinking about only moving new persons, but I'll have to see how this works on the real case. In any case I believe your solution can be used if we subtract the population from capacity and demand, with the assumption that demand won't ever be below population (not sure yet if it will happen in my case). I'll try to work out an R solution from this answer and your other answer in https://stackoverflow.com/q/58434400/11374827 – moodymudskipper Sep 05 '21 at 14:11
  • Hi Erwin, given that I have about 20 categories and 30 zones, I'll end up with about 20k unknowns (moves and allocs). Do you know if this is a big computation ? as in will it take hours on a modern laptop (or is it even possible to run) ? Or inversely is it still quite a small problem. – moodymudskipper Sep 08 '21 at 15:02
  • 1
    Not very big for a pure continuous LP. A good LP solver should be able to handle this in a matter of minutes. – Erwin Kalvelagen Sep 08 '21 at 15:48
  • Thanks Erwin for getting me on track and giving me further clues, the full case ran in a few seconds in the end. – moodymudskipper Sep 12 '21 at 09:06
6

I've taken Erwin's formulation, modified it to consider that alloc should be more than the population for every zone and category, (which means already present individuals don't move), and implemented it using the {lpSolve} package, which doesn't require installing external system libraries.

Erwin's solution can be obtained by using move_new_only <- FALSE below.

SETUP

library(tidyverse)
library(lpSolve)

move_new_only  <- TRUE # means population in place can't be reallocated
categories     <- unique(population_and_demand_by_category_and_zone$category)
zones          <- unique(population_and_demand_by_category_and_zone$zone)
n_cat          <- length(categories)
n_zones        <- length(zones)

# empty coefficient arrays
move_coefs_template  <- array(0, c(n_zones, n_zones, n_cat), 
                              dimnames = list(zones, zones, categories))
alloc_coefs_template <- matrix(0, n_zones, n_cat, 
                               dimnames = list(zones, categories))

build_zone_by_category_matrix <- function(data, col) {
  data %>% 
    pivot_wider(
      id_cols = zone, names_from = category, values_from = {{col}}) %>% 
    as.data.frame() %>% 
    `row.names<-`(.$zone) %>% 
    select(-zone) %>% 
    as.matrix()
}

demand_mat <- build_zone_by_category_matrix(
  population_and_demand_by_category_and_zone, demand)

cost_mat <- build_zone_by_category_matrix(costs, cost)

population_mat <- build_zone_by_category_matrix(
  population_and_demand_by_category_and_zone, population)

OBJECTIVE FUNCTION : total cost

# stack the cost matrix vertically to build an array of all move coefficients
coefs_obj <- move_coefs_template
for(i in 1:n_zones) {
  coefs_obj[i,,] <- cost_mat
}
# flatten it for `lp`s `objective.in` argument, adding alloc coefs
f.obj <- c(coefs_obj, alloc_coefs_template)

CONSTRAINT 1 : allocation = demand + inflow - outflow

coefs_con <- list() 
for (z in zones) {
  coefs_con_zone <- list() 
  for(categ in categories) {
    coefs_arrivals <- move_coefs_template
    coefs_arrivals[,z, categ] <- 1
    coefs_departures <- move_coefs_template
    coefs_departures[z,, categ] <- 1
    coefs_moves <- coefs_arrivals - coefs_departures
    coefs_alloc <- alloc_coefs_template
    coefs_alloc[z, categ] <- -1
    # flatten the array
    coefs_con_zone[[categ]] <- c(coefs_moves, coefs_alloc)
  }
  coefs_con[[z]] <- do.call(rbind, coefs_con_zone)
}

# stack the flattened arrays to build a matrix
f.con1 <- do.call(rbind, coefs_con)
f.dir1 <- rep("==", n_zones * n_cat)
f.rhs1 <- -c(t(demand_mat)) # transposing so we start with all zone 1 and so on

CONSTRAINT 2 : Allocation never exceeds capacity

coefs_con <- list() 
for (z in zones) {
  coefs_alloc <- alloc_coefs_template
  coefs_alloc[z, ] <- 1
  coefs_con[[z]] <- c(move_coefs_template, coefs_alloc)
}

# stack the flattened arrays to build a matrix
f.con2 <- do.call(rbind, coefs_con)
f.dir2 <- rep("<=", n_zones)

f.rhs2 <- demand_and_capacity_by_zone$capacity

CONSTRAINT 3 : Allocation >= Population

i.e. we don't move people that were already there.

If we decide we can move them the rule becomes Allocation >= 0, and we get Erwin's answer.

coefs_con <- list() 
for (z in zones) {
  coefs_con_zone <- list() 
  for(categ in categories) {
    coefs_alloc <- alloc_coefs_template
    coefs_alloc[z, categ] <- 1
    # flatten the array
    coefs_con_zone[[categ]] <- c(move_coefs_template, coefs_alloc)
  }
  coefs_con[[z]] <- do.call(rbind, coefs_con_zone)
}

# stack the flattened arrays to build a matrix
f.con3 <- do.call(rbind, coefs_con)
f.dir3 <- rep(">=", n_zones * n_cat)

if (move_new_only) {
  f.rhs3 <- c(t(population_mat))
} else {
  f.rhs3 <- rep(0, n_zones * n_cat)
}

CONCATENATE OBJECTS

f.con <- rbind(f.con1, f.con2, f.con3)
f.dir <- c(f.dir1, f.dir2, f.dir3)
f.rhs <- c(f.rhs1, f.rhs2, f.rhs3)

SOLVE

# compute the solution and visualize it in the array
results_raw <- lp("min", f.obj, f.con, f.dir, f.rhs)$solution
results_moves <- move_coefs_template
results_moves[] <- 
  results_raw[1:length(results_moves)]
results_allocs <- alloc_coefs_template
results_allocs[] <- 
  results_raw[length(results_moves)+(1:length(results_allocs))]
results_moves
#> , , A
#> 
#>    1 2 3 4
#> 1  0 0 0 0
#> 2  0 0 3 0
#> 3  0 0 0 0
#> 4 13 0 2 0
#> 
#> , , B
#> 
#>   1 2  3 4
#> 1 0 0  0 0
#> 2 0 0  0 0
#> 3 0 0  0 0
#> 4 0 0 57 0
#> 
#> , , C
#> 
#>   1 2 3 4
#> 1 0 0 0 0
#> 2 0 0 0 0
#> 3 0 0 0 0
#> 4 8 0 0 0
#> 
#> , , D
#> 
#>   1  2 3 4
#> 1 0  0 0 0
#> 2 0  0 0 0
#> 3 0  0 0 0
#> 4 0 38 0 0

results_allocs
#>     A   B   C  D
#> 1 151  99 168 47
#> 2 142  83  81 87
#> 3 139 178  33 34
#> 4  76 139  58 58

TIDY RESULTS

# format as tidy data frame
results_df <-
  as.data.frame.table(results_moves) %>% 
  setNames(c("from", "to", "category", "n")) %>% 
  filter(n != 0) %>% 
  mutate_at(c("from", "to"), as.numeric) %>% 
  mutate_if(is.factor, as.character)

results_df
#>   from to category  n
#> 1    4  1        A 13
#> 2    2  3        A  3
#> 3    4  3        A  2
#> 4    4  3        B 57
#> 5    4  1        C  8
#> 6    4  2        D 38

UPDATE TABLES

population_and_demand_by_category_and_zone <-
  bind_rows(
  results_df %>% 
  group_by(category, zone = to) %>% 
  summarize(correction = sum(n), .groups = "drop"),
  results_df %>% 
    group_by(category, zone = from) %>% 
    summarize(correction = -sum(n), .groups = "drop"),
  ) %>% 
  left_join(population_and_demand_by_category_and_zone, ., by = c("category", "zone")) %>% 
  replace_na(list(correction =0)) %>% 
  mutate(new_population = demand + correction)

population_and_demand_by_category_and_zone
#> # A tibble: 16 × 6
#>    category  zone population demand correction new_population
#>    <chr>    <dbl>      <dbl>  <dbl>      <dbl>          <dbl>
#>  1 A            1        115    138      13               151
#>  2 A            2        121    145      -3.00            142
#>  3 A            3        112    134       5.00            139
#>  4 A            4         76     91     -15.0              76
#>  5 B            1         70     99       0                99
#>  6 B            2         59     83       0                83
#>  7 B            3         86    121      57               178
#>  8 B            4        139    196     -57               139
#>  9 C            1        142    160       8               168
#> 10 C            2         72     81       0                81
#> 11 C            3         29     33       0                33
#> 12 C            4         58     66      -8                58
#> 13 D            1         22     47       0                47
#> 14 D            2         23     49      38                87
#> 15 D            3         16     34       0                34
#> 16 D            4         45     96     -38                58


demand_and_capacity_by_zone <-
  population_and_demand_by_category_and_zone %>% 
  group_by(zone) %>% 
  summarise(population = sum(population), correction = sum(correction), new_population = sum(new_population)) %>% 
  left_join(demand_and_capacity_by_zone, ., by = "zone")
#> `summarise()` ungrouping output (override with `.groups` argument)
  
demand_and_capacity_by_zone
#> # A tibble: 4 × 7
#>    zone demand capacity capacity_exceeded population correction new_population
#>   <dbl>  <dbl>    <dbl> <lgl>                  <dbl>      <dbl>          <dbl>
#> 1     1    444      465 FALSE                    349         21            465
#> 2     2    358      393 FALSE                    275         35            393
#> 3     3    322      500 FALSE                    243         62            384
#> 4     4    449      331 TRUE                     318       -118            331

We see that the population never decreases and stays under capacity.

moodymudskipper
  • 46,417
  • 11
  • 121
  • 167