7

In a matrix, e.g. M1, rows are countries and columns are years. The countries don't have observations for the same years. I want to find the "best" intersect of years that gives me the most countries. The number of minimum years and minimum countries will be predefined. Which countries are included in the result doesn't matter, the years don't have to be consecutive.

> M1
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
 [1,]   NA   NA   NA 2004   NA 2006   NA 2008 2009    NA  2011  2012    NA    NA    NA
 [2,]   NA 2002   NA 2004   NA   NA 2007   NA   NA  2010  2011    NA  2013  2014    NA
 [3,]   NA   NA   NA 2004 2005 2006 2007 2008 2009    NA    NA  2012  2013    NA  2015
 [4,]   NA 2002   NA 2004 2005 2006 2007 2008   NA  2010  2011    NA  2013    NA    NA
 [5,] 2001   NA   NA   NA 2005 2006 2007 2008   NA  2010    NA  2012  2013  2014    NA
 [6,] 2001   NA 2003 2004 2005 2006 2007 2008 2009  2010  2011  2012    NA  2014    NA
 [7,] 2001 2002   NA   NA 2005   NA 2007   NA 2009    NA  2011    NA    NA  2014  2015
 [8,] 2001 2002   NA 2004 2005 2006   NA   NA   NA  2010    NA    NA  2013    NA  2015
 [9,]   NA 2002   NA 2004 2005   NA 2007   NA   NA  2010  2011    NA    NA    NA    NA
[10,] 2001 2002   NA 2004   NA   NA   NA   NA   NA  2010    NA  2012    NA  2014  2015

Because there is no obvious intersect, a single Reduce(intersect...) attempt won't work, and I do that repeatedly by successively excluding one country up to the defined threshold n.row. The result is filtered for a minimum of years n.col. I wrote this function,

findBestIntersect <- function(M, min.row=5, min.col=3) {
  ## min.row: minimum number of rows (countries) to analyze
  ## min.col: minimum number of complete columns (years)
  # put matrices with row combn into list (HUGE!)
  L1 <- lapply(min.row:(nrow(M) - 1), function(x)
    combn(nrow(M), x, function(i) M[i, ], simplify=FALSE))
  # select lists w/ def. number of complete columns
  slc <- sapply(L1, function(y)  # numbers of lists
    which(sapply(y, function(x)
      sum(!(apply(x, 2, function(i) any(is.na(i))))))
      >= min.col))
  # list selected lists
  L2 <- Map(function(x, i)
    x[i], L1[lengths(slc) > 0], slc[lengths(slc) > 0])
  # find intersects
  L3 <- rapply(L2, function(l)
    as.integer(na.omit(Reduce(intersect, as.list(as.data.frame(t(l)))))),
    how="list")
  return(unique(unlist(L3, recursive=FALSE)))
}

which gives me the desired result for M1 in no time.

> system.time(best.yrs.1 <- findBestIntersect(M1))
   user  system elapsed 
   0.06    0.00    0.07 

> best.yrs.1
[[1]]
[1] 2002 2004 2010

However the performance for M2 was only just acceptable (RAM usage around 1.1 GB),

> system.time(best.yrs.2 <- findBestIntersect(M2))
   user  system elapsed 
  79.90    0.39   82.76 
> head(best.yrs.2, 3)
[[1]]
[1] 2002 2009 2015

[[2]]
[1] 2002 2014 2015

[[3]]
[1] 2003 2009 2010

and you don't want to try this with M3 (blasts 32 GB RAM) which resembles my real matrix:

# best.yrs.3 <- findBestIntersect(M3)

Probably the biggest flaw of the function is that L1 becomes too big very fast.

So, my question is, would there be a better method that is also applicable to M3? The "bonus" would be to maximize both, countries and years. If possible I want to do this without additional packages.

Data

set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 150, replace=TRUE), 10)
M1 <- t(replicate(10, 2001:2015, simplify=TRUE))
M1[tf] <- NA

tf <- matrix(sample(c(TRUE, FALSE), 300, replace=TRUE), 20)
M2 <- t(replicate(20, 2001:2015, simplify=TRUE))
M2[tf] <- NA

tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA
Community
  • 1
  • 1
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • @BenNutzer You mean sth like `tf^1 %*% t(tf^1)`? What of my question is not clear for you? – jay.sf Aug 17 '19 at 08:19
  • The problem stems from the `combn()` function unfortunately, i.e. if I run just `combn(31,16)`, I face with memory problems. Even if I try `comboGeneral()` from `RcppAlgos` package, it is again becomes problematic. – maydin Aug 17 '19 at 08:27
  • @BenNutzer Ok, i.e. `t(tf^1) %*% tf^1`. Looks promisingly. However I'm not sure what I can do with this output? – jay.sf Aug 17 '19 at 08:38
  • @maydin You exactly describe my problem :) – jay.sf Aug 17 '19 at 08:39
  • I would like to give this a shot with mixed integer programming but unfortunately I haven't really understood the question. What is the objective and constraints here? What does *" find the "best" intersect of years that gives me the most countries"* mean? – Shree Aug 18 '19 at 14:41
  • @Shree My objective is to find the "best" combination of years (with a defined minimum, e.g. 3 years, gaps in the combination are possible), that gives me the largest number of countries with data (non-`NA`) in those years. – jay.sf Aug 19 '19 at 06:18
  • @chinsoon12 Of course there could be several best combinations if there are several possibilities for the number of years (several first-ranking), e.g. 3 as in my example. – jay.sf Aug 19 '19 at 06:25

3 Answers3

3

I wrote a coded_best_intersect function that relies on creating a for loop dynamically in a code_maker function. It evaluates M3 in 30 seconds. Because the code generates a list, I am depending on data.table for rbindlist and the print method.

library(data.table)

code_maker function:

code_maker <- function(non_NA_M, n, k, min.col) {
  ## initializing for results
  res <- list()
  z <- 1
  ## initializing naming
  col_names <- colnames(non_NA_M)
  i_s <- paste0('i', seq_len(k))
  ## create the foor loop text. It looks like this mostly
  ## for (i1 in 1:(n - k + 1)) { for (i2 in (i1 + 1):(n-k+2)) {}}
  for_loop <- paste0('for (', i_s, ' in ', c('1:', paste0('(', i_s[-k], ' + 1):')), 
                     n - k + seq_len(k), ')', ' {\n non_na_sums', seq_len(k), 
                     '=non_NA_M[', i_s, ', ] ',
                     c('', paste0('& ', rep('non_na_sums', k - 1), seq_len(k)[-k])), '', 
                     '\n if (sum(non_na_sums', seq_len(k), ') < ', min.col, ') {next} ', 
                     collapse='\n')
  ## create the assignment back to the results which looks like
  ## res[[z]] <- data.table(M=k, N=sum(non_na_sumsk), ROWS=list(c(i1, i2, ..., ik)), 
  ##                        YEARS=list(col_names[non_na_sumsk]))
  inner_text <- paste0('\nres[[z]] <- data.table(M=k, N=sum(non_na_sums',
                       k, '), ROWS=list(c( ', paste0(i_s, collapse=', '), 
                       ')), YEARS=list(col_names[non_na_sums', k , ']))\nz <- z + 1')
  ## combines the loop parts and closes the for with }}}
  for_loop <- paste(for_loop, 
                    inner_text, 
                    paste0(rep('}', k), collapse=''))
  ## evaluate - the evaluation will assign back to res[[i]]  
  eval(parse(text=for_loop))
  res <- rbindlist(res)
  if (length(res) == 0) { #to return emtpy data.table with the correct fields
    return(data.table(M=integer(), N=integer(), ROWS=list(), YEARS=list()))
  }
  res$M <- k
  return(res)
}

coded_best_intersect function:

coded_best_intersect <- function(M, min.row=5, min.col=3) {
  colnames(M) <- apply(M, 2, function(x) na.omit(x)[1])
  n_row <- nrow(M)
  non_NA <- !is.na(M)
  n_combos <- min.row:(n_row - 1)
  res2 <- list()
  for (i in seq_along(n_combos)) {
    res2[[i]] <- code_maker(non_NA, n=n_row, k=n_combos[i], min.col)
    if (nrow(res2[[i]]) == 0) {
      break
    }
  }
  return(res2)
}

This is e.g. the code generated on the fly for k=5:

# for (i1 in 1:5) {
#   non_na_sums1=non_NA_M[i1, ] 
#   if (sum(non_na_sums1) < 3) {next} 
#   for (i2 in (i1 + 1):6) {
#     non_na_sums2=non_NA_M[i2, ] & non_na_sums1
#     if (sum(non_na_sums2) < 3) {next} 
#     for (i3 in (i2 + 1):7) {
#       non_na_sums3=non_NA_M[i3, ] & non_na_sums2
#       if (sum(non_na_sums3) < 3) {next} 
#       for (i4 in (i3 + 1):8) {
#         non_na_sums4=non_NA_M[i4, ] & non_na_sums3
#         if (sum(non_na_sums4) < 3) {next} 
#         for (i5 in (i4 + 1):9) {
#           non_na_sums5=non_NA_M[i5, ] & non_na_sums4
#           if (sum(non_na_sums5) < 3) {next} 
#           for (i6 in (i5 + 1):10) {
#             non_na_sums6=non_NA_M[i6, ] & non_na_sums5
#             if (sum(non_na_sums6) < 3) {next}  
#             res[[z]] <- data.table(M=k, N=sum(non_na_sums6), 
#                                    ROWS=list(c( i1, i2, i3, i4, i5, i6)),
#                                    YEARS=list(col_names[non_na_sums6]))
#             z <- z + 1 }}}}}}

You can likely notice the {next} which is a way for it to skip a combination if there's no possible way to get a minimum of 3 columns. And while it looks like it is all hard-coded, the code is actually a string generated, parsed, and then evaluated.

Usage and Performance

Matrix M1:

system.time(final1 <- coded_best_intersect(M1))
   user  system elapsed 
      0       0       0 
data.table::rbindlist(final1)[order(-M*N)]
   M N           ROWS          YEARS
1: 5 3  2, 4, 8, 9,10 2002,2004,2010

Matrix M2:

system.time(final2 <- coded_best_intersect(M2))
   user  system elapsed 
   0.08    0.00    0.08 
data.table::rbindlist(final2)[order(-M*N)]
     M N                  ROWS               YEARS
  1: 7 3  6, 8,11,12,13,16,...      2002,2012,2013
  2: 5 4         6, 8,13,16,17 2002,2012,2013,2015
  3: 5 4         8,11,12,13,17 2002,2012,2013,2014
  4: 6 3      1, 4, 8,13,17,20      2002,2014,2015
  5: 6 3      2, 5, 6,10,14,17      2003,2006,2008
 ---                                              
126: 5 3        10,12,13,17,20      2002,2008,2014
127: 5 3        10,12,14,17,20      2003,2008,2014
128: 5 3        11,12,13,16,17      2002,2012,2013
129: 5 3        11,12,13,17,20      2002,2012,2014
130: 5 3        12,13,15,16,19      2001,2002,2013

Matrix M3:

system.time(final3 <- coded_best_intersect(M3))
   user  system elapsed 
  29.37    0.05   29.54 
data.table::rbindlist(final3)[order(-M*N)]
       M N              ROWS                             YEARS
    1: 6 7  1, 3, 8,15,20,29 1969,1973,1980,1984,1985,1992,...
    2: 5 8     1, 3, 8,14,29 1969,1973,1976,1980,1984,1987,...
    3: 5 8     1, 3, 8,20,29 1969,1973,1980,1984,1985,1992,...
    4: 5 8     2, 7, 9,13,17 1974,1993,1994,2004,2012,2013,...
    5: 5 8     3, 6, 8, 9,27 1974,1980,1984,1987,1995,1998,...
   ---                                                        
52374: 5 3    23,24,25,30,31                    1979,1997,2002
52375: 5 3    23,25,28,30,31                    1979,1992,2002
52376: 5 3    24,25,26,30,31                    1983,1997,2002
52377: 5 3    24,25,28,30,31                    1979,1983,2002
52378: 5 3    24,26,28,30,31                    1983,1986,2002

To put the selected part of a result into a character string, you can do e.g. the following:

x <- data.table::rbindlist(final3)[order(-M*N)]
el(x$YEARS[1])  # select `YEARS` of result-row `1:`
# [1] "1969" "1973" "1980" "1984" "1985" "1992" "2003"

Note: See edit history for two other very different approaches. The first was melt and join techniques which blew up the memory. The second approach was using RcppAlgos::comboGeneral to evaluate a function.

jay.sf
  • 60,139
  • 8
  • 53
  • 110
Cole
  • 11,130
  • 1
  • 9
  • 24
  • Nice solution, but that also overwhelms my 32GB RAM with `M3`. However, it's interesting to watch the RAM while working, it seems to be flushed in between from time to time. May we tell `data.table` to do this more often? – jay.sf Aug 19 '19 at 07:23
  • In addition to an algorithm, it would perform better if the big join would have an ```.EACHI``` statement so that it would count the values without blowing up too much RAM. But I still think the memory issue is due to all the combinations and building that 300 million row giant with 16 columns. – Cole Aug 19 '19 at 10:47
  • I'm not very familiar with `data.table`, maybe you'd like to elaborate a little about `.EACHI`? – jay.sf Aug 19 '19 at 15:38
  • Unfortunately, I couldn't get ```by = .EACHI``` to work. The command allows to aggregate on the join. So instead of ```X[Y][, .N, by = Y.col1]``` you can do ```X[Y, .N, by = col1]``` which should allow better performance, especially with respect to memory. Regardless, I made an update to get around M3 but I'm rethinking my ```melt``` approach as I think I can directly subset instead of joining. – Cole Aug 20 '19 at 04:07
  • @jay.sf I optimized the function to depend on the previous results. This completes the ```M2``` matrix in less than a second. Additionally, you can include a function argument to limit how many combinations would be evaluated before the function is exited. – Cole Aug 21 '19 at 03:13
  • Looks promising, your solution seems to yield the best results so far. However, the `N` column should be renamed to `M` (for columns or years) and a real `N` column (for rows or countries) should be included somehow. Otherwise we still need to browse the lists and it is hard to find e.g. the `unlist(res_3[[2]][1, ]$Years)` result which is a clear favorite. And what exactly do these `v` values mean? – jay.sf Aug 22 '19 at 07:52
  • I made a big edit and instead of a ```melt``` and join approach, it is now mainly ```RcppAlgos::comboGeneral```. ```data.table``` is still used - primarily because I like how it displays lists. In the previous approach, the ```V``` values corresponded to the rows which matched the criteria. It is now more explicit with an actual ```Rows``` column. – Cole Aug 22 '19 at 11:05
  • @jay.sf see edit. The main part of the answer relies on base. Also, if 30 seconds is too slow, translating the output to ```Rcpp``` would likely get it to where you need it to be. – Cole Aug 24 '19 at 19:19
  • 1
    This looks really great now! You've actually eliminated the shortcoming with the premature abortion of the loop. I hope you don't mind that I restructured it a little. – jay.sf Aug 25 '19 at 05:44
  • 1
    The restructuring looks great - thanks for taking the time to update. – Cole Aug 25 '19 at 12:40
3

This is a trivial problem using mixed integer programming and can be solved very quickly even with weak open source solver like glpk. I am using ompr package for mathematical modeling (more info on ompr) and have included the model logic as comments in code. Note that my random data is different than OP's due to different R versions I guess.

Total run time was around a minute (i.e. actual solve time is even less) for M3 when model set to maximize data for at most 15 years. This method will easily scale up for even larger instances.

library(dplyr)
library(ROI)
library(ROI.plugin.glpk)
library(ompr)
library(ompr.roi)

set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA

m <- +!is.na(M3) # gets logical matrix; 0 if NA else 1    
nr <- nrow(m)
nc <- ncol(m)
n_years <- 15 

model <- MIPModel() %>% 
  # keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
  add_variable(keep[i,j], i = 1:nr, j = 1:nc, typ = "binary") %>% 
  # rm_row[i] is 1 if row i is selected for removal else 0
  add_variable(rm_row[i], i = 1:nr, type = "binary") %>% 
  # rm_col[j] is 1 if column j is selected for removal else 0
  add_variable(rm_col[j], j = 1:nc, type = "binary") %>% 
  # maximize good cells kept
  set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>% 
  # cell can be kept only when row is not selected for removal
  add_constraint(sum_expr(keep[i,j], j = 1:nc) <= 1 - rm_row[i], i = 1:nr) %>%
  # cell can be kept only when column is not selected for removal
  add_constraint(sum_expr(keep[i,j], i = 1:nr) <= 1 - rm_col[j], j = 1:nc) %>%
  # only non-NA values can be kept
  add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>% 
  # keep at most n_years columns i.e. remove at least (nc - n_years) columns
  # I used >= instead of == to avoid infeasiblity
  add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years) %>% 
  # solve using free glpk solver
  solve_model(with_ROI(solver = "glpk"))

Results -

solver_status(model)
# [1] "optimal"    <- indicates guaranteed optimum (at least one of the many possible)

# get rows to remove
rm_rows <- model %>% 
  get_solution(rm_row[i]) %>% 
  filter(value > 0) %>% pull(i) %>% print()

# [1]  1  2  3  4  6  8  9 11 12 13 14 15 17 18 19 20 21 22 23 25 27 28 29 30 31

# get columns to remove
rm_cols <- model %>% 
  get_solution(rm_col[j]) %>% 
  filter(value > 0) %>% pull(j) %>% print()

# [1]  2  3  4  5  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [24] 27 28 29 30 31 32 33 34 35 36 38 39 40 41 44 45 46 47 48

result <- M3[-rm_rows, -rm_cols, drop = F]

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1969 1974 1994 2005 2010 2011
[2,] 1969 1974 1994 2005 2010 2011
[3,] 1969 1974 1994 2005 2010 2011
[4,] 1969 1974 1994 2005 2010 2011
[5,] 1969 1974 1994 2005 2010 2011
[6,] 1969 1974 1994 2005 2010 2011
Shree
  • 10,835
  • 1
  • 14
  • 36
  • Thanks, interesting approach. However, I think we still have different logics. By "intersect," I actually mean that in the result, these "best" years (rows) should be included in all remaining countries (columns) after omitting `NA`s, so that there is no "NA" at the end (after deleting the _other_ rows or countries). E.g. in `M1` shown above in my question the years `2006`, `2008`, and `2012` (i.e. the desired output, the original values above seem to be wrong) give me a 4x3 matrix without `NA`s, the skipped years between `2006` and `2008`, and between `2008` and `2012` I consider as "gaps". – jay.sf Aug 19 '19 at 17:37
  • @jay.sf Are you saying that you want `n` columns with no `NA` such that number of countries (rows) is maximized? Can you include output matrix for `M1` in your post? – Shree Aug 19 '19 at 17:49
  • Yeah exactly, consider the values binary as you have done somewhat in your code -- `1` is value `0` is missing. Then I'd like to have as much rows (i.e. countries) as possible with common `1`s in columns and delete those columns with any `0` after. I'm attempting regression analysis on sparse data and need at least 3-5 countries and the more years the better. – jay.sf Aug 19 '19 at 17:53
  • @jay.sf Okay, would you say output of `3x5` is better than `5x3`? Both have 15 valid cells. Basically, for the same number of valid cells, would you rather have more columns or more rows? – Shree Aug 19 '19 at 17:58
  • You're right, probably kind of a list output would be more feasible to assess the trade-off with the obtained data manually, The situation in my real data is, that I hope to have at best two years for a reasonable number of countries. Three or even more would be cigar... – jay.sf Aug 19 '19 at 18:11
  • Thanks for the update, your function seems to be doing the expected now, with `M3` in just around 50 sec. on my machine. However, it misses the 6 x 7 solution I found with @Cole's solution: `res <- M3[, c(1L, 5L, 12L, 16L, 17L, 24L, 35L)]; res[complete.cases(res), ]`. Do you see any room for improvement? – jay.sf Aug 22 '19 at 06:03
  • @jay.sf your code returned an empty dataframe for me. It's probably because my random data with `seed(42)` is different than yours because of different R versions (I have 3.4.1). FYI, my solution, being `"optimal"`, cannot be beaten for number of good cells returned. However the combination of output `rows x columns` can vary if multiple solutions exist for same number of good cells i.e. say `5x3` vs `3x5`. – Shree Aug 22 '19 at 13:52
1

Since the combination uses a lot of memory without giving any result (at least in my computer it gave an error) , maybe clustering the data can give a solution.

Data: set.seed(42)

> M1
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
 [1,] 2001   NA 2003 2004   NA   NA 2007   NA 2009  2010  2011  2012    NA  2014    NA
 [2,] 2001 2002   NA 2004   NA   NA 2007   NA   NA    NA    NA  2012    NA  2014    NA
 [3,]   NA 2002 2003   NA   NA   NA 2007   NA   NA    NA    NA  2012  2013  2014    NA
 [4,] 2001   NA 2003 2004 2005 2006 2007   NA 2009  2010    NA    NA  2013  2014  2015
 [5,] 2001   NA   NA   NA   NA   NA 2007   NA 2009  2010  2011  2012  2013  2014  2015
 [6,] 2001 2002 2003 2004 2005 2006   NA 2008 2009  2010  2011  2012    NA  2014  2015
 [7,] 2001 2002   NA   NA 2005 2006   NA   NA   NA    NA  2011    NA  2013  2014    NA
 [8,]   NA   NA 2003   NA 2005   NA 2007   NA   NA  2010  2011    NA  2013    NA  2015
 [9,] 2001   NA   NA 2004 2005   NA 2007 2008   NA  2010  2011  2012    NA  2014    NA
[10,] 2001 2002 2003 2004 2005 2006   NA   NA   NA  2010    NA  2012  2013  2014    NA

I used klaR library for using kmodes function which is for clustering the categorical variables (similar to kmeans)

Function:

  library(klaR)
  library(matrixStats)

opt <- function(data,iter) {
        my_list <- list()   # It will be assigned by the cluster outputs
        t_M <- t(!is.na(data)) # Transforming the data into TRUE/FALSE in other words 1/0   
        result <- lapply(1:iter, function(rand) { # This is for assigning new seeds.
                    set.seed(rand^2)
                    for(i in 2:nrow(t_M)-1) { # Loop for the  Number of the clusters 

                        # cluster function
                        cluster.results <-kmodes(t_M, i, iter.max = 100, weighted = FALSE , fast = TRUE) 


                    # Binding the data and clusters output. And adding index for finding the location                   
                    df <- as.data.frame(cbind(Index=1:nrow(t_M),Cluster=cluster.results$cluster,t_M)) 

                    # Below is the calculation step. It multiplies the columns. The idea is here if 
                    # there exist an intersection, the product must be 1. Otherwise it is 0.
                    calc <-  sapply(1:i , function(x){

                                if(nrow(df[df[,"Cluster"]==x,])<=1) {
                                    quantity <- 0
                                }else{

                                    quantity <- sum(colProds(as.matrix(df[df[,"Cluster"]==x,][,- c(1,2)]))) 
                                }
                             })
                    # Creating a new df named out, just for arranging the output.
                    out <- data.frame(Cluster= 1:i ,
                                        Quantity= calc,
                                            Size= cluster.results$size,
                                                Iter = paste0(rand,"-",i)   )
                    # Merging our first dataframw (df) with out (without unimportant columns)                       
                    my_list[[i]] <- merge(df[, c("Cluster","Index")],out,by="Cluster")

                }
            # Binding all outputs in my_list. It includes all clusters from 1:nrow(t_M)
            do.call(rbind,my_list)

            })
    # Since we have done the same steps as much as the Iter input, This is for binding all.
    # So it includes, iter * clusteroutput data
    result <- do.call(rbind,result)
    # Neglecting unnecessary columns 
    result <- result[,-(ncol(result)-2)]

    colnames(result) <- c("Cluster","Index","Matching","Years","Iter")

return(result)
}

The clustering depends on the seeding too much. So when seeding changes, clustering output may vary. For that reason I calculated possible clusters with respect to different seeds. The idea in here, to detect the similar groups and make the calculation on them without diving into all combinations.

#100 is the number of the seeds, when it increases the calculation time also increases. 
#For calculating M3 matrix, it may be good to decrease the seed amount.
#However, it may decrease the quality of the clustering. (Trade off)

my_result <- opt(M1,100) # It takes about one minute in my computer.

tail(my_result)

      Cluster Index Matching Years   Iter
22495      10     1        0     1 100-15
22496      11    13        0     1 100-15
22497      12     2        0     1 100-15
22498      13    10        0     1 100-15
22499      14     5        0     1 100-15
22500      15     8        0     1 100-15

# Years : Number of the Years in the solution
# Matching : Number of the matches in a cluster
# Cluster : The Cluster Number
# Iter : Iteration of Cluster
# Index : The index of where the cluster is located at the input dataframe

From this point on, it becomes data manipulation work. Any filtering can be done. For example, consider I want to have a look at only 3 Years outputs with the maximum matching. So,

out_list1 <- my_result %>% filter(Years==3) %>% 
            filter(Matching==max(Matching)) %>% 
            group_by(Iter,Cluster,add = TRUE) %>% 
            select(Index) %>% group_split() 

After getting the out_list, by using the function below the desired output can be reached.

Function:

find_match <- function(output,data) {

x <- unique(lapply(1:length(output),function(k) {
        sort(t(output[[k]][3]))
        }))
return(lapply(1:length(x),function(i) na.omit(unique(data[,x[[i]]]))[1,]))

}

If we call the find_match,

find_match(out_list1,M1)

[[1]]
[1] 2001 2004 2014

[[2]]
[1] 2001 2010 2014

[[3]]
[1] 2001 2012 2014

Another trial: Replication of your input findBestIntersect(M1)

out_list2 <- my_result %>% filter(Years>=3) %>%
            filter(Matching>=5) %>% 
            group_by(Iter,Cluster,add = TRUE) %>% 
            select(Index) %>% group_split() 

find_match(out_list2,M1)

[[1]]
[1] 2001 2010 2014

[[2]]
[1] 2001 2004 2010 2014

[[3]]
[1] 2001 2004 2014

[[4]]
[1] 2001 2004 2012 2014

[[5]]
[1] 2001 2007 2014

[[6]]
[1] 2001 2012 2014

[[7]]
[1] 2007 2012 2014

[[8]]
[1] 2001 2004 2010

[[9]]
[1] 2001 2011 2014

[[10]]
[1] 2001 2010 2012 2014

The difference is, your output also gives

2001 2005 2014

Since kmodes is a classifier we can't expect perfect fitting. So this is a misclassification fault as expected from all prediction algorithms. But interestingly, your function also misses the,

2001 2004 2010

which is also a solution.

maydin
  • 3,715
  • 3
  • 10
  • 27
  • @jay.sf I wonder if you had chance to use the solution above. I have just run it as for `M3` and it took 35 minutes for `my_result <- opt(M3,100)` setting in my 8GB ram computer. It gave maximum country which is 16, with *2013 - 2014* , *1997 - 2013* and *2006 - 2014* pairs without any error and memory problem. – maydin Aug 19 '19 at 09:30
  • Yes, I tried it, your output differs slightly from my expectations, though. Obviously, my problem is also a little ambiguous. See the new comments above for some clarification. To be honest, your code is a little hard for me to understand because it is not commented at all. Anyway great idea to run a cluster analysis over the data, I'll try something with `hclust`! – jay.sf Aug 19 '19 at 15:36
  • @jay.sf I added comments inside the codes and arranged the output as you expected. – maydin Aug 20 '19 at 02:14
  • This is an interesting approach but your dataset is slightly different. That's why you have 2001 as part of the solution for ```M1``` but OP and myself do not. – Cole Aug 20 '19 at 04:30
  • @Cole I know. Because in the example data @jsay.sf did not use the `set.seed(42)`. However at the code creation part, he offered us to use `set.seed(42)`. That is why I wrote down almost everywhere of my answer `set.seed(42)` for creating the `M1`, `M2` and `M3`. I wanted to show the possible differences in terms of the input data. But it does not matter, if anyone uses their own matrix, it works. – maydin Aug 20 '19 at 06:47
  • I tried your code for `M3`, the output differs from the other solutions, though. Best result from the other solutions turned out to be a 6 x 7 matrix, which wasn't included in your results. Besides, [different R versions may yield different results with `set.seed()`](https://stackoverflow.com/a/56381613/6574038). I'm using `"R version 3.6.0 RC (2019-07-03 r76779)"`. – jay.sf Aug 22 '19 at 07:34
  • @jay.sf Please check the results by eye to see if the solutions really perform as expected. I just replicate my solution with the outputs of the one of the other solutions below and compare them with my own matrix (you are right btw, seeding my vary version to version) and I found the output of my approach is not wrong. After doing eye check, you can end up with your decision. – maydin Aug 22 '19 at 11:17