In a matrix, e.g. M1
, rows are countries and columns are years. The countries don't have observations for the same years. I want to find the "best" intersect of years that gives me the most countries. The number of minimum years and minimum countries will be predefined. Which countries are included in the result doesn't matter, the years don't have to be consecutive.
> M1
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] NA NA NA 2004 NA 2006 NA 2008 2009 NA 2011 2012 NA NA NA
[2,] NA 2002 NA 2004 NA NA 2007 NA NA 2010 2011 NA 2013 2014 NA
[3,] NA NA NA 2004 2005 2006 2007 2008 2009 NA NA 2012 2013 NA 2015
[4,] NA 2002 NA 2004 2005 2006 2007 2008 NA 2010 2011 NA 2013 NA NA
[5,] 2001 NA NA NA 2005 2006 2007 2008 NA 2010 NA 2012 2013 2014 NA
[6,] 2001 NA 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 NA 2014 NA
[7,] 2001 2002 NA NA 2005 NA 2007 NA 2009 NA 2011 NA NA 2014 2015
[8,] 2001 2002 NA 2004 2005 2006 NA NA NA 2010 NA NA 2013 NA 2015
[9,] NA 2002 NA 2004 2005 NA 2007 NA NA 2010 2011 NA NA NA NA
[10,] 2001 2002 NA 2004 NA NA NA NA NA 2010 NA 2012 NA 2014 2015
Because there is no obvious intersect, a single Reduce(intersect...)
attempt won't work, and I do that repeatedly by successively excluding one country up to the defined threshold n.row
. The result is filtered for a minimum of years n.col
. I wrote this function,
findBestIntersect <- function(M, min.row=5, min.col=3) {
## min.row: minimum number of rows (countries) to analyze
## min.col: minimum number of complete columns (years)
# put matrices with row combn into list (HUGE!)
L1 <- lapply(min.row:(nrow(M) - 1), function(x)
combn(nrow(M), x, function(i) M[i, ], simplify=FALSE))
# select lists w/ def. number of complete columns
slc <- sapply(L1, function(y) # numbers of lists
which(sapply(y, function(x)
sum(!(apply(x, 2, function(i) any(is.na(i))))))
>= min.col))
# list selected lists
L2 <- Map(function(x, i)
x[i], L1[lengths(slc) > 0], slc[lengths(slc) > 0])
# find intersects
L3 <- rapply(L2, function(l)
as.integer(na.omit(Reduce(intersect, as.list(as.data.frame(t(l)))))),
how="list")
return(unique(unlist(L3, recursive=FALSE)))
}
which gives me the desired result for M1
in no time.
> system.time(best.yrs.1 <- findBestIntersect(M1))
user system elapsed
0.06 0.00 0.07
> best.yrs.1
[[1]]
[1] 2002 2004 2010
However the performance for M2
was only just acceptable (RAM usage around 1.1 GB),
> system.time(best.yrs.2 <- findBestIntersect(M2))
user system elapsed
79.90 0.39 82.76
> head(best.yrs.2, 3)
[[1]]
[1] 2002 2009 2015
[[2]]
[1] 2002 2014 2015
[[3]]
[1] 2003 2009 2010
and you don't want to try this with M3
(blasts 32 GB RAM) which resembles my real matrix:
# best.yrs.3 <- findBestIntersect(M3)
Probably the biggest flaw of the function is that L1
becomes too big very fast.
So, my question is, would there be a better method that is also applicable to M3
? The "bonus" would be to maximize both, countries and years. If possible I want to do this without additional packages.
Data
set.seed(42)
tf <- matrix(sample(c(TRUE, FALSE), 150, replace=TRUE), 10)
M1 <- t(replicate(10, 2001:2015, simplify=TRUE))
M1[tf] <- NA
tf <- matrix(sample(c(TRUE, FALSE), 300, replace=TRUE), 20)
M2 <- t(replicate(20, 2001:2015, simplify=TRUE))
M2[tf] <- NA
tf <- matrix(sample(c(TRUE, FALSE), 1488, replace=TRUE), 31)
M3 <- t(replicate(31, 1969:2016, simplify=TRUE))
M3[tf] <- NA