
I have two lists, each containing a few thousand data.tables. The data.tables contain ids, and each id appears only once within each list. Each data.table also has a different set of columns, though the column names are shared with some of the other data.tables. For example, in the lists created below, id 1 appears in the 1st data.table of list1 and the 2nd data.table of list2. In the first list id 1 has data for columns 'a' and 'd', and in the second list it has columns 'a' and 'b'.

library(data.table)
# Create 2 lists of data.tables
list1 <- list(data.table(id=c(1,3), a=c(0,0), d=c(1,1)),
              data.table(id=c(2,4), b=c(1,0), c=c(2,1), f=c(3,1)),
              data.table(id=c(5,6), a=c(4,0), b=c(2,1)))

list2 <- list(data.table(id=c(2,3,6), c=c(0,0,1), d=c(1,1,0), e=c(0,1,2)),
              data.table(id=c(1,4,5), a=c(1,0,3), b=c(2,1,2)))

What I need to do is find each id in both lists and average its values. For id 1 the two relevant rows are:

 list id  a  b  d
list1  1  0 NA  1
list2  1  1  2 NA

NA values are treated as 0, so the result for id 1 should be:

id   a b   d
 1 0.5 1 0.5

Next, the top 3 column names are selected and ordered from highest to lowest value, so that the result is:

id    top3
 1   b d a

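For a single id the computation boils down to the following (a minimal sketch for id 1, using the same rbind/fill and set() steps as the full loop further down):

v1 <- list1[[1]][id == 1]   # the row for id 1 in list1
v2 <- list2[[2]][id == 1]   # the row for id 1 in list2
bind <- rbind(v1, v2, fill = TRUE)   # fill = TRUE adds the columns the rows don't share, as NA
for (j in seq_along(bind)) set(bind, which(is.na(bind[[j]])), j, 0)            # treat NA as 0
means <- colMeans(bind[, 2:ncol(bind), with = FALSE])                          # a = 0.5, d = 0.5, b = 1
paste(names(sort(means))[length(means):(length(means) - 2)], collapse = " ")   # "b d a"
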
This needs to be repeated for all ids. I have code that achieves this (below), but for large lists with thousands of data.tables and over a million ids it is very slow.

top3 <- NULL # accumulates one row of results per id
for (i in 1:6){ # i is the id to be searched for
  for (j in 1:length(list1)){
    if (i %in% list1[[j]]$id){
      listnum1 <- j
      rownum1 <- which(list1[[j]]$id==i)
      break
    }
  }

  for (j in 1:length(list2)){
    if (i %in% list2[[j]]$id){
      listnum2 <- j
      rownum2 <- which(list2[[j]]$id==i)
      break
    }
  }

  v1 <- data.table(setDF(list1[[listnum1]])[rownum1,]) # Converting to data.frame using setDF and extracting the row is faster than using data.table
  v2 <- data.table(setDF(list2[[listnum2]])[rownum2,])
  bind <- rbind(v1, v2, fill=TRUE) # Combines two rows and fills in columns they don't have in common
  for (j in 1:ncol(bind)){ # Convert NAs to 0
    set(bind, which(is.na(bind[[j]])), j, 0)}
  means <- colMeans(bind[,2:ncol(bind),with=F]) # Average the two rows
  col_ids <- as.data.table(t(names(sort(means)[length(means):(length(means)-2)])))

  # select and order the top 3 column names and bind them to the results
  top3 <- rbind(top3, cbind(id=i, top3=data.table(do.call("paste", c(col_ids[,1:min(length(col_ids),3),with=F], sep=" ")))))
}

   id top3.V1
1:  1   b d a
2:  2   f c d
3:  3   d e c
4:  4   f c b
5:  5     a b
6:  6   e c b

When I run this code on my full data set (which has a few million ids) it only gets through about 400 ids in roughly 60 seconds, so the entire data set would take days. Converting each list into one much larger data.table is not an option: there are around 100,000 possible columns, so the result becomes too large to hold in memory. Is there a faster way to achieve the desired result?

Branden Murray
  • Get rid of the loops to speed it up. Here's a start, though it ignores `NA`s instead of replacing them with 0: `rbindlist(lapply(list(list1, list2), rbindlist, fill = TRUE))[, lapply(.SD, mean, na.rm = T), by = id]` – alistaire May 23 '16 at 07:12
  • A very important thing I forgot to mention (sorry!) is that there are a few million IDs and over 100,000 different possible columns. The result of this is a data.table that is too large to hold in memory. That's what led me to my current memory-efficient, but not very computationally efficient, solution. – Branden Murray May 23 '16 at 07:43

2 Answers

4

Melt down the individual data.tables and you won't run into the issue of wasted memory:

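In long format each table is just three columns (id, variable, value), so memory scales with the number of non-missing cells rather than with the number of possible columns. For illustration, applying the same melt call to a single table:

melt(list1[[1]], id.var = 'id', variable.factor = FALSE)
#    id variable value
# 1:  1        a     0
# 2:  3        a     0
# 3:  1        d     1
# 4:  3        d     1

Doing that for every table, stacking the results, and aggregating by id and variable:
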
rbindlist(lapply(c(list1, list2), melt, id.var = 'id', variable.factor = F))[
  # find number of "rows" per id
  , nvals := max(rle(sort(variable))$lengths), by = id][
  # compute the means, assuming that missing values are equal to 0
  , sum(value)/nvals[1], by = .(id, variable)][
  # extract top 3 values
  order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
#   id    V1
#1:  1 b a d
#2:  2 f c b
#3:  3 d e a
#4:  4 b c f
#5:  5   a b
#6:  6 e b c

Or instead of rle you can do:

rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
  , .(vals = sum(value), nvals = .N), by = .(id, variable)][
  , vals := vals / max(nvals), by = id][
  order(-vals), paste(head(variable, 3), collapse = " "), keyby = id]

Or better yet, as Frank points out, don't even bother with the mean: dividing every value for a given id by the same count doesn't change the ranking, so the sums alone are enough:

rbindlist(lapply(c(list1, list2), melt, id.var = 'id'))[
  , sum(value), by = .(id, variable)][
  order(-V1), paste(head(variable, 3), collapse = " "), keyby = id]
eddi
  • I think you'd have to iteratively calculate means (keeping track of `.N` so the weighting isn't off) collapsing each list into the one before instead of building one big dataframe. – alistaire May 23 '16 at 16:17
  • Oh, I just realized I thought there were thousands of lists, not just 2 with thousands of data.frames in them. You do get different numbers of `NA`s/zeros with `rbindlist(melt())` vs. `melt(rbindlist())`, though, which will mess with the results. I'm not wholly sure which way the OP wanted it... – alistaire May 23 '16 at 16:46
  • Unfortunately I'm still running into memory issues with this approach. Initially the two lists are ~52Gb, but melting them doubles their size, and R crashes during `rbindlist`. I even tried just loading one of the lists, melting it, and doing `rbindlist`, and R crashes within a few seconds. – Branden Murray May 24 '16 at 03:07
  • @BrandenMurray I'm confused about "doubling of size" - perhaps you're not removing the old ones? Does `rm`-ing them (and maybe running `gc`) help at all? – eddi May 24 '16 at 05:01
  • @eddi, yes I am removing them and running gc(). The list is the only object in the environment. I'm checking the size of the list before and after using `format(object.size(list1), units="Gb")`. My first list is 24 Gb before melting and 47.8 Gb after. – Branden Murray May 24 '16 at 05:18
  • @BrandenMurray does this work instead: `rbindlist(lapply(c(list1, list2), function(x) x[, .(list(c(.SD))), by= id]))[, setDT(melt(c(V1[[1]], V1[[2]])))[, sum(value), by = L1][order(-V1), paste(head(L1, 3), collapse = " ")], by = id]` ? If it does the steps can probably be done better, I was just trying to minimize memory usage on each step (and there is one more step here that could use less memory at the expense of speed). – eddi May 24 '16 at 16:10
  • @eddi, that's causing R to crash as well. – Branden Murray May 25 '16 at 01:27
  • @BrandenMurray Which step crashes it? As far as I can tell only setDT adds a little bit of overhead, so if it crashes before that step something else besides memory usage might be the reason. – eddi May 25 '16 at 03:09
  • @eddi, it crashes during `lapply`. I tried running the `lapply` portion alone and I can see my memory use slowly rise over a few hours until it maxes out both my RAM and swap and then crashes. I _think_ I figured out why it crashes during `rbindlist` in the original answer -- the total number of rows in both lists after melting is ~6 billion, and from what I understand from [here](http://stackoverflow.com/questions/5233769/practical-limits-of-r-data-frame) R can only handle a maximum vector length of ~2 billion. – Branden Murray May 25 '16 at 21:41
  • @BrandenMurray ok, I don't know how to make the memory expenditure smaller without a very significant speed hit - this should be straightforward to code up in C++, so consider using Rcpp for this task – eddi May 25 '16 at 22:05
0

Not sure about the performance, but this should avoid the for-loop:

library(plyr)
library(dplyr)
a <- ldply(list1, data.frame)  # stack each list's data.tables into one data frame
b <- ldply(list2, data.frame)
dat <- full_join(a, b)         # joins on all shared columns, including id

This will give you a single data frame:

   id  a  d  b  c  f  e
1   1  0  1 NA NA NA NA
2   3  0  1 NA NA NA NA
3   2 NA NA  1  2  3 NA
4   4 NA NA  0  1  1 NA
5   5  4 NA  2 NA NA NA
6   6  0 NA  1 NA NA NA
7   2 NA  1 NA  0 NA  0
8   3 NA  1 NA  0 NA  1
9   6 NA  0 NA  1 NA  2
10  1  1 NA  2 NA NA NA
11  4  0 NA  1 NA NA NA
12  5  3 NA  2 NA NA NA

Summarising by id:

means <- function(x) mean(x, na.rm=T)
output <- dat %>% group_by(id) %>% summarise_each(funs(means))

     id     a     d     b     c     f     e
1     1   0.5     1   2.0    NA    NA    NA
2     2   NaN     1   1.0     1     3     0
3     3   0.0     1   NaN     0   NaN     1
4     4   0.0   NaN   0.5     1     1   NaN
5     5   3.5   NaN   2.0   NaN   NaN   NaN
6     6   0.0     0   1.0     1   NaN     2

Listing the top 3 through sapply will give you the same resulting table (but as a matrix, with each column corresponding to an id):

 sapply(1:nrow(output), function(x) sort(output[x,-1], decreasing=T)[1:3] %>% names)
    [,1] [,2] [,3] [,4] [,5] [,6]
[1,] "b"  "f"  "d"  "c"  "a"  "e" 
[2,] "d"  "d"  "e"  "f"  "b"  "b" 
[3,] "a"  "b"  "a"  "b"  NA   "c" 

**Updated**

Since the data is going to be large, it's prudent to create some functions that choose and combine the appropriate data frames for each id.

(i) Find the ids present in each list:

id_list1 <- lapply(list1, "[[", "id")
id_list2 <- lapply(list2, "[[", "id")

(ii) Find which data.table within each list contains each of ids 1 to 6:

id_l1<-lapply(1:6, function(x) sapply(id_list1, function(y) any(y==x) %>% unlist)) 
id_l2<-lapply(1:6, function(x) sapply(id_list2, function(y) any(y==x) %>% unlist)) 
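
Each element of id_l1 is a logical vector marking which data.table in list1 holds that id; for example, id 1 sits in the first table of list1:

id_l1[[1]]
# [1]  TRUE FALSE FALSE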

(iii) Create a function that combines the appropriate data frames for a specific id:

id_who<-function(x){
  a <- data.frame(list1[id_l1[[x]]])
  a <- a[a$id==x, ]
  b <- data.frame(list2[id_l2[[x]]]) 
  b <- b[b$id==x, ]
  full_join(a,b)
} 

new <- lapply(1:6, id_who)
new
[[1]]
  id a  d  b
1  1 0  1 NA
2  1 1 NA  2

[[2]]
  id  b c  f  d  e
1  2  1 2  3 NA NA
2  2 NA 0 NA  1  0

[[3]]
  id a d c e
1  3 0 1 0 1

[[4]]
  id b  c  f  a
1  4 0  1  1 NA
2  4 1 NA NA  0

[[5]]
  id a b
1  5 4 2
2  5 3 2

[[6]]
  id a b c d e
1  6 0 1 1 0 2

output <- ldply(new, summarise_each, funs(means))

The output will be the same as above.

The advantage of this process is that you can easily put logical breaks into the processing, either at step (ii) or step (iii).
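
For example, one way to exploit this (a sketch only: the batch size of 2 is arbitrary, and it reuses the id_who and means functions defined above) is to process the ids in batches so that only a few joined data frames and their summaries are held in memory at once:

all_ids <- 1:6
batches <- split(all_ids, ceiling(seq_along(all_ids) / 2))   # batches of 2 ids each

top3_list <- lapply(batches, function(ids) {
  out <- ldply(lapply(ids, id_who), summarise_each, funs(means))   # means for each id in the batch
  data.frame(id = out$id,
             top3 = sapply(seq_len(nrow(out)), function(r) {
               v <- sort(unlist(out[r, -1]), decreasing = TRUE)    # NA/NaN entries (columns the id never had) are dropped by sort
               paste(names(v)[seq_len(min(3, length(v)))], collapse = " ")
             }))
})
top3 <- do.call(rbind, top3_list)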

Adam Quek
  • Thanks Adam. One problem that I just noted in my question is that there are over 100,000 possible columns and a few million IDs, so putting it all into a single data frame isn't possible as it is just too large. I have 128GB of RAM to work with, but even doing `a <- ldply(list1, data.frame)` gets me up to my limit. – Branden Murray May 23 '16 at 07:26
  • I can empathise with working with big data. I've offered an alternative solution within the edit. It should help in only grabbing the appropriate data frames within the list for each id. Hopefully this could help you find logical breaks in your processing of the data. – Adam Quek May 23 '16 at 08:14