8

Preliminaries: this question is mostly of educational value; the actual task at hand is completed, even if the approach is not entirely optimal. My question is whether the code below can be optimized for speed and/or implemented more elegantly, perhaps using additional packages such as plyr or reshape. Run on the actual data it takes about 140 seconds, much longer than on the simulated data, since some of the original rows contain nothing but NA and additional checks have to be made. For comparison, the simulated data are processed in about 30 seconds.

Conditions: the dataset contains 360 variables, i.e. 30 repetitions of a set of 12. Let's name them V1_1, V1_2... (first set), V2_1, V2_2... (second set) and so forth. Each set of 12 variables contains dichotomous (yes/no) responses, in practice corresponding to a career status: for instance work (yes/no), study (yes/no) and so forth, 12 statuses in total, repeated 30 times.

Task: the task at hand is to recode each set of 12 dichotomous variables into a single variable with 12 response categories (e.g. work, study... ). Ultimately we should get 30 variables, each with 12 response categories.
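
To illustrate the recoding (a minimal sketch with made-up values, not the actual data): a 1 in the third position of a set means the recoded variable takes the value 3 for that row.

row1 <- c(0, 0, 1, 0, 0, 0, 0, 0, NA, 0, 0, 0) # one hypothetical row of V1_1 ... V1_12
which(row1 == 1)                               # recoded value for this row: 3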

Data: I cannot post the actual dataset, but here is a good simulated approximation:

randomRow <- function() {
  # make a row with a single 1 and some NA's
  sample(x=c(rep(0,9),1,NA,NA),size=12,replace=F) 
}

# create a data frame with 12 variables and 1500 cases
makeDf <- function() {
  data <- matrix(NA,ncol=12,nrow=1500)
  for (i in 1:1500) {
    data[i,] <- randomRow()
  }
  return(data)
}

mydata <- NULL

# combine 30 of these dataframes horizontally
for (i in 1:30) {
  mydata <- cbind(mydata,makeDf())
}
mydata <- as.data.frame(mydata) # example data ready
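
A quick sanity check on the simulated data (a sketch; on the real data some rows are entirely NA, so the counts would differ):

dim(mydata)                                 # 1500 rows, 360 columns
table(rowSums(mydata[1:12], na.rm = TRUE))  # every simulated row has exactly one 1 per set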

My solution:

# Divide the dataset into a list with 30 dataframes, each with 12 variables
S1 <- lapply(1:30,function(i) {
  Z <- rep(1:30,each=12) # define selection vector
  mydata[Z==i]           # use selection vector to get groups of variables (x12)
})

recodeDf <- function(df) {
  result <- as.numeric(apply(df, 1, function(x) {
    if (any(!is.na(x))) which(x == 1) else NA # return the position of the "1" per row
  }))                                         # the if/else check handles all-NA rows in the real data
  return(result)
}
# Combine individual position vectors into a dataframe
final.df <- as.data.frame(do.call(cbind,lapply(S1,recodeDf)))
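
For reference, the timing can be reproduced by wrapping the recoding step in system.time (a sketch; numbers will of course differ per machine):

system.time(
  final.df <- as.data.frame(do.call(cbind, lapply(S1, recodeDf)))
)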

All in all, there is a double *apply, one across the list and the other across the data frame rows, which makes it a bit slow. Any suggestions? Thanks in advance.

Maxim.K

4 Answers

5

Here is an approach that is basically instantaneous (system.time = 0.1 seconds). It uses data.table's set. The columnMatch component will depend on your data, but if it is every 12 columns, then the following will work.

require(data.table)
MYD <- data.table(mydata)
# a new data.table (changed to numeric : Arun)
newDT <- as.data.table(replicate(30, numeric(nrow(MYD)),simplify = FALSE))
# for each column, which values equal 1
whiches <- lapply(MYD, function(x) which(x == 1))
# create a list of column matches (those you wish to aggregate)
columnMatch <- split(names(mydata), rep(1:30,each = 12))
setattr(columnMatch, 'names', names(newDT))

# cycle through all new columns
# and assign the rows in the new data.table
## Arun: had to generate numeric indices for 
## cycling through 1:12, 13:24 in whiches[[.]]. That was the problem.
for(jj in seq_along(columnMatch)) {
 for(ii in seq_along(columnMatch[[jj]])) {
  set(newDT, j = jj, i = whiches[[ii + 12 * (jj-1)]], value = ii)
 }
}

This would work just as well adding columns by reference to the original.

Note set works on data.frames as well....
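
For example, a minimal sketch (not from the original answer) of the same fill loop targeting an ordinary data.frame, reusing whiches and columnMatch from above:

# sketch: same loop, but filling a plain data.frame initialised with NA
newDF <- as.data.frame(matrix(NA_real_, nrow = nrow(mydata), ncol = 30))
names(newDF) <- names(newDT)            # reuse the column names from above
for (jj in seq_along(columnMatch)) {
  for (ii in seq_along(columnMatch[[jj]])) {
    set(newDF, j = jj, i = whiches[[ii + 12 * (jj - 1)]], value = ii)
  }
}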

Arun
mnel
  • I am not sure what is wrong, but this code does not give me the result. Instead I get a data.table (newDT) which contains variable names instead of values. I imagine that these correspond to the values I seek, e.g. V1_8 refers to 8. Still a valuable suggestion with "set", thank you. – Maxim.K Apr 11 '13 at 08:03
  • @mnel, brilliant answer. I've made some corrections. The access to `whiches[[.]]` was not right. It was going through the same 1:12 for every `jj` when for ex: for `jj = 2`, `ii` must be `13:24`. Hope you don't mind the edit. Feel free to edit/roll back if you're not convinced. Maxim, you should get the desired result now. And yes, it *is* fast! – Arun Apr 11 '13 at 08:46
4

IIUC, you have only one 1 per set of 12 columns, and the rest are 0's or NA's. If so, the operation can be performed much faster with this idea.

The idea: Instead of going through each row and asking for the position of 1, you could use a matrix with dimensions 1500 * 12 where each row is just 1:12. That is:

mul.mat <- matrix(rep(1:12, nrow(DT)), ncol = 12, byrow=TRUE)

Now, you can multiply this matrix with each of your subsetted data.frames (of the same dimensions, 1500*12 here) and then take their "rowSums" (which is vectorised) with na.rm = TRUE. This directly gives the column position of the 1 in each row (because that 1 will have been multiplied by the corresponding value between 1 and 12).
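
In plain base R, the idea looks like this for the first set of 12 columns (a sketch of the principle, not the full solution):

mul.mat <- matrix(rep(1:12, nrow(mydata)), ncol = 12, byrow = TRUE)
head(rowSums(mydata[, 1:12] * mul.mat, na.rm = TRUE))  # position of the 1 in each row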


data.table implementation: Here, I'll use data.table to illustrate the idea. Since it creates columns by reference, I'd expect that the same idea used on a data.frame would be a tad slower, although it should still drastically speed up your current code.

require(data.table)
DT <- data.table(mydata)
ids <- seq(1, ncol(DT), by=12)

# for multiplying with each subset and taking rowSums to get position of 1
mul.mat <- matrix(rep(1:12, nrow(DT)), ncol = 12, byrow=TRUE)

for (i in ids) {
    sdcols <- i:(i+12-1)
    # keep appending the new columns by reference to the original data
    DT[, paste0("R", i %/% 12 + 1) := rowSums(.SD * mul.mat, 
                     na.rm = TRUE), .SDcols = sdcols]
}
# delete all original 360 columns by reference from the original data
DT[, grep("V", names(DT), value=TRUE) := NULL]

Now, you'll be left with 30 columns that correspond to the position of 1's. On my system, this takes about 0.4 seconds.

all(unlist(final.df) == unlist(DT)) # not a fan of `identical`
# [1] TRUE
Arun
  • Thanks, Arun. Matrix multiplication is a brilliant idea, I was not even thinking in that direction. Intuitively I expected some kind of neat trick with plyr or reshape, but your suggestion to use data.table is also a very welcome discovery indeed. – Maxim.K Apr 10 '13 at 21:30
4

I really like @Arun's matrix multiplication idea. Interestingly, if you compile R against an OpenBLAS library, you could get this to operate in parallel.

However, I wanted to provide you with another solution, perhaps slower than matrix multiplication, that uses your original pattern but is much faster than your implementation:

# Match is usually faster than which, because it only returns the first match 
# (and therefore won't fail on multiple matches)
# It also neatly handles your *all NA* case
recodeDf2 <- function(df) apply(df,1,match,x=1) 
# You can split your data.frame by column with split.default
# (Using split on data.frame will split-by-row)
S2<-split.default(mydata,rep(1:30,each=12))
final.df2<-lapply(S2,recodeDf2)
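
A quick way to check this against the original result (a sketch; final.df2 above is a list of 30 vectors, so wrap it first):

check.df <- as.data.frame(final.df2)                     # list of 30 vectors -> data.frame
all(unlist(final.df) == unlist(check.df), na.rm = TRUE)  # expected TRUE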

If you had a very large data frame, and many processors, you may consider parallelizing this operation with:

library(parallel)
final.df2<-mclapply(S2,recodeDf2,mc.cores=numcores) 
# Where numcores is your number of processors.
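
numcores is not defined above; one option (a sketch) is to detect it:

numcores <- parallel::detectCores()  # number of cores reported on this machine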

Having read @Arun and @mnel, I learned a lot about how to improve this function, by avoiding the coercion to an array, by processing the data.frame by column instead of by row. I don't mean to "steal" an answer here; OP should consider switching the checkbox to @mnel's answer.

I wanted, however, to share a solution that doesn't use data.table, and avoids for. It is still, however, slower than @mnel's solution, albeit slightly.

nograpes2<-function(mydata) {
  test<-function(df) {
    l<-lapply(df,function(x) which(x==1))
    lens<-lapply(l,length)
    rep.int(seq.int(l),times=lens)[order(unlist(l))]
  }
  S2<-split.default(mydata,rep(1:30,each=12))
  data.frame(lapply(S2,test))
}

I would also like to add that @Aaron's approach, using which with arr.ind=TRUE, would also be very fast and elegant if mydata started out as a matrix rather than a data.frame. Coercion to a matrix is slower than the rest of the function. If speed were an issue, it would be worth considering reading the data in as a matrix in the first place.
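
For instance, the one-off coercion could look like this (a sketch, assuming all 360 columns are numeric):

mydata.mat <- data.matrix(mydata)             # coerce once, up front
idx <- which(mydata.mat == 1, arr.ind = TRUE) # then @Aaron's indexing below applies directly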

nograpes
  • nograpes, (+1) Thanks. In my experience with parallel jobs, unless the task you're parallelising is "heavy", the overhead to create jobs and combine results after completion is *so much higher* that they turn out to be slower. It'd be interesting to benchmark on 1 processor and a cluster of processors. I don't think the actual operation is "heavy" here. I'll try to do it if I manage to squeeze some time. – Arun Apr 10 '13 at 21:43
  • Thank you. I also liked @Arun's suggestion about matrix multiplication. I find your code more robust for real data application though. The multiplication approach depends on the cleanness of the data, otherwise the row sum would be incorrect. I did my best to remove the irregularities, but one can never know. The code does very well in terms of speed, 0.25 seconds. Great suggestions. – Maxim.K Apr 10 '13 at 21:44
  • Using apply on a data.frame will coerce it to a matrix, which is not efficient. – mnel Apr 11 '13 at 00:26
  • @mnel, I just realized `rowSums` also coerces to a matrix. Very nice use of column operations. Your solution is by far the fastest. – Arun Apr 11 '13 at 09:26
4

Another way this could be done with base R is to simply get the values you want to put in the new matrix and fill them in directly with matrix indexing.

idx <- which(mydata==1, arr.ind=TRUE)   # get indices of 1's
i <- (idx[,2] - 1) %% 12 + 1            # position of the 1 within its group of 12
idx[,2] <- ((idx[,2] - 1) %/% 12) + 1   # get "group" and put in "col" of idx
out <- array(NA, dim=c(1500,30))        # make empty matrix
out[idx] <- i                           # and fill it in!
Aaron left Stack Overflow
  • A very interesting approach, thank you. Unfortunately, it does not work with the original data, in all likelihood due to the fact that some rows only contain NA. It does work very well indeed with the simulated data, and of course the actual data can be adjusted. – Maxim.K Apr 11 '13 at 09:12
  • ADDENDUM: it actually does work with the original data, not sure what went wrong the first time. Thanks again. – Maxim.K Apr 11 '13 at 09:28