In Revolution R 2.12.2 on Windows 7 and 64-bit Ubuntu 11.04, I have a data frame with over 100K rows and over 100 columns. I derive ~5 columns (sqrt, log, log10, etc.) for each of the original columns and add them to the same data frame. Run sequentially with foreach and %do%, this works fine, but it's slow. When I try to parallelize it with foreach and %dopar%, the workers cannot access the global environment (to prevent race conditions or something like that), so I cannot modify the data frame: the data frame object is 'not found.'

My question is: how can I make this faster? In other words, how can I parallelize over either the columns or the transformations?

Simplified example:

require(foreach)    
require(doSMP)
w <- startWorkers()
registerDoSMP(w)

transform_features <- function()
{    
    cols<-c(1,2,3,4) # in my real code I select certain columns (not all)

    foreach(thiscol=cols, mydata) %dopar% { 
        name <- names(mydata)[thiscol]
        print(paste('transforming variable ', name))
        mydata[,paste(name, 'sqrt', sep='_')] <<- sqrt(mydata[,thiscol])
        mydata[,paste(name, 'log', sep='_')] <<- log(mydata[,thiscol])
    }
}


n<-10 # I often have 100K-1M rows
mydata <- data.frame(
    a=runif(n,1,100),
    b=runif(n,1,100),
    c=runif(n,1,100),
    d=runif(n,1,100)
    )

ncol(mydata) # 4 columns

transform_features()

ncol(mydata) # if it works, there should be 8

Notice that if you change %dopar% to %do%, it works fine.

Andrew

3 Answers

Try the := operator in data.table to add the columns by reference. You'll need with=FALSE so you can put the call to paste on the LHS of :=.

See "When should I use the := operator in data.table?"
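
A minimal sketch of the by-reference approach, assuming a reasonably recent data.table. There the left-hand side of := is wrapped in parentheses instead of passing with=FALSE (my understanding is that this replaced the with=FALSE form in later releases). The column subset `cols` and the list `funs` are illustrative names, not part of the question.

```r
library(data.table)

set.seed(1)
n <- 10
dt <- data.table(
    a = runif(n, 1, 100),
    b = runif(n, 1, 100),
    c = runif(n, 1, 100),
    d = runif(n, 1, 100)
)

cols <- c("a", "b")                   # hypothetical subset of columns to transform
funs <- list(sqrt = sqrt, log = log)  # hypothetical set of transformations

for (fname in names(funs)) {
    new_names <- paste(cols, fname, sep = "_")
    # (new_names) on the LHS means "use the character vector as column names";
    # the columns are added by reference, so dt itself is not copied
    dt[, (new_names) := lapply(.SD, funs[[fname]]), .SDcols = cols]
}

ncol(dt)  # 4 original + 2 columns x 2 transformations
```

This is not parallel, but it avoids copying the whole table for each new column, which is often where the time goes with a large data frame.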

Matt Dowle

Might it be easier if you did something like this?

n<-10
mydata <- data.frame(
    a=runif(n,1,100),
    b=runif(n,1,100),
    c=runif(n,1,100),
    d=runif(n,1,100)
    )

mydata_sqrt <- sqrt(mydata)  
colnames(mydata_sqrt) <- paste(colnames(mydata), 'sqrt', sep='_')

mydata <- cbind(mydata, mydata_sqrt)

This produces something like:

> mydata
           a         b         c        d   a_sqrt   b_sqrt   c_sqrt   d_sqrt
1  29.344088 47.232144 57.218271 58.11698 5.417018 6.872565 7.564276 7.623449
2   5.037735 12.282458  3.767464 40.50163 2.244490 3.504634 1.940996 6.364089
3  80.452595 76.756839 62.128892 43.84214 8.969537 8.761098 7.882188 6.621340
4  39.250277 11.488680 38.625132 23.52483 6.265004 3.389496 6.214912 4.850240
5  11.459075  8.126104 29.048527 76.17067 3.385126 2.850632 5.389669 8.727581
6  26.729365 50.140679 49.705432 57.69455 5.170045 7.081008 7.050208 7.595693
7  42.533937  7.481240 59.977556 11.80717 6.521805 2.735186 7.744518 3.436157
8  41.673752 89.043099 68.839051 96.15577 6.455521 9.436265 8.296930 9.805905
9  59.122106 74.308573 69.883037 61.85404 7.689090 8.620242 8.359607 7.864734
10 24.191878 94.059012 46.804937 89.07993 4.918524 9.698403 6.841413 9.438217
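
The same idea can be stretched to a subset of columns and several transformations without losing much simplicity. This is a sketch only; `cols` and `funs` are made-up names for illustration.

```r
set.seed(1)
n <- 10
mydata <- data.frame(
    a = runif(n, 1, 100),
    b = runif(n, 1, 100),
    c = runif(n, 1, 100),
    d = runif(n, 1, 100)
)

cols <- c("a", "c")                                  # hypothetical: only some columns
funs <- list(sqrt = sqrt, log = log, log10 = log10)  # hypothetical transformations

# One small data frame per transformation, each with suffixed column names
derived <- lapply(names(funs), function(fname) {
    out <- as.data.frame(lapply(mydata[cols], funs[[fname]]))
    names(out) <- paste(cols, fname, sep = "_")
    out
})

# A single cbind at the end instead of growing mydata column by column
mydata <- cbind(mydata, do.call(cbind, derived))

ncol(mydata)  # 4 original + 2 columns x 3 transformations
```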
Henry
  • Thanks for the idea. I didn't think about it that way. This code is simpler but not parallel (I have 8 real CPU cores), so I would have to think about how to make it parallel. Also in my real case I only apply the transformation (sqrt, log, etc) to about half the columns in the data frame, so accomplishing that may wipe out the simplicity. – Andrew Oct 26 '11 at 23:47

There are two ways you can handle this:

  1. Loop over each column (or, better yet, a subset of the columns) and apply the transformations to create a temporary data frame, return that, and then do cbind of the list of data frames, as @Henry suggested.

  2. Loop over the transformations, apply each to the data frame, and then return the transformation data frames, cbind, and proceed.
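
A rough sketch of the first option. It assumes the doParallel backend in place of the doSMP backend from the question (an assumption on my part; any registered foreach backend should behave the same way). The key change from the question's code is that each iteration returns its new columns instead of assigning into mydata with <<-, and the results are combined on the master.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)  # hypothetical worker count; adjust to your machine
registerDoParallel(cl)

set.seed(1)
n <- 10
mydata <- data.frame(
    a = runif(n, 1, 100),
    b = runif(n, 1, 100),
    c = runif(n, 1, 100),
    d = runif(n, 1, 100)
)

cols <- c("a", "b", "c", "d")  # in real code, select only some columns

# Each worker builds and returns a small data frame for one column;
# .combine = cbind glues the pieces together in iteration order
derived <- foreach(thiscol = cols, .combine = cbind) %dopar% {
    x <- mydata[[thiscol]]
    out <- data.frame(sqrt(x), log(x))
    names(out) <- paste(thiscol, c("sqrt", "log"), sep = "_")
    out
}

stopCluster(cl)

mydata <- cbind(mydata, derived)
ncol(mydata)  # 4 original + 4 columns x 2 transformations
```

Because the workers only read mydata and never write to it, the 'not found' problem does not come up. The cost is that the data has to be shipped to each worker, so for cheap functions like sqrt the parallel version may not beat a plain vectorized call.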

Personally, the way I tend to do things like this is to create a big.matrix object (either in memory or on disk, using the bigmemory package), so that every worker can access all of the columns in shared memory. Just pre-allocate the columns you will fill in, and you won't need to do a post hoc cbind. I tend to do it on disk. If you do, be sure to run flush() to make sure everything is written to disk.
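
A sketch of the shared-memory idea, using an in-memory big.matrix and, as an assumption, the doParallel backend. The matrix is pre-allocated with room for the derived columns, and each worker attaches to it through its descriptor and writes its own column. The layout (derived column j stored at position 4 + j) is made up for illustration; a file-backed matrix would follow the same pattern, plus the flush() call mentioned above.

```r
library(bigmemory)
library(foreach)
library(doParallel)

set.seed(1)
n <- 10
x <- matrix(runif(n * 4, 1, 100), ncol = 4)

# Pre-allocate 4 original + 4 derived columns in shared memory
bm <- big.matrix(nrow = n, ncol = 8, type = "double")
bm[, 1:4] <- x

desc <- describe(bm)  # small descriptor the workers use to find the matrix

cl <- makeCluster(2)  # hypothetical worker count
registerDoParallel(cl)

res <- foreach(j = 1:4, .packages = "bigmemory") %dopar% {
    m <- attach.big.matrix(desc)  # attach to the same shared matrix
    m[, 4 + j] <- sqrt(m[, j])    # each worker writes a distinct column
    NULL                          # nothing needs to be sent back
}

stopCluster(cl)

bm[1:3, ]  # derived columns are visible on the master without a cbind
```

Since every worker writes to a different column, there is no contention over the same cells. Note that a big.matrix holds a single numeric type, so this fits all-numeric data better than a mixed data frame.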

Iterator
  • Originally I didn't think about looping over transformations. This may be easier. I'll look into your suggestions. – Andrew Oct 27 '11 at 03:32