
I'm running out of memory on a normal 8GB server working with a fairly small dataset in a machine learning context:

> dim(basetrainf) # this is a dataframe
[1] 58168   118

The only pre-modeling step I take that significantly increases memory consumption is converting the data frame to a model matrix. I have to do this because caret, cor, etc. only work with (model) matrices. Even after removing factors with many levels, the matrix (mergem below) is fairly large. (sparse.model.matrix/Matrix is poorly supported in general, so I can't use that.)

> lsos()
                 Type      Size PrettySize   Rows Columns
mergem         matrix 879205616   838.5 Mb 115562     943
trainf     data.frame  80613120    76.9 Mb 106944     119
inttrainf      matrix  76642176    73.1 Mb    907   10387
mergef     data.frame  58264784    55.6 Mb 115562      75
dfbase     data.frame  48031936    45.8 Mb  54555     115
basetrainf data.frame  40369328    38.5 Mb  58168     118
df2        data.frame  34276128    32.7 Mb  54555     103
tf         data.frame  33182272    31.6 Mb  54555      98
m.gbm           train  20417696    19.5 Mb     16      NA
res.glmnet       list  14263256    13.6 Mb      4      NA

Also, since many R models don't support example weights, I had to first oversample the minority class, doubling the size of my dataset (which is why trainf, mergef, and mergem have twice as many rows as basetrainf).

R is at this point using 1.7GB of memory, bringing my total memory usage up to 4.3GB out of 7.7GB.

The next thing I do is:

> m = train(mergem[mergef$istrain,], mergef[mergef$istrain,response], method='rf')

Bam - in a few seconds, the Linux out-of-memory killer kills rsession.

I can sample my data, undersample instead of oversample, etc., but these are non-ideal. What (else) should I do (differently), short of rewriting caret and the various model packages I intend to use?

FWIW, I've never run into this problem with other ML software (Weka, Orange, etc.), even without pruning out any of my factors, perhaps because they support both example weighting and data frames across all models.

Complete script follows:

library(caret)
library(Matrix)
library(doMC)
registerDoMC(2)

response = 'class'

repr = 'dummy'
do.impute = F

xmode = function(xs) names(which.max(table(xs)))

read.orng = function(path) {
  # read header
  hdr = strsplit(readLines(path, n=1), '\t')
  pairs = sapply(hdr, function(field) strsplit(field, '#'))
  names = sapply(pairs, function(pair) pair[2])
  classes = sapply(pairs, function(pair)
    if (grepl('C', pair[1])) 'numeric' else 'factor')

  # read data
  dfbase = read.table(path, header=T, sep='\t', quote='', col.names=names, na.strings='?', colClasses=classes, comment.char='')

  # switch response, remove meta columns
  df = dfbase[sapply(pairs, function(pair) !grepl('m', pair[1]) && pair[2] != 'class' || pair[2] == response)]

  df
}

train.and.test = function(x, y, trains, method) {
  m = train(x[trains,], y[trains,], method=method)
  ps = extractPrediction(list(m), testX=x[!trains,], testY=y[!trains,])
  perf = postResample(ps$pred, ps$obs)
  list(m=m, ps=ps, perf=perf)
}

# From 
sparse.cor = function(x){
  memory.limit(size=10000)
  n <- nrow(x)
  # ...
}

# ... (data loading, merging, and oversampling steps) ...

print('remove factors with > 200 levels')
badfactors = sapply(mergef, function(x)
  is.factor(x) && (nlevels(x) > 200))
mergef = mergef[, -which(badfactors)]

print('remove near-zero variance predictors')
mergef = mergef[, -nearZeroVar(mergef)]

print('create model matrix, making everything numeric')
if (repr == 'dummy') {
  dummies = dummyVars(as.formula(paste(response, '~ .')), mergef)
  mergem = predict(dummies, newdata=mergef)
} else {
  mat = if (repr == 'sparse') sparse.model.matrix else model.matrix
  mergem = mat(as.formula(paste(response, '~ .')), data=mergef)
  # remove intercept column
  mergem = mergem[, -1]
}

print('remove high-correlation predictors')
merge.cor = (if (repr == 'sparse') sparse.cor else cor)(mergem)
mergem = mergem[, -findCorrelation(merge.cor, cutoff=.75)]

print('try a couple of different methods')
do.method = function(method) {
  train.and.test(mergem, mergef[response], mergef$istrain, method)
}
res.gbm = do.method('gbm')
res.glmnet = do.method('glmnet')
res.rf = do.method('parRF')
Yang
  • Did you end up switching software, or coming up with a solution in R? I'd be curious to hear what some of your more promising approaches were, as I'm having similar issues. I plan to use increasingly higher-spec'd EC2 machines because they're convenient and I know R very well (until I need to implement some other solution). – lockedoff May 03 '12 at 19:38
  • @lockedoff I ended up just doing a lot more subsampling (one of the "non-ideal" solutions I mentioned - which should also include "buy more RAM")! – Yang May 04 '12 at 01:29
  • I am now able to evaluate 3x3x3 parameter grids using `caret` on a 350,000 x 30 dataframe fairly quickly. This was killing my 8GB quadcore macbook pro when running in parallel (each core was using too much memory), but yesterday I found out that it runs very fast on Amazon's High-Memory Double Extra Large Instance (http://aws.amazon.com/ec2/instance-types/) at about $0.42/hr as a spot instance. – lockedoff May 08 '12 at 16:13

3 Answers


With that much data, the resampled error estimates and the random forest OOB error estimates should be pretty close. Try using trainControl(method = "OOB") and train() will not fit the extra models on resampled data sets.

Also, avoid the formula interface like the plague.
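For example, a minimal sketch reusing the objects from the question (in recent caret versions the resampling string is lowercase "oob"):

library(caret)

# out-of-bag error estimates: no bootstrap resamples, so no extra model fits
ctrl = trainControl(method = "oob")

# non-formula interface: pass the predictor matrix and response directly
m.rf = train(x = mergem[mergef$istrain, ],
             y = mergef[mergef$istrain, response],
             method = "rf",
             trControl = ctrl)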

You also might try bagging instead. Since there is no random selection of predictors at each split, you can get good results with 50-100 resamples (instead of the many more needed for random forests to be effective).
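A rough sketch of that via caret's bagged-tree method (assuming the nbagg argument is forwarded to the underlying ipred code, which does the bagging):

m.bag = train(x = mergem[mergef$istrain, ],
              y = mergef[mergef$istrain, response],
              method = "treebag",
              nbagg = 50,   # ~50 bagged trees rather than hundreds
              trControl = trainControl(method = "oob"))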

Others may disagree, but I also think that modeling all the data you have is not always the best approach. Unless the predictor space is large, many of the data points will be very similar to others and don't contribute much to the model fit (beyond additional computational complexity and a larger resulting object). caret has a function called maxDissim that might be helpful for thinning the data (although it is not terribly efficient either)
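A hypothetical thinning sketch along those lines (the subset sizes here are arbitrary, and the corresponding rows of mergef would need the same subsetting):

library(caret)

set.seed(1)
start = sample(nrow(mergem), 1000)               # small random starting subset
pool = setdiff(seq_len(nrow(mergem)), start)     # remaining candidate rows
extra = maxDissim(mergem[start, ], mergem[pool, ], n = 4000)  # indices into the pool
keep = c(start, pool[extra])
mergem.small = mergem[keep, ]                    # thinned training matrix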

Max

Check that the underlying randomForest code is not storing the forest of trees. Perhaps reduce the tuneLength so that fewer values of mtry are being tried.

Also, I would probably just fit a single random forest by hand to see if I could fit such a model on my machine. If you can't fit one directly, you won't be able to use caret to fit many in one go.

At this point I think you need to work out what is causing the memory to balloon and how you might control the model fitting so it doesn't balloon out of control. So work out how caret is calling randomForest() and what options it is using. You might be able to turn some of those off (like storing the forest I mentioned earlier, but also the variable importance measures). Once you've determined the optimal value for mtry, you can then try to fit the model with all the extras you might want to help interpret the fit.
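As a sketch of that kind of hand-fitting experiment (reusing the question's objects; keep.forest, importance and tuneLength are standard randomForest/caret arguments):

library(randomForest)
library(caret)

x = mergem[mergef$istrain, ]
y = mergef[mergef$istrain, response]

# single hand-fitted forest: don't keep the trees or compute variable importance
rf1 = randomForest(x, y,
                   ntree = 200,
                   mtry = floor(sqrt(ncol(x))),
                   importance = FALSE,
                   keep.forest = FALSE)

# if that fits, ask caret to try only one mtry value and use OOB estimates
m = train(x, y, method = "rf", tuneLength = 1,
          trControl = trainControl(method = "oob"))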

Gavin Simpson
  • I was afraid you'd say this. randomForest itself does consume a ridiculous amount of memory. ntree=500 gives "Error: cannot allocate vector of size 384.7 Mb." ntree=200 works, but nearly maxes out memory. Looks like I'd have to specially treat RF (and other models like GBM), or just ditch R. Argh, why is everything in R taking up so much memory? I was really hoping I was doing something wrong or missing something. I'll mark your answer as accepted if I don't hear anything else. – Yang Jun 23 '11 at 21:13
  • @Yang did you call that with `keep.forest = FALSE`? If not, do that. Also, did you fit using the formula interface or the normal interface? Make sure you are using matrices, and not data frames. Why does `mergem` have twice as many rows as `basetrainf`? I understand why the number of columns is larger, but not why there are twice as many rows. Showing us **exactly** what you did helps, so that we aren't left guessing. Please edit your Q with example code you tried (the actual calls). – Gavin Simpson Jun 23 '11 at 21:21
  • @Yang R holds all objects in memory and functions can quite often copy objects when they are sliced (subset) or reassigned. This is an issue with R in general but there are ways round some of the problems and you could always throw more RAM at a problem these days. – Gavin Simpson Jun 23 '11 at 21:23
  • Scratch all that. Even ntree=200 eventually triggers the OOM killer. So what I said earlier should read: "Looks like I'd have to either rewrite RF (and other models like GBM) or ditch R." I'm editing my question now to provide more details, but the reason why mergem has 2x the rows of basetrainf is due to the oversampling I mentioned. Also I'm using matrices. – Yang Jun 23 '11 at 22:22
  • Posted full source code. Also, I actually *want* the forest to be kept, so that I can perform predictions on other data. – Yang Jun 23 '11 at 22:36
  • @Yang you've posted code that uses **caret**. When you say above that you did RF with `ntree = 200`, were you calling the code yourself as `randomForest(X, Y, ntree = 200, mtry = 10)`, say, or were you using the code you pasted? If the latter, try the former. As for the weights, `randomForest` does have `classwt` and `strata` to help with sampling in the smaller classes. – Gavin Simpson Jun 23 '11 at 22:42
  • At this point, you probably need to seek expert assistance from the author of `randomForest()` - people do use R for these sorts of tasks on large data sets. Second, you could try profiling the memory used, but for that you'll need to compile R with support for profiling memory. Clean up your workspace before calling the code - if you don't need the other objects in the R workspace get rid of them. Also, try to reduce the other RAM that is in use. What is occupying the other 2.6GB of your RAM? – Gavin Simpson Jun 23 '11 at 22:46
  • @Gavin I was calling randomForest directly (not shown in the pasted source). Thanks for the tip about classwt/strata. The other 2.6GB of RAM is occupied by other users on the server. – Yang Jun 24 '11 at 00:03
  • I haven't seen anyone mention the `nodesize` parameter yet in `randomForest`, which you can set to control the size of the trees grown. It often isn't necessary to grow 'full' trees with this much data. Also, if you want to perform predictions on new data w/out keeping the forest in memory, you can use the `xtest` argument (if you have the new data in hand, obviously). – joran Jul 02 '11 at 05:42
  • Also, it's often not necessary (with large datasets) to resample the entire data set to get good performance from `randomForest`. If you set `sampsize` to something reasonable you'll likely see a dramatic reduction in memory use (rough sketch below)... – joran Jul 02 '11 at 05:54
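A rough illustration of the arguments mentioned in the last two comments (the specific values are arbitrary):

library(randomForest)

x = mergem[mergef$istrain, ]                  # training rows from the question
y = mergef[mergef$istrain, response]
xnew = mergem[!mergef$istrain, ]              # held-out rows to predict

rf2 = randomForest(x, y,
                   ntree = 200,
                   nodesize = 50,             # larger terminal nodes => smaller trees
                   sampsize = 10000,          # rows drawn per tree instead of all of them
                   xtest = xnew,              # predictions end up in rf2$test$predicted
                   keep.forest = FALSE)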

You can try to use the ff package, which implements "memory-efficient storage of large data on disk and fast access functions".

rafalotufo
  • Don't randomly throw recommendations for ff or bigmemory out there. The OP asked about help with the caret package and neither ff nor bigmemory works with it. So this answer is somewhere between off-base and misleading. – Dirk Eddelbuettel Jun 23 '11 at 13:28