
I'm trying to run a SIMPER analysis (from the vegan package) on a large dataset in R. I've had some success running it on a local machine (10 cores, 16 GB RAM) with a smaller dataset. However, when I expand the analysis to a larger dataset, the code terminates with an error such as:

Error: cannot allocate vector of size XX Gb

So I tried the same analysis on an Amazon AWS instance (specifically an r3.8xlarge: 32 cores, 244 GB RAM) and I'm getting the same error, this time:

Error: cannot allocate vector of size 105.4 Gb

Both systems I've tried (local and AWS) are Ubuntu boxes with this sessionInfo():

R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Here are the relevant lines of code I'm running:

# read in data as DFs for mapping file
print("Loading mapping file...")
map_df = read.table(map, sep="\t", header=TRUE, strip.white=T)
rownames(map_df) = map_df[,1] # make first column the index so that we can join on it
map_df[,1] <- NULL # remove first column (we just turned it into the index)

# read in data as DF for biom file
print("Loading biom file...")
biom_df = read.table(biom_file, sep="\t", header=TRUE, stringsAsFactors=FALSE) # pass stringsAsFactors to read.table itself; wrapping the result in data.frame(..., stringsAsFactors=FALSE) has no effect because the columns are already factors by then
biom_cols = dim(biom_df)[2] # number of columns in biom file, represents all the samples
otu_names <- as.vector(biom_df[,biom_cols]) # get otu taxonomy (last column) and save for later
biom_df[,biom_cols] <- NULL # remove taxonomy column
biom_df <- t(biom_df) # transpose to get OTUs as columns
biom_cols = dim(biom_df)[2] # number of columns in biom file, represents all the OTUs (now that we've transposed)

# merge our biom_df with map_df so that we reduce the samples down to those given in map_df
merged = merge(biom_df, map_df, by="row.names")
merged_cols = dim(merged)[2]

# clear some memory
rm(biom_df)
print("Total memory used:")
print(object.size(x=lapply(ls(), get)), units="Mb")


# simper analysis
print("Running simper analysis...")
sim <- simper(merged[,2:(biom_cols+1)], merged[,merged_cols], parallel=10)
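For scale, here's the back-of-the-envelope arithmetic I used to sanity-check the failing allocation. The sizes below are hypothetical placeholders (not my real data), and the assumption that simper materialises one pairs-by-OTUs matrix of doubles is mine, not verified against vegan's internals:

```r
# Rough memory estimate, assuming (unverified) that simper builds a
# pairs-by-OTUs contribution matrix of doubles per group comparison.
# Placeholder sizes -- substitute your own dim(merged) values.
n_samples <- 500     # rows fed to simper
n_otus    <- 100000  # columns (OTUs)

n_pairs <- n_samples * (n_samples - 1) / 2  # number of sample pairs
bytes   <- n_pairs * n_otus * 8             # 8 bytes per double
cat(sprintf("~%.1f Gb for one contribution matrix\n", bytes / 1024^3))
# → ~92.9 Gb for one contribution matrix
```

Even modest sample and OTU counts blow past the RAM on either box, which matches the 105.4 Gb in the error.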

Any thoughts?

Constantino

1 Answer


It's not clear from the information you provided at which point your machine runs out of memory. You seem to be using base R functions throughout. You might want to give the data.table package a try, in particular its fread function, which is much faster than read.table.
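A minimal sketch of swapping in fread (the temp file and column names here are toy stand-ins for your biom table):

```r
library(data.table)

# toy stand-in for the real biom table
tmp <- tempfile(fileext = ".tsv")
writeLines(c("s1\ts2\ttaxonomy",
             "1\t2\tOTU_a",
             "3\t4\tOTU_b"), tmp)

biom_dt <- fread(tmp, sep = "\t", header = TRUE)  # fast, memory-frugal reader
setDF(biom_dt)  # convert in place if downstream code expects a data.frame
```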

Caner
  • it runs out somewhere during the simper() call – Constantino Jan 06 '15 at 16:47
  • It seems simper only needs the merged dataframe, biom_cols and merged_cols. Have you tried deleting everything from memory except those? You should be able to do that with rm(list=setdiff(ls(), c("merged", "biom_cols", "merged_cols"))). – Caner Jan 06 '15 at 17:15
  • of the two dataframes I create (biom_df & map_df) only biom_df is very large (map_df is very small). I've already included a `rm(biom_df)` to clear up memory, therefore there isn't much other memory to clear (I don't think) – Constantino Jan 06 '15 at 17:59
  • Without knowing what the sizes of the data frames you are dealing with are it is hard to say whether it would be helpful or not. You can also try creating the two input data frames separately before feeding it into simper() and getting rid of merged. This won't be helpful though if the union of merged_cols and biom_cols equals all the columns in merged. – Caner Jan 06 '15 at 18:16
  • You might also try to take advantage of data.table's pass by reference property so that you would minimize duplications of dataframes in memory. I'm not sure what the exact mechanics are but found this article [link](http://stackoverflow.com/questions/10225098/understanding-exactly-when-a-data-table-is-a-reference-to-vs-a-copy-of-another). – Caner Jan 06 '15 at 18:21
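The pass-by-reference behaviour mentioned in the last comment can be illustrated as follows (a sketch; `:=` modifies the table in place rather than copying it):

```r
library(data.table)

dt  <- data.table(x = 1:3)
dt2 <- dt            # no copy: both names point at the same table
dt2[, y := x * 2L]   # := adds the column by reference ...
"y" %in% names(dt)   # ... so dt sees it too (TRUE)
```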