I am working on a big-data analysis that combines social network data with data about the same users from other internal sources, such as a CRM database.
I realize there are a lot of good memory profiling, CPU benchmarking, and HPC packages and code snippets out there. I'm currently using the following:
- system.time() to measure the CPU and elapsed time of my function calls
- Rprof(tf <- "rprof.log", memory.profiling=TRUE) to profile memory usage
- Rprofmem("Rprofmem.out", threshold = 10485760) to log memory allocations that exceed 10 MB
- require(parallel) to give me multicore and parallel functionality for use in my functions
- source('http://rbenchmark.googlecode.com/svn/trunk/benchmark.R') to benchmark CPU usage differences between single-core and parallel modes
- sort(sapply(ls(), function(x){format(object.size(get(x)), units = "Mb")})) to list object sizes
- print(object.size(x=lapply(ls(), get)), units="Mb") to report the total size of all objects in the workspace when the script completes
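For context, here is roughly how I wire a few of these together for a single measured run (analyze_network() and edges below are placeholders for my actual analysis function and social-network input data):

    ## One measured run; analyze_network() and `edges` are placeholders
    Rprof(tf <- "rprof.log", memory.profiling = TRUE)        # start profiling
    timing <- system.time(result <- analyze_network(edges))  # time the analysis
    Rprof(NULL)                                              # stop profiling

    summaryRprof(tf, memory = "both")          # per-function time and memory use
    timing["elapsed"]                          # wall-clock seconds for the run
    print(object.size(result), units = "Mb")   # size of the result object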
The tools above give me lots of good data points, and I know that many more tools exist to provide related information, as well as to minimize memory use and make better use of HPC/cluster technologies, such as those mentioned in this StackOverflow post and in CRAN's HPC task view. However, I don't know a straightforward way to synthesize this information and forecast my CPU, RAM, and/or storage requirements as the size of my input data grows over time with increasing usage of the social network I'm analyzing.
Can anyone give examples or make recommendations on how to do this? For instance, is it possible to build a chart or a regression model that shows how many CPU cores I will need as my input data grows, holding CPU speed and the time the scripts are allowed to take constant?
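To make the question more concrete, below is a rough sketch of the kind of forecasting I have in mind, with made-up numbers standing in for measurements I would collect by rerunning my script at increasing input sizes. I don't know whether a simple linear fit like this is even a sound approach, which is really what I'm asking:

    ## Made-up illustration: each row would come from one benchmark run of the
    ## script at a given input size (seconds from system.time(), MB from
    ## object.size()/Rprofmem output)
    runs <- data.frame(
      n_rows  = c(1e5, 2e5, 4e5, 8e5),
      elapsed = c(12, 26, 55, 118),
      mem_mb  = c(180, 350, 720, 1450)
    )

    ## Fit simple scaling models; a log-log or quadratic fit might be more
    ## appropriate depending on the algorithm's actual complexity
    time_model <- lm(elapsed ~ n_rows, data = runs)
    mem_model  <- lm(mem_mb  ~ n_rows, data = runs)

    ## Extrapolate to a future input size, e.g. 5 million rows
    future <- data.frame(n_rows = 5e6)
    predict(time_model, future)    # projected runtime in seconds
    predict(mem_model,  future)    # projected memory in MB

    ## Naive translation of projected runtime into a core count, assuming
    ## near-perfect parallel scaling and a fixed one-hour wall-clock budget
    ceiling(predict(time_model, future) / 3600)

The last line is the part I'm least sure about, since it ignores communication overhead and the parts of the script that don't parallelize.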