I am working on a big-data analysis that combines social network data with data about the same users from other internal sources, such as a CRM database.
I realize there are a lot of good memory profiling, CPU benchmarking, and HPC packages and code snippets out there. I'm currently using the following:
- system.time() to measure the CPU and elapsed time of my function calls
- Rprof(tf <- "rprof.log", memory.profiling=TRUE) to profile memory usage
- Rprofmem("Rprofmem.out", threshold = 10485760) to log memory allocations that exceed 10 MB
- require(parallel) to give me multicore and parallel functionality for use in my functions
- source('http://rbenchmark.googlecode.com/svn/trunk/benchmark.R') to benchmark CPU usage differences between single-core and parallel modes
- sort(sapply(ls(), function(x){format(object.size(get(x)), units = "Mb")})) to list object sizes
- print(object.size(x=lapply(ls(), get)), units="Mb") to report the total size of all objects in the workspace when the script completes
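For context, here is roughly how I wire a few of these together for a single measured run (analyze_network() and edges below are placeholders for my actual analysis function and social-network input data):

    ## One measured run; analyze_network() and `edges` are placeholders
    Rprof(tf <- "rprof.log", memory.profiling = TRUE)        # start profiling
    timing <- system.time(result <- analyze_network(edges))  # time the analysis
    Rprof(NULL)                                              # stop profiling

    summaryRprof(tf, memory = "both")          # per-function time and memory use
    timing["elapsed"]                          # wall-clock seconds for the run
    print(object.size(result), units = "Mb")   # size of the result object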
The tools above give me lots of good data points, and I know that many more tools exist to provide related information, as well as to minimize memory use and make better use of HPC/cluster technologies, such as those mentioned in this StackOverflow post and in CRAN's HPC task view. However, I don't know a straightforward way to synthesize this information and forecast my CPU, RAM, and/or storage requirements as the size of my input data grows over time with increasing usage of the social network I'm analyzing.
Can anyone give examples or make recommendations on how to do this? For instance, is it possible to build a chart or a regression model that shows how many CPU cores I will need as my input data grows, holding CPU speed and the time the scripts are allowed to take constant?
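To make the question more concrete, below is a rough sketch of the kind of forecasting I have in mind, with made-up numbers standing in for measurements I would collect by rerunning my script at increasing input sizes. I don't know whether a simple linear fit like this is even a sound approach, which is really what I'm asking:

    ## Made-up illustration: each row would come from one benchmark run of the
    ## script at a given input size (seconds from system.time(), MB from
    ## object.size()/Rprofmem output)
    runs <- data.frame(
      n_rows  = c(1e5, 2e5, 4e5, 8e5),
      elapsed = c(12, 26, 55, 118),
      mem_mb  = c(180, 350, 720, 1450)
    )

    ## Fit simple scaling models; a log-log or quadratic fit might be more
    ## appropriate depending on the algorithm's actual complexity
    time_model <- lm(elapsed ~ n_rows, data = runs)
    mem_model  <- lm(mem_mb  ~ n_rows, data = runs)

    ## Extrapolate to a future input size, e.g. 5 million rows
    future <- data.frame(n_rows = 5e6)
    predict(time_model, future)    # projected runtime in seconds
    predict(mem_model,  future)    # projected memory in MB

    ## Naive translation of projected runtime into a core count, assuming
    ## near-perfect parallel scaling and a fixed one-hour wall-clock budget
    ceiling(predict(time_model, future) / 3600)

The last line is the part I'm least sure about, since it ignores communication overhead and the parts of the script that don't parallelize.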