
I would like to read in a number of CSV files (~50), run a number of operations, and then use write.csv() to output a master file. Since the CSV files are on the larger side (~80 MB each), I was wondering if it might be more efficient to open two instances of R, reading in half the CSVs in one instance and half in the other. I would then write each half to a large CSV, read both of those back in, and combine them into a master CSV. Does anyone know whether running two instances of R will improve the time it takes to read in all the CSVs?

I'm using a MacBook Pro (OS X 10.6) with 4 GB of RAM.
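For reference, here is a minimal sketch of the single-process version I have in mind (the folder name and the operations step are placeholders; I'm assuming all files share the same columns):

csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # ~50 files, ~80 MB each
tables <- lapply(csv_files, read.csv)          # the slow reading step
# ... run a number of operations on each table ...
master <- do.call(rbind, tables)               # combine into one data frame
write.csv(master, "master.csv", row.names = FALSE)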

Waldir Leoncio
  • I can tell you from experience that writing an 80 MB CSV is not very slow. But that, on the other hand, depends on what you mean by "slow" in your context. These questions might be helpful: http://stackoverflow.com/questions/12013953/write-csv-for-large-data-table-in-r and http://stackoverflow.com/questions/9703068/most-efficient-way-of-exporting-large-3-9-mill-obs-data-frames-to-text-file – JEquihua Jul 18 '13 at 17:34
  • Those are helpful, but the problem I'm referring to is the lag in _loading_ the CSV files. – Jacob Rosenberg-Wohl Jul 18 '13 at 17:46
  • 4
  • Have you looked at `fread` in the data.table package? – Dirk Eddelbuettel Jul 18 '13 at 19:53
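A minimal sketch of that approach (assuming the data.table package is installed; the folder name is a placeholder):

library(data.table)
csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(csv_files, fread)   # fread is typically much faster than read.csv
master <- rbindlist(tables)          # combine into a single data.table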

2 Answers


If the majority of your code's execution time is spent reading the files, then running two R processes will likely be slower, because they will compete for disk I/O. But if the majority of the time is spent "running a number of operations", it will likely be faster.
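A rough way to check which part dominates, as a sketch (the file list here is hypothetical):

# time just the reading step, then compare it with the run time of the rest of your script
csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
read_time <- system.time(tables <- lapply(csv_files, read.csv))
print(read_time)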

Joshua Ulrich
  • The number of cores also matters. R does not use multiple cores on its own, so each instance running the "operations" will occupy one core. – Areza Jul 18 '13 at 18:28

read.table() and related functions can be quite slow. The best way to tell whether you can benefit from parallelization is to time your R script and compare it with the time needed just to read your files. For instance, in a terminal:

time cat *.csv > /dev/null

If the "cat" time is significantly lower, your problem is not I/O bound and you may parallelize. In which case you should probably use the parallel package, e.g

library(parallel)
# vector of paths to your CSV files
csv_files <- c(.....)
# read the files in parallel; mc.cores controls how many worker processes are used
my_tables <- mclapply(csv_files, read.csv)
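To get the single master file afterwards, something along these lines should work (assuming the tables all share the same columns):

master <- do.call(rbind, my_tables)
write.csv(master, "master.csv", row.names = FALSE)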
Karl Forner