
I would like to read in a number of CSV files (~50), run a number of operations, and then use write.csv() to output a master file. Since the CSV files are on the larger side (~80 MB each), I was wondering if it might be more efficient to open two instances of R, reading in half the CSVs in one instance and half in the other. I would then write each half to a large CSV, read both of those back in, and combine them into a master CSV. Does anyone know whether running two instances of R will improve the time it takes to read in all the CSVs?

I'm using a MacBook Pro (OS X 10.6) with 4 GB of RAM.
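For reference, here is a minimal sketch of the single-process version I have in mind (the folder name and the operations step are placeholders; I'm assuming all files share the same columns):

csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # ~50 files, ~80 MB each
tables <- lapply(csv_files, read.csv)          # the slow reading step
# ... run a number of operations on each table ...
master <- do.call(rbind, tables)               # combine into one data frame
write.csv(master, "master.csv", row.names = FALSE)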

Waldir Leoncio
  • I can tell you from experience that writing an 80 MB CSV is not very slow. But that, on the other hand, depends on what you mean by "slow" in your context. These questions might be helpful: http://stackoverflow.com/questions/12013953/write-csv-for-large-data-table-in-r and http://stackoverflow.com/questions/9703068/most-efficient-way-of-exporting-large-3-9-mill-obs-data-frames-to-text-file – JEquihua Jul 18 '13 at 17:34
  • Those are helpful, but the problem I'm referring to is the lag in _loading_ the CSV files. – Jacob Rosenberg-Wohl Jul 18 '13 at 17:46
  • 4
  • Have you looked at `fread` in the data.table package? – Dirk Eddelbuettel Jul 18 '13 at 19:53
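A minimal sketch of that approach (assuming the data.table package is installed; the folder name is a placeholder):

library(data.table)
csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(csv_files, fread)   # fread is typically much faster than read.csv
master <- rbindlist(tables)          # combine into a single data.table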

2 Answers


If the majority of your code's execution time is spent reading the files, then running two R processes will likely be slower, because they will compete for disk I/O. But if the majority of the time is spent "running a number of operations", it will likely be faster.
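A rough way to check which part dominates, as a sketch (the file list here is hypothetical):

# time just the reading step, then compare it with the run time of the rest of your script
csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
read_time <- system.time(tables <- lapply(csv_files, read.csv))
print(read_time)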

Joshua Ulrich
  • The number of cores also matters. R does not use multiple cores on its own, so each instance running the "operations" will occupy one core. – Areza Jul 18 '13 at 18:28

read.table() and related functions can be quite slow. The best way to tell whether you can benefit from parallelization is to time your R script and compare it with the time needed just to read your files. For instance, in a terminal:

time cat *.csv > /dev/null

If the "cat" time is significantly lower, your problem is not I/O bound and you may parallelize. In which case you should probably use the parallel package, e.g

library(parallel)
# vector of paths to your CSV files
csv_files <- c(.....)
# read the files in parallel; mc.cores controls how many worker processes are used
my_tables <- mclapply(csv_files, read.csv)
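To get the single master file afterwards, something along these lines should work (assuming the tables all share the same columns):

master <- do.call(rbind, my_tables)
write.csv(master, "master.csv", row.names = FALSE)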
Karl Forner