8

I have 15 million CSV files, each with two columns (integer and float), and between 5 and 500 rows. Each file looks something like:

3453,0.034
31,0.031
567,0.456
...

Currently, I am iterating over all the files, and using read.csv() to import each file into a big list. Here's a simplified version:

allFileNames = Sys.glob(sprintf("%s/*/*/results/*/*", dir))

s = list()       # initialize the container; s$scores holds one entry per file
s$scores = list()

for (i in 1:length(allFileNames)){
    if ((i %% 1000) == 0){
        cat(sprintf("%d of %d\n", i, length(allFileNames)))
    }

    fileName = allFileNames[i]
    approachID = getApproachID(fileName)
    bugID = getBugID(fileName)

    size = file.info(fileName)$size
    if (!is.na(size) && size > 0){ # make sure file exists and is not empty
        tmp = read.csv(fileName, header=F, colClasses=c("integer", "numeric"))
        colnames(tmp) = c("fileCode", "score")
        s$scores[[approachID]][[bugID]] = tmp
    } else {
        # File does not exist, or is empty.
        s$scores[[approachID]][[bugID]] = matrix(-1, ncol=2, nrow=1)
    }
}


Later in my code, I go back through each matrix in the list, and calculate some metrics.
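Roughly, that later pass is a nested walk over s$scores, along these lines (the mean of the score column here is only a placeholder for the real metrics):

# Sketch only: walk the nested list and compute one metric per file;
# mean of column 2 ("score") stands in for the actual calculations.
metrics <- lapply(s$scores, function(perApproach) {
    lapply(perApproach, function(m) mean(m[, 2]))
})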

After starting this import process, it looks like it will take on the order of 3 to 5 days to complete. Is there a faster way to do this?

EDIT: I added more details about my code.

stepthom
    possible duplicate of [Quickly reading very large tables as dataframes in R](http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r) – joran Mar 23 '12 at 16:23
    What are you trying to do? Built up a single matrix with all of the data, or read in and process each matrix separately? – Jeff Allen Mar 23 '12 at 16:32
  • This is relevant to loading many files at once: http://stackoverflow.com/questions/3764292/ – Ben Mar 23 '12 at 16:41
  • @joran I don't know if it's a duplicate question, precisely, although many of the answers there will certainly be useful here. – Matt Parker Mar 23 '12 at 16:45
  • @joran: Not quite a dupe, since this involves many files, but the OP should definitely check out the link. – Richie Cotton Mar 23 '12 at 16:48
  • My bad, read a little too fast. As others have said, probably relevant though. – joran Mar 23 '12 at 16:50

4 Answers

7

I'm not clear on your goal, but if you're trying to read all of these files into a single R data structure, then I see two major performance concerns:

  1. File access times - from the moment you request read.csv, a number of complex processes start on your machine: checking that the file exists, finding its location on disk or in cache (and reading the data into memory, if need be), and then parsing the data within R. I would expect this to be a nearly constant per-file slowdown as you read in millions of files.
  2. Growing your single data structure with each new file read. Every time you want to add a few rows to your matrix, you'll likely need to reallocate a similarly sized chunk of memory to store the larger matrix. If you're growing your array 15 million times, you'll certainly notice a performance slowdown here, and it will get progressively worse as you read in more files.

So do some quick profiling and see how long the reads are taking. If they're slowing down progressively as you read in more files, then let's focus on problem #2. If it's constantly slow, then let's worry about problem #1.
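One quick way to check is to extend the progress message that is already in your loop with an elapsed-time readout; a minimal, untested sketch:

# Before the loop:
start <- proc.time()

# Inside the loop, in place of the existing progress message:
if ((i %% 1000) == 0) {
    cat(sprintf("%d of %d, elapsed: %.1f s\n", i, length(allFileNames),
                proc.time()["elapsed"] - start["elapsed"]))
}

If each block of 1000 files takes about the same time, the cost is per-file access (problem #1); if the blocks get progressively slower, the growing data structure (problem #2) is the likely culprit.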

Regarding solutions, I'd say you could start with two things:

  1. Combine the CSV files in another programming language. A simple shell script would likely do the job for you if you're just looping through files and concatenating them into a single large file. As Joshua and Richie mention below, you may be able to optimize this without having to deviate to another language by using the more efficient scan() or readLines() functions.
  2. Pre-size your unified data structure. If you're using a matrix, for instance, set the number of rows to roughly 15 million x 100. That way you only have to find room in memory for the object once, and the rest of the operations just insert data into the pre-sized structure (see the sketch after this list).
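For the second point, a minimal, untested sketch of pre-sizing; since you may prefer to keep one matrix per file, it reserves one list slot per file up front instead of one giant matrix, but the idea is the same (the empty-file handling from the question is omitted for brevity):

# Sketch: allocate all list slots once, then fill them in place,
# rather than growing the structure on every iteration.
scores <- vector("list", length(allFileNames))
for (i in seq_along(allFileNames)) {
    scores[[i]] <- matrix(scan(allFileNames[i], what = 0, sep = ",", quiet = TRUE),
                          ncol = 2, byrow = TRUE)
}
names(scores) <- allFileNames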

Add some more details of your code (what does the list look like that you're using?) and we may be able to be more helpful.

Jeff Allen
    Also note that `scan` is more appropriate than `read.csv` since the data could all be stored as numeric. – Joshua Ulrich Mar 23 '12 at 16:50
  • Or possibly even `readLines` + `writeLines` if OP just wants to combine the files into one. – Richie Cotton Mar 23 '12 at 17:01
  • @Jeff - Thanks for this detailed answer. I don't think I can combine all the files into one big one, because I need them to be separated for subsequent analysis. (Each file represents an execution of my experiment.) As for preallocating my data structure, is there a way to preallocate the size of my list (``s$scores``)? BTW, I don't see the slowdown getting worse and worse, so I think the performance is dominated by disk IO. – stepthom Mar 23 '12 at 17:28
  • @JoshuaUlrich: I tried ``scan`` instead of ``read.csv``, and it appears to be running about twice as fast. For those interested, I used the command ``tmp <- matrix(scan(fileName, what=0, sep=",", quiet=T), ncol=2, byrow=TRUE)``, similar to your comment below. – stepthom Mar 23 '12 at 17:50
6

Using scan (as Joshua states in a comment) could be 3-4 times faster:

scan(fileName, what=list(0L,0.0), sep=",", dec=".", quiet=TRUE)

The main difference is that scan returns a list with two elements, while read.csv returns a data.frame.
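If you still want something data.frame-shaped afterwards, the two list elements can be bound together cheaply; a minimal, untested sketch using the column names from the question:

# scan returns list(integer vector, numeric vector); wrap it into a data.frame.
tmp <- scan(fileName, what = list(0L, 0.0), sep = ",", dec = ".", quiet = TRUE)
tmp <- data.frame(fileCode = tmp[[1]], score = tmp[[2]])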

Marek
  • I was thinking something like: `List[[1]] <- matrix(scan(fileName, what=0, sep=","), ncol=2, byrow=TRUE)`. – Joshua Ulrich Mar 23 '12 at 17:06
    Thanks guys. As I mentioned in another comment, by switching to ``tmp <- matrix(scan(fileName, what=0, sep=",", quiet=T), ncol=2, byrow=TRUE)``, my code is running about twice as fast. After doing a ``system.time()``, it appears that the majority of the required time is in IO, so I guess I'll just have to wait a little bit. – stepthom Mar 23 '12 at 18:16
2

How about this general workflow? Not tested, though.

my.list.of.files <- list.files(pattern = "\\.txt$") # character vector of filenames
my.data <- sapply(my.list.of.files, FUN = function(x) {
    # read file using scan, craft the output to two columns
}) # result is merged

# or, if you use simplify = FALSE
my.data <- sapply(my.list.of.files, FUN = function(x) {
    # read file using scan (or some other method), craft the output to two columns
}, simplify = FALSE) # you get a list
my.data <- do.call("rbind", my.data)
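One way to fill in that template, assuming the OP's two-column comma-separated files (with a .csv extension) and using scan as suggested elsewhere on this page (untested):

# Sketch: read each file with scan into a two-column matrix, then stack with rbind.
my.list.of.files <- list.files(pattern = "\\.csv$")
my.data <- lapply(my.list.of.files, function(x) {
    matrix(scan(x, what = 0, sep = ",", quiet = TRUE), ncol = 2, byrow = TRUE)
})
my.data <- do.call("rbind", my.data)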
Roman Luštrik
  • I know this is just a template, but note that if you had all 15 million files in a single directory, that may be another cause of your performance woes, as well. – Jeff Allen Mar 23 '12 at 17:08
    Note that `sapply` is a bit slow because R has to do extra work figuring out how to simplify the output. Of the `*apply` family, `lapply` is the fastest, hence maybe more appropriate here. – flodel Mar 24 '12 at 01:08
  • @Jeff, this isn't something we can account for. @flodel, even when you specify `simplify = FALSE`? – Roman Luštrik Mar 24 '12 at 19:04
0

As Jeff mentioned, there are several things here that could take a long time. The problem could be file access, or reading in the files, or running out of memory when you have 15 million data frames in RAM. To compound the issue, the bottleneck could vary depending on your machine's spec (e.g., a slow hard drive will slow down reading in the files; a lack of RAM will be a problem with this many files held in memory). To figure out the problem, you'll have to do some profiling.

Try just reading in 10000 or so files to begin with, and calling system.time or, more sophisticatedly, using rbenchmark to see what takes the most time.
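For example, a minimal, untested sketch comparing the two readers on a single file with rbenchmark (this assumes the rbenchmark package is installed and that fileName points at one of your sample files):

library(rbenchmark)
benchmark(
    read.csv = read.csv(fileName, header = FALSE,
                        colClasses = c("integer", "numeric")),
    scan     = matrix(scan(fileName, what = 0, sep = ",", quiet = TRUE),
                      ncol = 2, byrow = TRUE),
    replications = 100
)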

Then look at joran's link

Quickly reading very large tables as dataframes in R

and see if any of the techniques there help you.

Richie Cotton