
The problem:

Data sets take 6-12 hours to load into R. Much larger data sets are coming, and my current import process clearly isn't ready for them. Once it's all in a data frame the size isn't a problem; most operations take only a few seconds, so my hardware probably isn't the issue.

Note: This question is not a duplicate of similar questions because I have already implemented most of the advice from related threads, e.g. specifying colClasses.

The data:

Rows in the tab-delimited text files look like this:

20  -0.5    1   2   1   1   19  0   119 30  exp(-31.3778)

Loading the data:

I have defined a couple of functions that together loop over the files, load the data into a single data frame, and then save it as a blob. This is the process that takes hours. The process predictably slows down and uses more memory as it progresses; top indicates that R is using > 95% of the CPU and (more importantly?) > 1.5 GB of real memory by the time it's halfway through the data files.

# get numeric log from character data
extract_log <- function(x) {
  expr <- "exp\\((.*)\\)"
  substring <- sub(expr, "\\1", x)
  log <- as.numeric(substring)
  return(log)
}

# reads .dat files into data frames
read_dat <- function(x, colClasses = c(rep("numeric", 10), "character")) {
  df <- read.table(x, header = TRUE, sep = "\t", comment.char = "",
                   colClasses = colClasses)
  df <- cbind(df, log_likelihood = sapply(df$likelihood, extract_log))
  df$likelihood <- exp(df$log_likelihood)
  # drop nat. log col, add log10 column shifting data to max = 0
  df <- transform(df,
                  rlog_likelihood = log10(likelihood) - max(log10(likelihood)))
  return(df)
}

# creates a single data frame from many .dat files
df_blob <- function(path = getwd(), filepattern = "*.dat$",
                    outfile = 'df_blob.r', ...) {
  files <- list.files(path = path, pattern = filepattern, full.names = TRUE)
  progress_bar <- txtProgressBar(min = 0, max = length(files),
                                 title = "Progress", style = 3)
  df <- read_dat(files[1])
  setTxtProgressBar(progress_bar, 1)
  for (f in 2:length(files)) {
    df <- rbind(df, read_dat(files[f]))
    setTxtProgressBar(progress_bar, f)
  }
  close(progress_bar)
  save(df, file = outfile)
}

The Solution

Time required has been reduced from hours to seconds.

  1. Concatenate the data files with a shell script (time required ~12 seconds)
  2. Load the concatenated file with sqldf (time required ~6 seconds)

The concatenation script and the sqldf() load follow exactly the approach described in JD Long's answer to a related question and in his blog post; a sketch of both steps is below.
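For reference, a minimal sketch of the two steps run from R. The combined file name, the awk one-liner that skips the repeated header rows, and the tab separator are my assumptions rather than part of the original scripts; the sqldf call follows the pattern from JD Long's answer.

# Step 1: concatenate the .dat files from the shell (assumed file names);
# awk keeps the header of the first file and skips the headers of the rest
system("awk 'FNR == 1 && NR != 1 {next} {print}' *.dat > combined.dat")

# Step 2: load the combined file through sqldf's SQLite backend
library(sqldf)
df <- read.csv.sql("combined.dat", sql = "select * from file",
                   header = TRUE, sep = "\t")

The likelihood column still arrives as character, so the extract_log() and log10 steps from read_dat() can then be applied once to the combined data frame.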

Lessons learned

Comments by Justin and Joran significantly improved the efficiency of my read.table() approach, and for smaller data sets that approach should work fine. In particular, Justin's advice to replace the loop of rbind(df, read_dat(files[f])) calls with do.call(rbind, lapply(files, read_dat)) cut the execution time by about two-thirds. Improvements from other suggestions were more modest but still worthwhile.
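For smaller data sets, the improved read.table() path might look roughly like the sketch below; read_dat() is the function defined above, and the output file name is just the one used earlier.

# read every file first, then bind all the pieces with a single rbind call
# instead of growing df inside a loop
df_blob <- function(path = getwd(), filepattern = "*.dat$",
                    outfile = "df_blob.r") {
  files <- list.files(path = path, pattern = filepattern, full.names = TRUE)
  df <- do.call(rbind, lapply(files, read_dat))
  save(df, file = outfile)
  invisible(df)
}

Roland's data.table suggestion, rbindlist(lapply(files, fread)), has the same shape and should be faster still, since fread is much quicker than read.table and rbindlist pre-allocates the result.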

  • The step `df <- rbind(df, read_dat(files[f]))` is part of the issue. Replacing that with `df <- do.call(rbind, lapply(files, read_dat))` may help significantly. – Justin Nov 08 '13 at 16:08
  • Also, I would seriously investigate the cost of doing all those transformations on each individual piece inside `read_dat`, rather than all at once using something like `data.table`. – joran Nov 08 '13 at 16:11
  • Have you run `profile` on a (slightly smaller) set of data to see which operations are the big hitters? – Carl Witthoft Nov 08 '13 at 16:16
  • You should try using `require(data.table); rbindlist(lapply(files, fread))` for file reading and then follow @joran's advice. – Roland Nov 08 '13 at 16:16
  • This post is relevant: [quickly-reading-very-large-tables-as-dataframes-in-r](http://stackoverflow.com/questions/1727772/). – gung - Reinstate Monica Nov 08 '13 at 16:16
  • @Justin, what is the difference between rbind() and do.call(rbind)? – Gregory Nov 08 '13 at 17:13
  • `rbindlist` and the `do.call(rbind(...` methods both pre-allocate the result in memory. This is generally much more efficient than "growing" the data frame with `rbind` by itself. – Justin Nov 08 '13 at 17:21
  • @Justin, the do.call(rbind) method appears to be taking about 1/3 the time that rbind() did. I'll try the sqldf method next and accept an answer based on the results. – Gregory Nov 08 '13 at 18:43
  • `sapply(df$likelihood, extract_log)` can be replaced with `extract_log(df$likelihood)`; `do.call(rbind, lapply(...))` will become more performant as the number of files grows; that 1/3rd might severely underestimate the speed-up you'll see over all your files. – Martin Morgan Nov 08 '13 at 21:22

1 Answer


A big problem that you have is that read.table isn't very fast. You can tweak it by setting colClasses and nrows, but at the end of the day, if your data takes 12 hours to load, you need to use a different technology.

A faster approach is to import your data into a database and then read it into R. JD Long demonstrates a method using an SQLite database and the sqldf package in this answer. MonetDB and the MonetDB.R package are designed for doing this sort of thing very quickly and are worth investigating.
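A minimal sketch of that database route using DBI and RSQLite (not the exact code from either linked answer); the database file name, the table name, and the column used in the WHERE clause are placeholders.

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "runs.sqlite")

# import each .dat file once; nothing is grown in memory because each
# piece goes straight into the on-disk table
for (f in list.files(pattern = "\\.dat$", full.names = TRUE)) {
  piece <- read.table(f, header = TRUE, sep = "\t", comment.char = "",
                      colClasses = c(rep("numeric", 10), "character"))
  # create the table on the first file, append on the rest
  dbWriteTable(con, "runs", piece, append = dbExistsTable(con, "runs"))
}

# later, pull only the rows you actually need into R
df <- dbGetQuery(con, "SELECT * FROM runs WHERE some_column = 20")
dbDisconnect(con)

The import step here still uses read.table for simplicity; sqldf's read.csv.sql or data.table::fread could stand in for it if the one-off import itself proves too slow.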


As Justin and Joran both spotted, incrementally growing a data frame in a loop using rbind(df, read_dat(files[f])) is a huge bottleneck. Where the full dataset fits in RAM, a far better approach is to use do.call(rbind, lapply(files, read.table)). (Where it doesn't, use the above method of reading everything into a database and pulling just what you need into R.)

  • I think those are good suggestions, but I'm not sure I agree that his fundamental problem is really `read.table`. I would argue that his current bottlenecks are (in descending order): growing the df via `rbind`, calling `transform` 7000 times (i.e. doing all the transformations repeatedly rather than once) and finally touching his disk 7000 times. I wouldn't be surprised if a good solution using read.table was at least usably fast. – joran Nov 08 '13 at 16:37
  • I'll try both approaches, i.e. suggestions for speeding up read.table and Richie's suggestion of reading from a database. The frequent disk access might not be a huge issue as I'm using a fast new SSD. – Gregory Nov 08 '13 at 17:12
  • @joran Agreed. I missed the line about growing `df` in a loop. Well spotted. – Richie Cotton Nov 09 '13 at 23:35