
I have tab-delimited tables in text files where all lines end with \r\r\n (0x0D 0x0D 0x0A). If I try to read such a file with fread(), it fails with:

Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.

But I am not downloading these files; I already have them.

So far, my solution is to first read the file with read.table() (which treats the \r\r\n combination as a single end of line) and then convert the resulting data.frame with data.table():

mydt <- data.table(read.table(myfilename, header = TRUE, sep = '\t', fill = TRUE))

However, I am wondering whether there is any way to avoid the slow read.table() and use the fast fread() instead.

smci
Vasily A

1 Answer


I suggest using the GNU utility tr to get rid of those unnecessary \r characters. For example:

cat("a,b,c\r\r\n1, 2, 3\r\r\n4, 5, 6", file = "test.csv")
fread("test.csv")
## Error in fread("test.csv") : 
##  Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.

system("tr -d '\r' < test.csv > test2.csv")
fread("test2.csv")
##    a b c
## 1: 1 2 3
## 2: 4 5 6

If you are using Windows and do not have the tr utility, you can get it here.
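
If tr is not available at all, the same byte-level deletion can be sketched in base R. This is only a minimal sketch under some assumptions: `strip_cr` is a hypothetical helper name (it is not part of data.table), the whole file is assumed to fit in memory, and every 0x0D byte is dropped, including any that might sit inside a quoted field.

```r
# Hypothetical helper (not part of data.table): copy a file while
# dropping every carriage-return byte (0x0D), like `tr -d '\r'`.
strip_cr <- function(infile, outfile = tempfile(fileext = ".csv")) {
    # read the file as raw bytes so no newline translation happens
    bytes <- readBin(infile, what = "raw", n = file.size(infile))
    # keep everything except CR bytes and write the result in binary mode
    writeBin(bytes[bytes != as.raw(0x0D)], outfile)
    outfile
}

# Toy file with \r\r\n line endings, written as raw bytes
demo <- tempfile(fileext = ".csv")
writeBin(charToRaw("a,b,c\r\r\n1, 2, 3\r\r\n4, 5, 6"), demo)

clean <- strip_cr(demo)

# Read the cleaned copy with fread() if data.table is installed
if (requireNamespace("data.table", quietly = TRUE)) {
    print(data.table::fread(clean))
}
```

The cleaned copy is an ordinary file with \n line endings, so it can be passed to fread() like any other file.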

Added:

I did some comparisons of three methods, using a 100,000 x 5 sample CSV dataset.

  • OPcsv is the "slow" read.table method
  • freadScan is a method that discards the extra \r characters in pure R
  • freadtr calls GNU tr through the shell using fread() directly.

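The third method is the fastest: by median time in the benchmark below, OPcsv takes about 1.37 times as long and freadScan about 1.67 times as long as freadtr.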

# create a 100,000 x 5 sample dataset with lines ending in \r\r\n
delim <- "\r\r\n"
sample.txt <- paste0("a, b, c, d, e", delim)
for (i in 1:100000) {
    sample.txt <- paste0(sample.txt,
                        paste(round(runif(5)*100), collapse = ","),
                        delim)
}
cat(sample.txt, file = "sample.csv")


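# function that strips the extra \r characters in R only
fread2 <- function(filename) {
    # scan() splits on whitespace (spaces, tabs, \n and \r), so each token is a
    # whole row only when the row contains no spaces; the "a, b, c, d, e" header
    # above does contain spaces and is split into separate tokens
    tmp <- scan(file = filename, what = "character", quiet = TRUE)
    # drop any empty tokens
    tmp <- tmp[tmp != ""]
    # paste the tokens back together with a \n character
    tmp <- paste(tmp, collapse = "\n")
    fread(tmp)
}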

# OP's method using read.table(), which is slow
readcsvMethod <- function(myfilename)
    data.table(read.table(myfilename, header = TRUE, sep = ',', fill = TRUE))

library(data.table)
library(microbenchmark)
microbenchmark(OPcsv = readcsvMethod("sample.csv"),
               freadScan = fread2("sample.csv"),
               freadtr = fread("tr -d '\\r' < sample.csv"),
               unit = "relative")
## Unit: relative
##           expr      min       lq     mean   median       uq      max neval
##          OPcsv 1.331462 1.336524 1.340037 1.365397 1.366041 1.249223   100
##      freadScan 1.532169 1.581195 1.624354 1.673691 1.676596 1.355434   100
##        freadtr 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
Ken Benoit
  • thanks for the solution, but ideally I would like to find a way without using external utilities (and yes, I'm under Windows). – Vasily A Oct 26 '15 at 07:47
  • OK, I added an R-only method plus comparisons. See edited answer. – Ken Benoit Oct 26 '15 at 18:43
  • You can put that system call into `fread()`. Do `fread("cat test.csv | tr -d '\r'")` and you can skip the `system()` step, similar to [this](http://stackoverflow.com/questions/29499145/preventing-column-class-inference-in-fread/29499512#29499512). The quotes around `\r` may not be necessary. – Rich Scriven Oct 26 '15 at 18:51
  • Thanks @RichardScriven, I had just figured out that the `fread()` can take shell commands directly. But it didn't work for me without quotes around the `\r`. – Ken Benoit Oct 26 '15 at 19:11
  • @DirtySockSniffer good tip. It should be added, though, that not only do the quotes seem to be necessary, you also need to double-escape the backslash, i.e. `fread("cat f.csv | tr -d '\\r'")`. That was the only way I got it to work on Linux. – posdef Aug 22 '16 at 14:13
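
The escaping discussed in these comments can be checked in plain R without running any shell command. The sketch below only inspects the strings; the variable names are illustrative. In an R string literal, `"\r"` is a single carriage-return character, while `"\\r"` is a backslash followed by the letter r, which tr then interprets as a carriage return itself.

```r
one_char  <- "\r"    # R converts this into a single carriage-return character
two_chars <- "\\r"   # a literal backslash followed by the letter r

nchar(one_char)      # 1
nchar(two_chars)     # 2

# The command text the shell receives when the backslash is doubled
cmd <- "tr -d '\\r' < sample.csv"
cat(cmd, "\n")       # prints: tr -d '\r' < sample.csv
```

With the doubled backslash, the shell sees the two characters `\r` inside single quotes and passes them unchanged to tr, so the command does not depend on how a raw carriage return inside the command string is handled.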