I'm having problems counting the number of lines in a messy csv.bz2 file.

Since this is a huge file, I want to preallocate a data frame before reading the bzip2 file with the read.csv() function.

As you can see in the following tests, my results vary widely, and none of them corresponds to the actual number of rows in the csv.bz2 file.

> system.time(nrec1 <- as.numeric(shell('type "MyFile.csv" | find /c ","', intern=T)))
   user  system elapsed 
   0.02    0.00   53.50 
> nrec1
[1] 1060906
> system.time(nrec2 <- as.numeric(shell('type "MyFile.csv.bz2" | find /c ","', intern=T)))
   user  system elapsed 
   0.00    0.02   10.15 
> nrec2
[1] 126715
> system.time(nrec3 <- as.numeric(shell('type "MyFile.csv" | find /v /c ""', intern=T)))
   user  system elapsed 
   0.00    0.02   53.10 
> nrec3
[1] 1232705
> system.time(nrec4 <- as.numeric(shell('type "MyFile.csv.bz2" | find /v /c ""', intern=T)))
   user  system elapsed 
   0.00    0.01    4.96 
> nrec4
[1] 533062

The most interesting result is the one I called nrec4, since it takes almost no time and returns roughly half the number of rows of nrec1, but I'm totally unsure whether the naive multiplication by 2 will be reliable.
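
A shell-only alternative I'm considering is to decompress to stdout first and count the real lines from there (a sketch assuming 7-Zip's 7z.exe is on the PATH; nrec5 is just a hypothetical name, and I haven't timed this):

# Hypothetical nrec5: stream the decompressed data with 7-Zip and count
# every line, as in nrec3, without writing the decompressed file to disk.
system.time(nrec5 <- as.numeric(
  shell('7z e -so "MyFile.csv.bz2" | find /v /c ""', intern = T)))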

I have tried several other methods, including fread() and hsTableReader(), but the former crashes and the latter is so slow that I won't even consider it further.

My questions are:

  1. What reliable method can I use to count the number of rows in a csv.bz2 file? (The exact-but-slow options I know are sketched above and below.)
  2. Is it OK to use a formula to estimate the number of rows directly from a csv.bz2 file, without decompressing it?
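
For completeness, the pure-R exact count I'm trying to beat streams the file through a bzfile() connection (a minimal sketch; exact, but slow on a file this size). I suspect the counts taken against the .bz2 itself (nrec2, nrec4) are unstable because type pipes raw compressed bytes, so find ends up counting byte patterns in binary data, which is why I don't fully trust the factor of 2:

# Exact count in pure R: stream the compressed file and count lines in
# 100,000-line chunks, so nothing is decompressed to disk or held whole.
con <- bzfile("MyFile.csv.bz2", open = "r")
nrec <- 0L
while (length(chunk <- readLines(con, n = 100000L)) > 0L) {
  nrec <- nrec + length(chunk)
}
close(con)
nrec  # exact number of lines, including the header row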

Thanks in advance,

Diego

  • Why do you think you need to preallocate a data.frame for `read.csv`? – Roland May 27 '14 at 10:39
  • I don't have to. But when you preallocate your data frames everything works a lot faster and consumes fewer resources. – Diego May 27 '14 at 10:42
  • Are you reading the file in chunks using a loop? Because otherwise this doesn't make sense. `scan` preallocates internally. – Roland May 27 '14 at 10:48
  • No loops. Why do you say this doesn't make sense? AFAIK R can't read the whole file at once, only in chunks, and in my tests I can see the difference when I preallocate the data frame. – Diego May 27 '14 at 10:55
  • Please show in your question what *exactly* you are doing. I really can't tell. – Roland May 27 '14 at 10:58
  • Thank you Roland, but I don't know what else to show. As I said, I'm just using the read.csv() function, which takes too long to import a csv.bz2 file, and I want to improve the performance by preallocating my data frame. For that (and even for the performance of scan) I need to know the number of rows. So that's my question: how can I effectively get the number of rows of a csv.bz2 file using the shell? – Diego May 27 '14 at 11:01
  • `read.csv` has a `colClasses` parameter. Use it. If it's still too slow, you can try using the `nrows` parameter. There is no reason to preallocate a data.frame. [This FAQ](http://stackoverflow.com/q/1727772/1412059) is relevant. – Roland May 27 '14 at 11:07
  • I use that parameter, Roland. I don't see how the post you linked is relevant to my question... maybe if you explain further I can understand. But well, I disagree with the notion that there is no reason to preallocate a data frame. As I said, I have done the tests and can really tell things work a lot faster with less overhead. Thanks again. – Diego May 27 '14 at 11:13
  • Without seeing the exact benchmarking code I cannot comment on your claims. I find it hard to believe them. – Roland May 27 '14 at 11:15
  • Don't trust me... do some tests with a lot of data and you'll see it by yourself :) – Diego May 27 '14 at 11:17
  • 1
    I trust R-core, the huge community that imports data into R and my own experience. Thus, I don't believe you until you show proof. Preallocating a data.frame should absolutely not be necessary unless you are doing something you don't show (such as `rbind`ing data.frames). But since you show nothing ... – Roland May 27 '14 at 11:20
  • I'm with @Roland. Pre-allocating the data frame should make no difference at all. What will make a difference is sending `nrows=...` to `read.csv()`. It's possible that this is what you are doing (and mean) but it's not clear from your question. From the help: `Using nrows, even as a mild over-estimate, will help memory usage.` So, in my view, use the fastest way to get a rough over-estimate of `nrows` and use that. – Andrie May 27 '14 at 11:47
  • Fair enough, Andrie, and you understood perfectly what I'm doing. Maybe I'm wrong, so correct me on this: when you use nrows, IMO you are actually preallocating a data frame even if you didn't initialize it, since read.csv() reads the contents of a csv file into a data frame. But yes, I was initializing my data frame and using nrows, so I couldn't tell exactly which part was optimizing the performance. Thanks for that. What do you think about the naive multiplication by 2? Is it something I can rely upon? – Diego May 27 '14 at 18:29

1 Answer

Roland was right from the beginning. Even when I forced a garbage collection between runs, the illusion of improved performance remained; I had to close and restart R to get an accurate test. Yes, with nrows the process is still a little faster, by a few seconds (red line), and the increase in RAM consumption is more uniform. But at least in this case it is not worth the effort of trying to optimize the read.csv() call. It is slow, but it is what it is. If someone knows of a faster approach, I'm interested (fread() crashes, in case anyone suggests it).

Thanks.


Without nrows (Blue Line)

Sys.time()
# Read the whole compressed file in one go; read.csv() decompresses
# .bz2 files transparently when given the file name.
system.time(storm.data <- read.csv(fileZip,
                                   header = TRUE,
                                   stringsAsFactors = F,
                                   comment.char = "",
                                   colClasses = "character"))
Sys.time()
rm(storm.data)  # drop the data frame
gc()            # and force garbage collection before the next run

With nrows (Red Line)

Sys.time()
# Row-count estimate: count line-break bytes in the *compressed* file
# (fast, but it scans binary data rather than real CSV rows).
system.time(nrec12 <- as.numeric(
  shell('type "MyFile.csv.bz2" | find /v /c ""',
        intern = T)))

# Naive correction: in my tests the compressed count came out at roughly
# half the real row count, so double it for a mild over-estimate of nrows.
nrec12 <- nrec12 * 2
system.time(storm.data <- read.csv(fileZip,
                                   stringsAsFactors = F,
                                   comment.char = "",
                                   colClasses = "character",
                                   nrows = nrec12))
Sys.time()
rm(storm.data)
gc()

[Figure: time spent comparison]

[Figure: RAM consumption comparison]
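
Update: one approach I have not benchmarked myself is the readr package, which decompresses .csv.bz2 files transparently and does not need a row-count estimate up front (a minimal sketch assuming readr is installed, with the same fileZip path as above):

# Sketch, not timed here: read every column as character, matching the
# colClasses = "character" setting used with read.csv() above.
library(readr)
storm.data <- read_csv(fileZip,
                       col_types = cols(.default = col_character()))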
