
I have a 14GB data.txt file. I compared the speed of `fread` and `read.table` by reading the first 1M rows. `fread` appears to be much slower, although it is not supposed to be, and it takes some time before the percentage counter shows up.

What could be the reason? I thought it was supposed to be super fast... I am using a Windows OS computer.

Matt Dowle
KTY
  • Define "much slower" - if it's measured in microseconds then I wouldn't be losing sleep. Also, without example code no one can verify what you're doing. – thelatemail Aug 28 '15 at 05:04
  • @thelatemail: I have a data table with 100M rows and 60 columns, which is 14 GB. Reading the first 1M rows takes 1.5-2 mins (there is a wait until the percentage count shows), whereas `read.table` takes less than a minute. Irrespective of this comparison, I have been hearing from others that `fread` reads their 4GB table in 40 sec. There is something wrong that I can't figure out. – KTY Aug 28 '15 at 05:13
  • This is the code I use: `data=read.table('data.txt',sep=',',nrow=1000000,header=TRUE,stringsAsFactors=FALSE); data=fread('data.txt',sep=',',nrow=1000000)` – KTY Aug 28 '15 at 05:14

1 Answer


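`fread` memory-maps (mmaps) the file. The mapping covers the whole file and takes some time to set up, which likely accounts for the wait before the percentage counter appears. Once the file is mapped, subsequent "read-ins" will be faster.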

`read.table` does not mmap the whole file. It can read the file line by line and stop at line 1000000.

You can find some background on mmap in the question "mmap() vs. reading blocks".

The examples in the help for `fread` highlight this behaviour.
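To check this on your own machine, here is a minimal timing sketch. It assumes the `data.table` package is installed and uses a small temporary file as a hypothetical stand-in for the 14GB `data.txt`; the names `tmp`, `demo`, `n` and `n_head` are made up for illustration. Timings will vary with the data.table version, the disk and the OS, so treat the output as indicative only.

```r
library(data.table)

# Hypothetical stand-in for data.txt: a small comma-separated file with a header.
tmp <- tempfile(fileext = ".txt")
n <- 200000L
demo <- data.frame(id = seq_len(n),
                   x  = rnorm(n),
                   y  = sample(letters, n, replace = TRUE))
write.table(demo, tmp, sep = ",", row.names = FALSE, quote = FALSE)

n_head <- 1000L  # number of leading rows to read

# Base R: reads from the top and stops after n_head rows.
t_base <- system.time(
  a <- read.table(tmp, sep = ",", header = TRUE, nrows = n_head,
                  stringsAsFactors = FALSE)
)

# fread with the same row limit.
t_fread_head <- system.time(
  b <- fread(tmp, sep = ",", nrows = n_head)
)

# fread on the whole file, which is the case it is mainly tuned for.
t_fread_full <- system.time(
  full <- fread(tmp, sep = ",")
)

# Elapsed seconds for each approach.
print(c(read.table_head = t_base[["elapsed"]],
        fread_head      = t_fread_head[["elapsed"]],
        fread_full      = t_fread_full[["elapsed"]]))

unlink(tmp)
```

On a file this small the differences may be negligible; the gap described in the question should only become visible when the file is much larger than the portion being read.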

Community
mnel
  • So if I will read the file only once, can we say that using `fread` won't give much of an advantage? – KTY Aug 28 '15 at 05:22
  • @KTY, if you are only trying to read in the first million lines, and only once, then you may have found a case where `fread` won't give an advantage. If you want to read the whole file, or read the rest of the lines in subsequently, then `fread` should almost definitely be faster. – mnel Aug 28 '15 at 05:24
  • Yes, it seems the main difference shows up when reading big files. Reading the whole 14GB file now, it is very fast compared to `read.table`. Thanks for the information on `mmap`. – KTY Aug 28 '15 at 05:37
  • @KTY We could speed up reading the first N rows. It just wasn't a priority, as normally you want to read the whole file. I filed a feature request [#1300](https://github.com/Rdatatable/data.table/issues/1300). – Matt Dowle Aug 28 '15 at 22:43