
I am reading a large file (~30 GB) in chunks and have noticed that most of the time is taken by performing a line count on the entire file.

```
Read 500000 rows and 49 (of 49) columns from 28.250 GB file in 00:01:09
   4.510s (  7%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
  53.890s ( 79%) Count rows (wc -l)
   0.010s (  0%) Column type detection (first, middle and last 5 rows)
   0.120s (  0%) Allocation of 500000x49 result (xMB) in RAM
   9.780s ( 14%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.060s (  0%) Changing na.strings to NA
  68.370s        Total
```

Is it possible to tell `fread` not to do a full row count every time I read a chunk, or is this a necessary step?

EDIT: Here is the exact command I am running:

```r
fread(pfile, skip = 5E6, nrows = 5E5, sep = "\t", colClasses = rpColClasses,
      na.strings = c("NA", "N/A", "NULL"), head = FALSE, verbose = TRUE)
```
mlegge
  • You might try the lower level `scan`; see the sketch after these comments. – A. Webb Jul 30 '15 at 20:36
  • It's not clear which version you're running. I presume it's 1.9.4. Could you please try [1.9.5](https://github.com/Rdatatable/data.table/wiki/Installation)? [This commit](https://github.com/Rdatatable/data.table/commit/e15facdaac1f5d8bf89108580507972ddf5582ae) seems to handle it exactly as you mention. – Arun Sep 07 '15 at 17:06
  • I don't believe `fread` uses `wc -l` any longer, do you still have access to the file & could you re-run? – MichaelChirico Sep 11 '18 at 10:16
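
A minimal sketch of the `scan` route suggested above. The 49 all-character columns are an assumption purely for illustration; `scan` still has to skip over the leading lines, but it does not count the rows of the whole file first:

```r
## sketch: read one 500,000-row slice with scan(), no full-file row count
cols  <- rep(list(character()), 49)           # assume 49 character columns
slice <- scan("myfile.tsv", what = cols, sep = "\t",
              skip = 5e6, nlines = 5e5, quiet = TRUE)
slice <- as.data.frame(slice, stringsAsFactors = FALSE)
```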

1 Answer


I'm not sure you can "turn off" the `wc -l` step in `fread`. That notwithstanding, I have two answers for you.

Answer 1: Use the Unix command `split` to break the large data set into chunks before calling `fread`. I find that knowing a bit of Unix goes a long way when handling big data sets (i.e. data that does not fit into RAM).

```sh
split -l 1000000 myfile.csv chunk_   # 1,000,000-line chunks; -l keeps rows intact (-b splits on bytes and can cut a row in half)
```
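
You can then process the pieces one at a time from R. A minimal sketch (the `chunk_` prefix matches the `split` call above; the processing step is a placeholder):

```r
library(data.table)

chunk_files <- list.files(pattern = "^chunk_")

for (f in chunk_files) {
  dt <- fread(f, sep = "\t", header = FALSE)
  ## ... analyze dt and keep only the summary ...
  rm(dt)
}
```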

Answer 2: Use connections. Unfortunately, this approach does not work with `fread`, which does not accept a connection. Check out my previous post, Strategies for reading in CSV files in pieces?, to understand what I mean by using connections.
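
A minimal sketch of the connection idea with base `read.table` (the file name, tab separator, and 5e5 chunk size are assumptions). The point is that an open connection keeps its position between calls, so no pass over the file is ever repeated:

```r
con <- file("myfile.tsv", open = "r")
## read the header line once
header <- strsplit(readLines(con, n = 1L), "\t")[[1]]

repeat {
  chunk <- tryCatch(
    read.table(con, sep = "\t", nrows = 5e5, header = FALSE,
               col.names = header),
    error = function(e) NULL   # read.table errors once the connection is exhausted
  )
  if (is.null(chunk) || nrow(chunk) == 0L) break
  ## ... process `chunk`, keeping only what you need ...
}
close(con)
```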

Jacob H
  • Probably want to mention the 'append' parameter for the `write.*` functions. – IRTFM Jul 30 '15 at 23:24
  • @BondedDust I don't see how the append option in `write.*` is helpful in this context. Could you please elaborate? – Jacob H Jul 31 '15 at 02:22
  • I assumed that when you split the input into segments that you were trying to construct a large csv file that didn't fit into memory. Wrong? – IRTFM Jul 31 '15 at 02:33
  • @BondedDust I assume that mkemp6 is trying to load data in a chunk, analyze the data, remove the chunk from memory, save the results in the local workspace, and repeat. I don't think, though I might be wrong, that mkemp6 is loading data in chunks only to recombine those chunks (maybe after culling out superfluous data). If he were doing that, then yes, your comment makes sense. Of course, for simple cases the Unix command `cut` is great for culling unnecessary information from large data sets. – Jacob H Jul 31 '15 at 02:54