32

I'm trying to read in a large tab-delimited file (around 2 GB) using the fread function in package data.table. However, because it's so large, it doesn't fit completely in memory. I tried to read it in chunks by using the skip and nrows arguments, like so:

chunk.size = 1e6
done = FALSE
chunk = 1
while(!done)
{
    temp = fread("myfile.txt", skip = (chunk-1)*chunk.size, nrows = chunk.size)
    # do something to temp
    chunk = chunk + 1
    if(nrow(temp) < chunk.size) done = TRUE
}

In the case above, I'm reading in 1 million rows at a time, performing a calculation on them, and then getting the next million, etc. The problem with this code is that after every chunk is retrieved, fread has to start scanning the file from the very beginning, since skip increases by a million after each loop iteration. As a result, fread takes longer and longer to actually reach each successive chunk, which makes this very inefficient.

Is there a way to tell fread to pause every say 1 million lines, and then continue reading from that point on without having to restart at the beginning? Any solutions, or should this be a new feature request?

FBC
  • There's a similar FR [here](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2605&group_id=240&atid=978). I'll also link to this post. – Arun Nov 11 '13 at 08:12
  • Thanks for pointing this out and linking! Looks like a top priority FR. – FBC Nov 12 '13 at 05:05
  • I wanted to do the same thing; I think it needs to be a new request. – xiaodai Dec 11 '14 at 04:36
  • @Arun Is there a FR on the new github page? I can't seem to find it – Zach Aug 20 '15 at 22:55
  • is this possible by now? – EDC Nov 22 '17 at 04:53
  • The other replies don't answer the Q as asked, which wants to use "fread in package data.table", presumably for its other great benefits. A reply which _might_ arguably come closer to the OP's intent is to suggest a solution using GNU parallel, which can chunk the data as desired, including repeating the header row for each chunk, but it would suffer from (a) the overhead of multiple R processes, (b) possibly having to re-infer the data types for each column, potentially differently between processes, and (c) you gotta learn GNU parallel (well worth it). – malcook Mar 18 '18 at 08:26
  • @FBC fread now accepts text as input, so you could either use fread or readLines to read in your chunk and use that as input to fread (again). – paul_dg May 25 '19 at 13:49
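
For reference, a rough sketch of that last suggestion (assuming a tab separator and the question's "myfile.txt"): keep one connection open so each readLines() call continues where the previous one stopped, then hand the lines to fread() via its text argument.

library(data.table)

con <- file("myfile.txt", open = "r")
header <- readLines(con, n = 1)        # keep the column names for every chunk
repeat {
  lines <- readLines(con, n = 1e6)     # next 1e6 raw lines; no re-scan from the start
  if (length(lines) == 0) break
  temp <- fread(text = c(header, lines), sep = "\t")
  # do something to temp
}
close(con)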

4 Answers

22

You should use the LaF package. This introduces a sort of pointer on your data, thus avoiding the (for very large data) annoying behaviour of reading the whole file. As far as I understand it, fread() in the data.table package needs to know the total number of rows, which takes time for GB-sized data. Using the pointer in LaF you can go to any line you want, read a chunk of data, apply your function to it, and then move on to the next chunk. On my small PC I ran through a 25 GB csv file in steps of 10e6 lines and extracted the ~5e6 observations I needed in total; each 10e6 chunk took 30 seconds.

UPDATE:

library('LaF')
huge_file <- 'C:/datasets/protein.links.v9.1.txt'

#First detect a data model for your file:
model <- detect_dm_csv(huge_file, sep=" ", header=TRUE)

Then create a connection to your file using the model:

df.laf <- laf_open(model)

Once that is done you can do all sorts of things without needing to know the size of the file, as you do with the data.table package. For instance, place the pointer at line number 100e6 and read 1e6 lines of data from there:

goto(df.laf, 100e6)
data <- next_block(df.laf,nrows=1e6)

Now data contains 1e6 lines of your CSV file (starting from line 100e6).

You can read in chunks of data (size depending on your memory) and only keep what you need. E.g. the huge_file in my example points to a file with all known protein sequences and has a size of >27 GB, way too big for my PC. To get only the human sequences I filtered on the organism id, which is 9606 for human and appears at the start of the variable protein1. A dirty way is to put it into a simple for-loop and just read one data chunk at a time:

library('dplyr')
library('stringr')

# empty data frame with the same columns as the file, to collect the results
res <- df.laf[1, ][0, ]

for(i in 1:10){
  raw <- next_block(df.laf, nrows = 100e6) %>% 
    filter(str_detect(protein1, "^9606\\."))
  res <- rbind(res, raw)
}

Now res contains the filtered human data. But better, and for more complex operations such as calculations on the data on the fly, the function process_blocks() takes a function as an argument, so in that function you can do whatever you want with each piece of data. Read the documentation.
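
For example, here is a minimal process_blocks() sketch (using the df.laf connection from above; the callback convention of receiving each block plus the previous result is from my reading of the LaF documentation, so double-check ?process_blocks):

# Count the human (9606) rows chunk by chunk.
# fun is called with each block and the result of the previous call (NULL at first);
# after the last block it is called once more with a zero-row block.
count_human <- function(block, result) {
  if (is.null(result)) result <- 0
  if (nrow(block) == 0) return(result)
  result + sum(grepl("^9606\\.", block$protein1))
}

n_human <- process_blocks(df.laf, count_human, nrows = 1e6)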

user3375672
  • Yeah, please see the UPDATE. – user3375672 Oct 14 '15 at 21:52
  • Thanks for this. I had an 872493862-line 61 GB file and it worked reasonably fast. I tried the same looping approach with fread() using "nrows" and "skip" but it became slower and slower in each loop as it had to skip more rows. – Adam Waring Mar 16 '20 at 15:08
10

You can use readr's read_*_chunked to read in data and, e.g., filter it chunkwise. For example:

library(readr)

# Cars with 3 gears
f <- function(x, pos) subset(x, gear == 3)
read_csv_chunked(readr_example("mtcars.csv"), DataFrameCallback$new(f), chunk_size = 5)
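
Applied to the question's tab-delimited file, it could look roughly like this (the file name is taken from the question; some_column is a placeholder for whatever per-chunk filter or calculation you actually need):

library(readr)
library(dplyr)

f <- function(x, pos) filter(x, some_column > 0)   # placeholder per-chunk operation
res <- read_tsv_chunked("myfile.txt", DataFrameCallback$new(f), chunk_size = 1e6)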
Rentrop
8

A related option is the chunked package. Here is an example with a 3.5 GB text file:

library(chunked)
library(tidyverse)

# I want to look at the daily page views of Wikipedia articles
# before 2015... I can get zipped log files
# from here: https://dumps.wikimedia.org/other/pagecounts-ez/merged/2012/2012-12/
# I get a bz file, unzip to get this: 

my_file <- 'pagecounts-2012-12-14/pagecounts-2012-12-14'

# How big is my file?
print(paste(round(file.info(my_file)$size  / 2^30,3), 'gigabytes'))
# [1] "3.493 gigabytes" too big to open in Notepad++ !
# But can read with 010 Editor

# look at the top of the file 
readLines(my_file, n = 100)

# to find where the content starts, vary the skip value, 
read.table(my_file, nrows = 10, skip = 25)

This is where we start working on the file in chunks; we can use most dplyr verbs in the usual way:

# Let the chunked pkg work its magic! We only want the lines containing 
# "Gun_control". The main challenge here was identifying the column
# header
df <- 
read_chunkwise(my_file, 
               chunk_size=5000,
               skip = 30,
               format = "table",
               header = TRUE) %>% 
  filter(stringr::str_detect(De.mw.De.5.J3M1O1, "Gun_control"))

# this line does the evaluation, 
# and takes a few moments...
system.time(out <- collect(df))

And here we can work on the output as usual, since it's much smaller than the input file:

# clean up the output to separate into cols, 
# and get the number of page views as a numeric
out_df <- 
out %>% 
  separate(De.mw.De.5.J3M1O1, 
           into = str_glue("V{1:4}"),
           sep = " ") %>% 
  mutate(V3 = as.numeric(V3))

 head(out_df)
    V1                                                        V2   V3
1 en.z                                               Gun_control 7961
2 en.z Category:Gun_control_advocacy_groups_in_the_United_States 1396
3 en.z          Gun_control_policy_of_the_Clinton_Administration  223
4 en.z                            Category:Gun_control_advocates   80
5 en.z                         Gun_control_in_the_United_Kingdom   68
6 en.z                                    Gun_control_in_america   59
                                                                                 V4
1 A34B55C32D38E32F32G32H20I22J9K12L10M9N15O34P38Q37R83S197T1207U1643V1523W1528X1319
2                                     B1C5D2E1F3H3J1O1P3Q9R9S23T197U327V245W271X295
3                                     A3B2C4D2E3F3G1J3K1L1O3P2Q2R4S2T24U39V41W43X40
4                                                            D2H1M1S4T8U22V10W18X14
5                                                             B1C1S1T11U12V13W16X13
6                                                         B1H1M1N2P1S1T6U5V17W12X12

#--------------------
Ben
7

fread() can definitely help you read the data in chunks.

The mistake you have made in your code is that you should keep nrow constant while you change the skip parameter during the loop.

Something like this is what I wrote for my data:

library(data.table)

data = list()

for (i in 0:20){
    data[[i+1]] = fread("my_data.csv", nrows = 10000, select = c(1, 2:100), skip = 10000*i)
}

And you may insert the following code in your loop:

start_time <- Sys.time()
#####something!!!!

end_time <- Sys.time()

end_time - start_time

to check that each loop, on average, takes a similar amount of time.

Then you could use another loop to combine your data by rows with the default rbind function in R.

The sample code could be something like this:

new_data = data[[1]]

for (i in 1:20){
    new_data = rbind(new_data, data[[i+1]], use.names = FALSE)
}

to unify into a large dataset.

Hope my answer helps with your question.

I loaded 18 GB of data with 2k+ columns and 200k rows in about 8 minutes using this method.

Michael Miau