32

I am trying to read a large csv file into R. I only want to read and work with some of the rows that fulfil a particular condition (e.g. Variable2 >= 3). This is a much smaller dataset.

I want to read these lines directly into a dataframe, rather than load the whole dataset into a dataframe and then select according to the condition, since the whole dataset does not easily fit into memory.

smci
Hernan
  • 1
    How 'large' is your file? Can't you read 1000 lines at a time and subset them according to your condition in a loop? – Paulo E. Cardoso Apr 21 '14 at 12:37
  • Yes, I tried that. I was reading 1 million lines at a time. Each iteration was taking about 15 seconds, including adding the resulting "filtered" dataset to an existing dataframe. But given the size of the dataset I am dealing with, this solution would have taken more than 1 hour. As I wrote below, the solution I actually used (sqldf) took slightly less than one hour. Dirk's suggestion of awk for my problem would have taken about 2 hours. I will look into Python to accelerate some of these tasks. If someone has good pointers, let me know. Thanks everyone. – Hernan Apr 22 '14 at 09:08
  • Thanks. Nice to know it worked better with sqldf. Definitely something to retain. – Paulo E. Cardoso Apr 22 '14 at 09:17
  • This is a very common question, but the 'best' answer for each use-case depends on what sort of condition, is it simple number/string-matching on one or more fields (use grep/awk), or does it require evaluating multiple fields (e.g. `V2*V3 < mean(V4) & !is.na(V5)`)? If a simple grep/awk gets 90+% of the coarse size reduction done, it's one good way to go. – smci Apr 14 '19 at 23:55

5 Answers

36

You could use the read.csv.sql function in the sqldf package and filter using SQL select. From the help page of read.csv.sql:

library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv", 
    sql = "select * from file where `Sepal.Length` > 5", eol = "\n")
Karsten W.
  • How does this solve the "file too big for current memory" problem? – Carl Witthoft Apr 21 '14 at 12:42
  • 3
    The sqlite database used under the hood is by default a temporary file, hence no memory problem, I suppose. – Karsten W. Apr 21 '14 at 12:44
  • 1
    This was the fastest way to solve my problem in R. It took about 1 hour. Thanks! – Hernan Apr 22 '14 at 09:05
  • At least in my version of R (3.4.2) running on Linux, the above example did not work without backticks around `Sepal.Length`, i.e. I needed to use `\`Sepal.Length\``. – owen88 Dec 10 '17 at 09:10
26

By far the easiest (in my book) is to use pre-processing.

R> DF <- data.frame(n=1:26, l=LETTERS)
R> write.csv(DF, file="/tmp/data.csv", row.names=FALSE)
R> read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($1 > 20) print $0}' /tmp/data.csv"),
+           header=FALSE)
  V1 V2
1 21  U
2 22  V
3 23  W
4 24  X
5 25  Y
6 26  Z
R> 

Here we use awk. We tell awk to use a comma as the field separator, and then use the condition 'if the first field is greater than 20' to decide whether to print the whole line (via $0).

The output from that command can be read by R via pipe().

This is going to be faster and more memory-efficient than reading everything into R.
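
The same awk pre-filter can also be handed to data.table::fread() through its cmd argument, as the benchmark answer below does. A sketch for the question's own condition (the file name big_file.csv and the assumption that Variable2 is the second column are mine, not part of the original answer):

library(data.table)
# awk keeps the header line (NR == 1) plus rows whose second field is >= 3;
# fread() then parses only the surviving lines
small <- fread(cmd = "awk 'BEGIN {FS=\",\"} NR == 1 || $2 >= 3 {print $0}' big_file.csv")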

Dirk Eddelbuettel
  • 1
    And will this work with a DF of millions of lines as well? – Paulo E. Cardoso Apr 21 '14 at 15:40
  • 2
    Yes, sure. You only read the percentage satisfying the condition. `awk` processes this one line at a time. – Dirk Eddelbuettel Apr 21 '14 at 15:42
  • This is interesting but I can't get it working from RStudio with your simple example. Something with the awk filter but not sure (no lines available in input)... – Paulo E. Cardoso Apr 21 '14 at 15:49
  • You need `awk` in your path, and I assumed a Unix layout here with `/tmp`. You need to adjust that as needed. – Dirk Eddelbuettel Apr 21 '14 at 15:50
  • Sure, that works too. In fact, on my Linux system `awk` is in fact `gawk`. – Dirk Eddelbuettel Apr 21 '14 at 16:14
  • Awesome @DirkEddelbuettel, could you give me the name of your book? And second, I have a similar problem but with a .txt file: the first 300 rows contain SQL errors, and after those the data is tab-separated, but when I tried to read it I only got a dataframe of one column. Could you suggest any way to solve this? Thanks. – Duck Apr 21 '14 at 17:25
  • I can run a modified string from the command prompt: `"awk "BEGIN {FS=\",\"} {if ($1 > 20) print $0}" /tmp/data.csv"` works for me after changing the quotes from `'` to `"`, but I can't find a way to run it from R with either `pipe` or `system`. – Paulo E. Cardoso Apr 21 '14 at 21:26
  • I cut and pasted the actual command from a working session. I use the outer "" for the string R passes as an argument to `pipe()`. Inside that string, use single quotes for the argument to awk. Inside the single quotes, escape the double quotes for the FS argument. – Dirk Eddelbuettel Apr 21 '14 at 21:29
  • @DirkEddelbuettel This is very helpful! Where can I find more info about `awk`? – Shambho Apr 21 '14 at 22:15
  • Hi @DirkEddelbuettel, Thanks so much for this answer. I was not familiar with awk but this works. I accepted the answer below because it ended up being a bit faster for my particular problem (no idea why). But I wanted to thank you for always teaching us newbies something useful. I bought your Rcpp book last week. ;) – Hernan Apr 22 '14 at 09:04
  • Another external tool `csvtk` from https://bioinf.shenwei.me/csvtk may also be tried. – San Aug 09 '21 at 10:24
20

I was looking into readr::read_csv_chunked when I saw this question and thought I would do some benchmarking. For this example, read_csv_chunked does well and increasing the chunk size was beneficial. sqldf was only marginally faster than awk.

library(tidyverse)
library(sqldf)
library(data.table)
library(microbenchmark)

# Generate an example dataset with two numeric columns and 5 million rows
tibble(
  norm = rnorm(5e6, mean = 5000, sd = 1000),
  unif = runif(5e6, min = 0, max = 10000)
) %>%
  write_csv('medium.csv')

microbenchmark(
  readr  = read_csv_chunked('medium.csv', callback = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)), col_types = 'dd', progress = F),
  readr2 = read_csv_chunked('medium.csv', callback = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)), col_types = 'dd', progress = F, chunk_size = 1000000),
  sqldf  = read.csv.sql('medium.csv', sql = 'select * from file where unif > 9000', eol = '\n'),
  awk    = read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv")),
  awk2   = read_csv(pipe("awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv"), col_types = 'dd', progress = F),
  fread  = fread(cmd = "awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv"),
  check  = function(values) all(sapply(values[-1], function(x) all.equal(values[[1]], x))),
  times  = 10L
)

# Updated 2020-05-29

# Unit: seconds
#   expr   min    lq  mean  median    uq   max neval
#  readr   2.6   2.7   3.1     3.1   3.5   4.0    10
# readr2   2.3   2.3   2.4     2.4   2.6   2.7    10
#  sqldf  14.1  14.1  14.7    14.3  15.2  16.0    10
#    awk  18.2  18.3  18.7    18.5  19.3  19.6    10
#   awk2  18.1  18.2  18.6    18.4  19.1  19.4    10
#  fread  17.9  18.0  18.2    18.1  18.2  18.8    10

# R version 3.6.2 (2019-12-12)
# macOS Mojave 10.14.6        

# data.table 1.12.8
# readr      1.3.1 
# sqldf      0.4-11
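
For reference, here is the chunked-filter pattern on its own, outside the benchmark. It is a sketch using the question's condition; the file name big_file.csv and the chunk size are assumptions. Each chunk is filtered by the callback before the pieces are bound together, so the full file never sits in memory at once:

library(readr)
# keep only the rows meeting the condition from each chunk
keep_rows <- function(chunk, pos) subset(chunk, Variable2 >= 3)
small <- read_csv_chunked("big_file.csv",
                          callback   = DataFrameCallback$new(keep_rows),
                          chunk_size = 100000)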
Eric
  • 2
    Do you mind adding `data.table::fread(cmd="awk ...")` for benchmarking? – zx8754 May 13 '20 at 12:57
  • 1
    @zx8754 of course! Though, for this example `awk` does the heavy lifting. Is there a chunked reading capability in `data.table`? – Eric May 30 '20 at 04:59
  • 1
    I think another external tool called `csvtk` from https://bioinf.shenwei.me/csvtk/ may also be used with `fread` for testing. – San Aug 09 '21 at 10:22
9

You can read the file in chunks, process each chunk, and then stitch only the subsets together.

Here is a minimal example assuming the file has 1001 lines (incl. the header) and only 100 will fit into memory. The data has 3 columns, and we expect at most 150 rows to meet the condition (this is needed to pre-allocate space for the final data):

# initialize empty data.frame (150 x 3)
max.rows <- 150
final.df <- data.frame(Variable1=rep(NA, max.rows),
                       Variable2=NA,  
                       Variable3=NA)

# read the first chunk outside the loop
temp <- read.csv('big_file.csv', nrows=100, stringsAsFactors=FALSE)
temp <- temp[temp$Variable2 >= 3, ]  ## keep only rows meeting the condition
final.df[1:nrow(temp), ] <- temp     ## add to the data
last.row <- nrow(temp)               ## keep track of how many rows are filled

for (i in 1:9){    ## nine chunks remaining to be read
  temp <- read.csv('big_file.csv', skip=i*100+1, nrows=100, header=FALSE,
                   col.names=names(final.df), stringsAsFactors=FALSE)
  temp <- temp[temp$Variable2 >= 3, ]
  final.df[(last.row+1):(last.row+nrow(temp)), ] <- temp
  last.row <- last.row + nrow(temp)    ## increment the current count
}

final.df <- final.df[1:last.row, ]   ## only keep filled rows
rm(temp)    ## remove last chunk to free memory

Edit: Added stringsAsFactors=FALSE option on @lucacerone's suggestion in the comments.

ilir
  • Out of curiosity: say that while importing I realize that preallocating 150 rows wasn't enough, is there an efficient way to extend the rows of the final data.frame (say by another 150 rows)? – lucacerone Apr 21 '14 at 13:24
  • 2
    Just one small note: I would use the option stringsAsFactors = FALSE when you import the data: it might be that for categorical variables you don't read all the categories in the first chunks, and R won't let you add data for categories it hasn't seen... – lucacerone Apr 21 '14 at 13:31
  • Good catch, thanks! I will edit the solution. To my knowledge there is no efficient way to add lines without R rewriting the whole thing, but if you do extend it (say by another 50) it will still only be one rewrite. – ilir Apr 21 '14 at 13:40
  • I think stringsAsFactors = FALSE should be added to final.df too. I am surprised though that there is no base function to let you do something similar... – lucacerone Apr 21 '14 at 17:29
  • I don't think it's necessary. It will be passed as a `character` value and will not conflict with the other chunks, also `character`. – ilir Apr 21 '14 at 17:42
2

You can open the file in read mode using the function file (e.g. file("mydata.csv", open = "r")).

You can read the file one line at a time using the function readLines with option n = 1, l = readLines(fc, n = 1).

Then you have to parse the string using functions such as strsplit or regular expressions, or you can try the stringr package (available on CRAN).

If the line meets the condition for importing the data, you import it.

To summarize I would do something like this:

df <- data.frame(var1 = character(), var2 = integer(), stringsAsFactors = FALSE)
fc <- file("myfile.csv", open = "r")

i <- 0
while (length(l <- readLines(fc, n = 1)) > 0){ # note: the assignment happens inside the loop condition

   ## parse l here and check whether you need to import the data

   if (need_to_add_data){   # need_to_add_data: your condition on the parsed line
     i <- i + 1
     df[i, ] <- # list of values to import
   }

}
close(fc)
lucacerone
  • This will work, but it'll be relatively slow. It's almost always fastest to edit the source file before reading into `R`, e.g. using a simple text editor or tools like `sed` and `awk` – Carl Witthoft Apr 21 '14 at 12:41
  • So he did, but sometimes ... "Let It Bleed," Track 9 . – Carl Witthoft Apr 21 '14 at 12:44
  • 3
    You could make this a lot faster by reading in (say) 10,000 lines at a time. – hadley Apr 21 '14 at 13:00
  • 1
    @hadley beat me to it. You can read it by chunks. You should also pre-allocate `df` or it will take very long, basically rewriting the entire data on each iteration (millions of times). I've added a proof-of-concept solution. – ilir Apr 21 '14 at 13:09