22

I have a .csv file, example.csv, with 8000 columns x 40000 rows. The csv file has a string header for each column, and all fields contain integer values between 0 and 10. When I try to load this file with read.csv it is extremely slow. It is also very slow when I add the parameter nrows=100. I wonder if there is a way to accelerate read.csv, or to use some other function instead of read.csv, to load the file into memory as a matrix or data.frame?

Thanks in advance.

rninja
  • please share the code you are using with read.csv - there are a lot of options for improving performance, see ?read.table – mdsumner Sep 07 '11 at 01:33

5 Answers

19

If your CSV only contains integers, you should use scan instead of read.csv, since ?read.csv says:

 ‘read.table’ is not the right tool for reading large matrices,
 especially those with many columns: it is designed to read _data
 frames_ which may have columns of very different classes.  Use
 ‘scan’ instead for matrices.

Since your file has a header, you will need skip=1, and it will probably be faster if you set what=integer(). If you must use read.csv and speed / memory consumption are a concern, setting the colClasses argument is a huge help.
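Applied to the dimensions in the question, a minimal sketch might look like this (file name and column count are taken from the question; adjust as needed):

# read everything after the header line as integers, then reshape into a matrix
vals <- scan("example.csv", what = integer(), sep = ",", skip = 1)
m <- matrix(vals, ncol = 8000, byrow = TRUE)

# if read.csv is unavoidable, declaring colClasses (and nrows) avoids type guessing
df <- read.csv("example.csv", colClasses = rep("integer", 8000), nrows = 40000)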

Joshua Ulrich
  • You can add the names of your columns back by reading the single line of the header as a vector with the `readLines()` function and modifying the column names of your matrix (see the sketch after these comments). – John Sep 07 '11 at 02:04
  • Thanks. I just found another wrapper function that makes use of scan(): the read.matrix function in the tseries package. It claims to be faster than read.csv. – rninja Sep 07 '11 at 02:40
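A rough sketch of the column-name trick from the comment above, assuming the same example.csv and the matrix m built with scan:

# read only the header line, split it on commas, and drop any surrounding quotes
hdr <- strsplit(readLines("example.csv", n = 1), ",")[[1]]
colnames(m) <- gsub('"', "", hdr)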
17

Try using data.table::fread(). This is by far one of the fastest ways to read .csv files into R. There is a good benchmark here.

library(data.table)

data <- fread("c:/data.csv")

If you want to make it even faster, you can also read only the subset of columns you want to use:

data <- fread("c:/data.csv", select = c("col1", "col2", "col3"))
Michael
rafa.pereira
  • fread crashes instantly on my data (has a little over a million columns) – shrgm Oct 01 '17 at 19:13
  • This is strange; I would recommend you uninstall and reinstall the library: `remove.packages("data.table"); install.packages("data.table")`. If the problem persists, you might want to consider opening an issue on the project website https://github.com/Rdatatable/data.table/wiki – rafa.pereira Oct 02 '17 at 13:06
  • Instant crashing seems to suggest you don't have sufficient free memory to read the data. – syockit Aug 27 '20 at 02:09
6

Also try Hadley Wickham's readr package:

library(readr) 
data <- read_csv("file.csv")
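Since every column in the question is a small integer, declaring the column types up front lets read_csv skip its type guessing; a minimal sketch (the col_types spec is something to adapt to your file):

# declare all columns as integer so read_csv does not have to guess types
data <- read_csv("file.csv", col_types = cols(.default = col_integer()))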
amc
Cyrus Mohammadian
4

If you'll read the file often, it might well be worth saving it from R in a binary format using the save function. Specifying compress=FALSE often results in faster load times.

...You can then load it in with the (surprise!) load function.

d <- as.data.frame(matrix(1:1e6,ncol=1000))
write.csv(d, "c:/foo.csv", row.names=FALSE)

# Load file with read.csv
system.time( a <- read.csv("c:/foo.csv") ) # 3.18 sec

# Load file using scan
system.time( b <- matrix(scan("c:/foo.csv", 0L, skip=1, sep=','), 
                         ncol=1000, byrow=TRUE) ) # 0.55 sec

# Load (binary) file using load
save(d, file="c:/foo.bin", compress=FALSE)
system.time( load("c:/foo.bin") ) # 0.09 sec
Tommy
  • Whether compression speeds things up depends on multiple factors and can be tested on a per-file / per-machine basis (see the quick test below). HD speed, CPU speed, and the degree of compression achieved all contribute to whether the compressed or uncompressed file is faster to load. In general, uncompressed can be faster when drive speed is good and CPU speed isn't, while the opposite is true for compressed. For example, I'd tend to use compressed writing to USB flash drives on a fast laptop. – John Sep 07 '11 at 02:02
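A quick ad-hoc test of that trade-off, reusing the d from the answer above (file names are placeholders):

# compare load times for compressed vs. uncompressed save files
save(d, file = "c:/foo_comp.bin", compress = TRUE)
save(d, file = "c:/foo_uncomp.bin", compress = FALSE)
system.time( load("c:/foo_comp.bin") )
system.time( load("c:/foo_uncomp.bin") )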
2

It might be worth trying the new vroom package:

vroom is a new approach to reading delimited and fixed width data into R.

It stems from the observation that when parsing files, reading data from disk and finding the delimiters is generally not the main bottleneck. Instead, (re-)allocating memory and parsing the values into R data types (particularly for characters) takes the bulk of the time.

Therefore you can obtain very rapid input by first performing a fast indexing step and then using the ALTREP (ALTernative REPresentations) framework available in R versions 3.5+ to access the values in a lazy / delayed fashion.

This approach potentially also allows you to work with data that is larger than memory. As long as you are careful to avoid materializing the entire dataset at once it can be efficiently queried and subset.

#install.packages("vroom", 
#                 dependencies = TRUE, repos = "https://cran.rstudio.com")
library(vroom)

df <- vroom('example.csv')
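If you only need some of the columns, vroom's col_select argument should avoid materializing the rest; a small sketch (col1 and col2 are placeholder names):

# read only the named columns; the others are never materialized
df_subset <- vroom('example.csv', col_select = c(col1, col2))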

Benchmark: readr vs data.table vs vroom for a 1.57GB file

Tung