
I am importing many (> 300) .csv files in a project, and I stumbled upon a very strange occurrence.

There is a noticeable difference in size when comparing the results of read_csv and read.csv. Windows lists the total size of all the files as ~442 MB.

Using readr

library(tidyverse)

datadir <- "Z:\\data\\attachments"
list_of_files <- list.files(path = datadir, full.names = TRUE)

readr_data <- lapply(list_of_files, function(x) {
  read_csv(x, col_types = cols())
})

object.size(readr_data)
#> 416698080 bytes

str(readr_data[1])
#> List of 1
#>  $ : tibble [2,123 x 80] (S3: spec_tbl_df/tbl_df/tbl/data.frame)

Using base methods

base_data <- lapply(list_of_files, function(x) {
  read.csv(x)
})


object.size(base_data)
#> 393094616 bytes
str(base_data[1])
#> List of 1
#>  $ :'data.frame':    2123 obs. of  80 variables:
# Compare size
object.size(readr_data) / object.size(base_data) * 100
#> 106 bytes

Now, 6% may not seem like much, but that is still 23 MB, and I am still interested in why they differ. Additionally, both are smaller than the size reported by Windows.

Why are the lists of different size, and is that important?

EDIT: Apparently some of the classes are different. I used this method:

readr_class <- sapply(readr_data[[1]], class)

base_class <- sapply(base_data[[1]], class)
result <- data.frame(readr_class, base_class)

And these are the differences:

                 readr_class base_class
var1              numeric    integer
var2              numeric    integer
var3              numeric    integer
var4              character  integer
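
To pin down where the extra bytes come from, the comparison can be extended column by column. Here is a minimal sketch building on the objects above (the name col_sizes is just illustrative, and it assumes both readers return the columns in the same order):

# Size and (first) class of every column of the first file, in both versions
col_sizes <- data.frame(
  base_size   = sapply(base_data[[1]],  object.size),
  readr_size  = sapply(readr_data[[1]], object.size),
  base_class  = sapply(base_data[[1]],  function(x) class(x)[1]),
  readr_class = sapply(readr_data[[1]], function(x) class(x)[1])
)

# Keep only the columns whose storage differs between the two readers
col_sizes[col_sizes$base_size != col_sizes$readr_size, ]

# read_csv also attaches the column specification to the tibble as an
# attribute, which read.csv does not do
readr::spec(readr_data[[1]])
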
mhovd
  • One has row numbers and the other does not? – G5W Nov 06 '20 at 14:16
  • What are the column classes in each? Posting `str(data)` and `str(data_)` (or comparing them yourself) would lend a lot of insight. – Gregor Thomas Nov 06 '20 at 14:17
  • Some columns might have been read as integer or logical (4 byte only) in one tool but not in the other. Did you check storage mode of fields you read with both tools? Or maybe read_csv keeps some metadata in attributes. – jangorecki Nov 06 '20 at 14:17
  • *"both of these are smaller than that reported by Windows."* Yes - CSV is not an efficient storage format. R data structures have some optimizations in place for repeated strings, use appropriate classes for numbers, etc., and should be expected to be at least a little smaller than the CSV representation. – Gregor Thomas Nov 06 '20 at 14:18
  • A minor note: I'd suggest renaming your sample data frames for clarity - `readr_data` and `base_data` would be self-explanatory names. – Gregor Thomas Nov 06 '20 at 14:20
  • @G5W That's a good thought, but I think row names would go the other way... `object.size(mtcars)` is 7208 and `object.size(as_tibble(mtcars))` is 4960 - the tibble with simple row names is smaller, but OP is seeing a change in the opposite direction. – Gregor Thomas Nov 06 '20 at 14:27
  • Thank you for the suggestion @GregorThomas. I will also upload the relevant output of `str()` for both cases. – mhovd Nov 06 '20 at 14:29
  • If your data has many columns, feel free to only post the differences and just note *"the other 200 columns are the same in both versions"* or whatever. – Gregor Thomas Nov 06 '20 at 14:31
  • Are there any quick methods to find the differences in column type, @GregorThomas? I have 80 columns. – mhovd Nov 06 '20 at 14:31
  • Something like `all.equal(lapply(base_data, class), lapply(readr_data, class))` – Gregor Thomas Nov 06 '20 at 14:44

1 Answer


Selecting the right functions is of course very important for writing efficient code. The degree of optimization in different functions and packages affects how objects are stored, how large they are, and how fast operations on them run. Consider the following.

library(data.table)
a <- c(1:1000000)
b <- rnorm(1000000)
mat <- as.matrix(cbind(a, b))
df <- data.frame(a, b)
dt <- data.table::as.data.table(mat)
cat(paste0(
  "Matrix size: ", object.size(mat),
  "\ndf size: ", object.size(df),
  " (", round(object.size(df) / object.size(mat), 2), ")",
  "\ndt size: ", object.size(dt),
  " (", round(object.size(dt) / object.size(mat), 2), ")"
))
Matrix size: 16000568
df size: 12000848 (0.75)
dt size: 4001152 (0.25)

So here you already see that data.table stores the same data in a quarter of the space the matrix takes, and a third of the space of the data.frame. Now about operation speed:

library(microbenchmark)
microbenchmark(df[df$a*df$b>500,], mat[mat[,1]*mat[,2]>500,], dt[a*b>500])
Unit: milliseconds
                             expr       min        lq     mean   median        uq      max neval
          df[df$a * df$b > 500, ] 23.766201 24.136201 26.49715 24.34380 30.243300  32.7245   100
 mat[mat[, 1] * mat[, 2] > 500, ] 13.010000 13.146301 17.18246 13.41555 20.105450 117.9497   100
                  dt[a * b > 500]  8.502102  8.644001 10.90873  8.72690  8.879352 112.7840   100

Comparing the medians, data.table does the filtering about 2.8 times faster than base R on a data.frame, and about 1.5 times faster than subsetting the matrix.

And that's not all: for almost any CSV import, data.table::fread will change your life. Give it a try instead of read.csv or read_csv.
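
As a minimal sketch (fread_data is just an illustrative name), the question's import loop would look like this with fread, reusing the same list_of_files as above:

library(data.table)

fread_data <- lapply(list_of_files, data.table::fread)
object.size(fread_data)

# If every file shares the same columns, the results can also be stacked
# into a single data.table in one step:
# all_data <- rbindlist(lapply(list_of_files, fread))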

IMHO data.table doesn't get half the love it deserves; it is the best all-round package for performance, with a very concise syntax. The data.table vignettes should put you on your way quickly, and that is worth the effort, trust me.

For further performance improvements, Rfast contains many Rcpp implementations of popular functions, such as rowSort().
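
As a quick, rough sketch of what that looks like (assuming Rfast is installed and that rowSort() sorts each row in ascending order by default):

library(Rfast)

m <- matrix(rnorm(1e6), nrow = 1000)

sorted_base  <- t(apply(m, 1, sort))   # base R: sort each row
sorted_rfast <- Rfast::rowSort(m)      # Rcpp-backed equivalent
all.equal(sorted_base, sorted_rfast)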


EDIT: fread's speed is due to optimizations at the C level, involving pointers for memory mapping and coerce-as-you-go techniques, which are frankly beyond my knowledge to explain. This post contains some explanations by its author, Matt Dowle, as well as an interesting, if short, discussion between him and the author of dplyr, Hadley Wickham.
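
If you want to see some of this for yourself, fread can narrate what it does: running it with verbose = TRUE prints details on memory mapping, sampling and column type detection (here on the first file from the question):

dt <- data.table::fread(list_of_files[[1]], verbose = TRUE)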

gaut
  • Thank you for the elaborate answer! Could you comment on why `fread` is better than the alternatives? – mhovd Nov 08 '20 at 20:23
  • Thanks again. Only issue I had was with large numeric columns, which were interpreted as `integer64`, but that was easily overcome by using `integer64 = "double"`. – mhovd Nov 09 '20 at 13:25