I am importing many (> 300) .csv files in a project, and I stumbled upon a very strange occurrence.
There is a noticeable difference in object size when comparing the results of read_csv and read.csv.
Windows lists the total size of all files as ~442 MB.
Using readr
library(tidyverse)
datadir <- "Z:\\data\\attachments"
list_of_files <- list.files(path = datadir, full.names = TRUE)
readr_data <- lapply(list_of_files, function(x) {
read_csv(x, col_types = cols())
})
object.size(readr_data)
#> 416698080 bytes
str(readr_data[1])
#> List of 1
#> $ : tibble [2,123 x 80] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
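As a side note, the object_size totals can be printed in megabytes to make them easier to compare with the ~442 MB Windows reports; format() on an object_size object takes a units argument:
format(object.size(readr_data), units = "Mb")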
Using base methods
base_data <- lapply(list_of_files, function(x) {
read.csv(x)
})
object.size(base_data)
#> 393094616 bytes
str(base_data[1])
#> List of 1
#> $ :'data.frame': 2123 obs. of 80 variables:
# Compare size
object.size(readr_data) / object.size(base_data) * 100
#> 106 bytes
Now 6% may not sound like much, but that is still about 23 MB, and I am interested in why they differ. Additionally, both in-memory sizes are smaller than the ~442 MB that Windows reports.
Why do the lists differ in size, and is that important?
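To narrow down where the extra ~23 MB sits, I also compared the per-column sizes of the first file. This is just a diagnostic sketch and assumes the column order and names match between the two reads:
# Per-column sizes (in bytes) for the first file
col_sizes <- data.frame(
  readr_bytes = sapply(readr_data[[1]], object.size),
  base_bytes  = sapply(base_data[[1]], object.size)
)
col_sizes$diff <- col_sizes$readr_bytes - col_sizes$base_bytes
# Columns that account for the largest part of the difference
head(col_sizes[order(col_sizes$diff, decreasing = TRUE), ])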
EDIT: Apparently some of the column classes differ. I compared them like this:
readr_class <- sapply(readr_data[[1]], class)
base_class <- sapply(base_data[[1]], class)
result <- data.frame(readr_class, base_class)
And these are the differences:
     readr_class base_class
var1     numeric    integer
var2     numeric    integer
var3     numeric    integer
var4   character    integer
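If those class differences are the explanation, the vector storage sizes should account for most of the gap: as far as I know, integer vectors use 4 bytes per element and doubles use 8 (and a character column should be larger still, since it stores strings). A quick sanity check on vectors of one file's length:
# integer vs double vectors with the same length as one file (2,123 rows)
object.size(integer(2123))
object.size(numeric(2123))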