
How to read part of the data from very large files?

The sample data is generated as:

set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
                 replicate(10, stringi::stri_rand_strings(1000, 5)))
head(df)
#     X1   X2   X3   X4   X5   X6   X7   X8   X9  X10  X1.1  X2.1  X3.1  X4.1  X5.1  X6.1  X7.1  X8.1  X9.1 X10.1
# 1  575 1843 1854  883  592 1362 1075  210 1526 1365 Qk8NP Xvw9z OYRa1 8BGIV bejiv CCoIE XDKJN HR7zc 2kKNY 1I5h8
# 2 1577  390 1861  912  277  636  758 1461 1978 1865 ZaHFl QLsli E7lbs YGq8u DgUAW c6JQ0 RAZFn Sc0Zt mif8I 3Ys6U
# 3  818 1076  147 1221  257 1115  759 1959 1088 1292 jM5Uw ctM3y 0HiXR hjOHK BZDOP ULQWm Ei8qS BVneZ rkKNL 728gf
# 4 1766  884 1331 1144 1260  768 1620 1231 1428 1193 r4ZCI eCymC 19SwO Ht1O0 repPw YdlSW NRgfL RX4ta iAtVn Hzm0q
# 5 1881 1851 1324 1930 1584 1318  940 1796  830   15 w8d1B qK1b0 CeB8u SlNll DxndB vaufY ZtlEM tDa0o SEMUX V7tLQ
# 6   91  264 1563  414  914 1507 1935 1970  287  409 gsY1u FxIgu 2XqS4 8kreA ymngX h0hkK reIsn tKgQY ssR7g W3v6c

The data frame is saved to a file with saveRDS:

saveRDS(df, 'df.rds')

The file size is checked with the commands below:

file.info('df.rds')$size
# [1] 29935125
utils:::format.object_size(29935125, "auto")
# [1] "28.5 Mb"

The saved file is read back with:

readRDS('df.rds')

However, some of my files are several GB in size, and only a few columns are needed for certain processing. Is it possible to read selected columns from an RDS file?

Note: I already have RDS files, generated after considerably large amounts of processing. Now, I want to know the best possible way to read selected columns from the existing RDS files.
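
If a one-time conversion cost is acceptable, one workaround (a sketch using only base R; the `df_cols` directory name is illustrative) is to read each large RDS file once and re-save every column as its own small .rds file, so later jobs load only the columns they need:

```r
# One-time conversion: split a data frame into per-column .rds files
# so later jobs read only what they need.
df <- data.frame(X1 = 1:5, X2 = letters[1:5], X3 = runif(5),
                 stringsAsFactors = FALSE)

dir.create("df_cols", showWarnings = FALSE)
for (nm in names(df)) {
  saveRDS(df[[nm]], file.path("df_cols", paste0(nm, ".rds")))
}

# Later: rebuild a data frame from only the selected columns.
wanted <- c("X1", "X3")
cols <- lapply(wanted, function(nm) readRDS(file.path("df_cols", paste0(nm, ".rds"))))
subset_df <- as.data.frame(setNames(cols, wanted))
str(subset_df)
```

This still requires reading each big file once, but afterwards every selective read touches only the columns requested.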

Prradep
  • Seriously consider csv files using `fwrite` and `fread` from `data.table` instead. If they are fast enough for you, you have the option to select columns. – anotherfred Nov 14 '17 at 19:45
  • I don't believe that you can do this with RDS. But you might want to check out [`feather`](https://github.com/wesm/feather/tree/master/R), which allows you to do this with `read_feather("mtcars.feather", columns = c("mpg", "cyl"))` – austensen Nov 14 '17 at 19:46
  • Also, there is also a newer package, [`fst`](http://www.fstpackage.org/), which is apparently even faster than `feather` and can also read in only specific columns. – austensen Nov 14 '17 at 19:56
  • I have RDS files and I want to read selected columns from RDS files which contain a single object. Generation of the datasets in another format is highly time-consuming. – Prradep Nov 14 '17 at 20:12
  • Interesting package, @austensen. Note for both `feather` and `fst`: they are still young and both recommend against use for long-term storage of data (as the on-disk format may change). – r2evans Nov 14 '17 at 21:04
  • Prradep: since it appears you are unable to deal with different formats, it may be that you'll have to dive into the source of [`unserializeFromConn`](https://github.com/wch/r-source/blob/0b202fe7ad5189e4e72206bcb3e957535b24baaa/src/main/serialize.c#L2270), the internal function used to actually read the binary structure. Good luck. – r2evans Nov 14 '17 at 21:05
  • It would be great if readRDS had an option to show the structure of RDS files without loading them completely, and also to load just some sections. – skan Nov 27 '20 at 19:10
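Following up on the `fst` suggestion in the comments above, here is a minimal sketch (assuming the `fst` package is installed; the file name is illustrative) of writing the data once in fst format and then reading back only selected columns:

```r
library(fst)  # assumes the fst package is installed

df <- data.frame(X1 = 1:10, X2 = runif(10), X3 = letters[1:10],
                 stringsAsFactors = FALSE)

# Write once in the columnar fst format.
write_fst(df, "df.fst")

# Read back only the columns you need (row ranges are also
# supported via the from/to arguments).
two_cols <- read_fst("df.fst", columns = c("X1", "X3"))
str(two_cols)
```

Like the csv and feather suggestions, this requires converting the existing RDS files once, but every subsequent read is selective.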

2 Answers


I don't think you can read only a portion of an rds or rda file.

An alternative would be to use feather. As an example, using a large-ish feather I'm working with:

library(feather)
file.info("../feathers/C1.feather")["size"]
#                              size
#  ../feathers/C1.feather 498782328

system.time( c1whole <- read_feather("../feathers/C1.feather") )
#     user  system elapsed
#    0.860   0.856   5.540
system.time( c1dyn <- feather("../feathers/C1.feather") )
#     user  system elapsed
#        0       0       0

ls.objects()
#             Type      Size PrettySize          Dim
#  c1dyn   feather      3232     3.2 Kb 2886147 x 36
#  c1whole  tbl_df 554158688   528.5 Mb 2886147 x 36

You can interact with both variables as full data.frames: c1whole is already in memory (so may be a little faster), but accessing c1dyn is still quite speedy.

NB: some functions (e.g., several within dplyr) do not work on feather as they do on data.frame or tbl_df. If your intent is solely to pick-and-choose specific columns, then you'll be fine.

r2evans
  • `fwrite` seems a lot faster than feather - assuming you can use csv of course! https://blog.dominodatalab.com/the-r-data-i-o-shootout/ – anotherfred Nov 14 '17 at 19:49
  • Ok. I suspect that the read time is more of a factor, so a write-once time penalty might be worth it. In this case, the only reason I would personally go with `csv` is if I need the data in a tool other than R and python (such as SQL bulk copy) where `feather` access was not available. – r2evans Nov 14 '17 at 20:00
  • Thanks for the solution. I am looking for ways to read selected columns from existing RDS files. I would consider this approach next time before generating the files in RDS format. – Prradep Nov 14 '17 at 20:14

SQLite is another common way to store tabular/matrix/data-frame data on your hard drive. An SQLite database also allows you to use standard SQL commands or dplyr to interrogate the data. Just be warned that SQLite does not have a date type, so any dates need to be converted to character before being written to the database.

set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
                 replicate(10, stringi::stri_rand_strings(1000, 5)))

library(RSQLite)
conn <- dbConnect(RSQLite::SQLite(), dbname="myDB")
dbWriteTable(conn,"mytable",df)
alltables <- dbListTables(conn)
# Use sql queries to query data...
oneColumn <- dbGetQuery(conn,"SELECT X1 FROM mytable")

library(dplyr)
library(dbplyr)
my_db <- tbl(conn, "mytable")
my_db
# Use dplyr functions to query data...
my_db %>% select(X1)
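
The date caveat mentioned above can be handled with a simple round trip (a base-R sketch; the `events` data frame is illustrative): convert Date columns to character before writing, and parse them back after reading. ISO-8601 strings have the added benefit of sorting correctly in SQL.

```r
# Before writing: store dates as ISO-8601 character strings.
events <- data.frame(id = 1:3,
                     when = as.Date(c("2017-11-14", "2017-11-15", "2017-11-16")))
events$when <- as.character(events$when)   # "2017-11-14", ...
# dbWriteTable(conn, "events", events)     # then write as usual

# After reading back: restore the Date class.
events$when <- as.Date(events$when)
str(events$when)
```
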
Adam Sampson
  • Thanks for the solution. I am looking for ways to read selected columns from existing RDS files. Next time, I would consider using this SQL-based approach in relevant scenarios. – Prradep Nov 14 '17 at 20:14
  • Alas. Sorry I can't be more help with the RDS. I just started using RDS on my most recent project to store trained models and don't know enough about it. – Adam Sampson Nov 14 '17 at 20:24