I'm trying to save disk space by loading a CSV file into R directly from a zip archive using fread(). Is there a way to get something akin to nrow() or dim() for the CSV (inside the zip) before loading it, to get an idea of how large the object will be and avoid running out of available RAM? If there's a better way to determine whether the CSV will be too large once uncompressed and loaded into R, that would also be good to know. Thanks (p.s. using Windows 10).
- https://www.r-bloggers.com/easy-way-of-determining-number-of-linesrecords-in-a-given-large-file-using-r/ – LocoGris Feb 14 '19 at 16:55
- You could also run `unzip -l` in CMD, which lists the contained files along with the total uncompressed size. – Mako212 Feb 14 '19 at 16:56
- Essentially `shell(shQuote(sprintf("unzip -l %s", file.choose())))` – Mako212 Feb 14 '19 at 16:58
- Possible duplicate of [Extract bz2 file in R](https://stackoverflow.com/questions/25948777/extract-bz2-file-in-r) – krads Mar 16 '19 at 11:21
- This isn't a duplicate of that question, because macsmith is asking how to efficiently just do a size/row count. That question only explains how to directly read & interact with the data. – Barett Mar 16 '19 at 22:59
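Building on the comments above (the chunked line-count approach from the r-bloggers link, and Mako212's `unzip -l` suggestion), here is a minimal base-R sketch; `data.zip` and `data.csv` are placeholder names, and it assumes the CSV is a single entry inside the zip:

```r
# 1. Uncompressed size without extracting: unzip(list = TRUE) only reads the
#    zip's table of contents, so nothing is decompressed.
info <- unzip("data.zip", list = TRUE)
info                       # data frame with Name, Length (uncompressed bytes), Date
sum(info$Length) / 1024^2  # total uncompressed size in MB

# 2. Row count without loading the table: unz() opens a connection to one
#    entry inside the zip, and readLines() counts records in chunks, so memory
#    use stays at roughly one chunk of raw lines.
con <- unz("data.zip", "data.csv")
open(con, "r")
n <- 0L
repeat {
  chunk <- readLines(con, n = 100000L)
  if (length(chunk) == 0L) break
  n <- n + length(chunk)
}
close(con)
n - 1L                     # rows, minus 1 for the header line
```

The size listing is essentially `unzip -l` done from within R, so no external unzip binary or `shell()` call is needed.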
1 Answer
A very good alternative, especially for reading zipped files quickly, is vroom (https://vroom.r-lib.org): "... it simply indexes where each record is located so it can be read later." So it should be safe to open very big datasets without the session locking up or running out of RAM.
```r
require(vroom)
vroom("./data.csv.gz")
# indexed 1.00TB in 0s, 1.25PB/s
# Rows: 200
# Columns: 6
# Delimiter: ","
# chr [6]: Column1, Date, Column2, Subtable_Column1, Subtable_Column2, Subtable_Column3
#
# Use `spec()` to retrieve the guessed column specification
# Pass a specification to the `col_types` argument to quiet this message
# A tibble: 200 x 6
# ... <data> ...
```
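If the only goal is to check the dimensions before committing to the data, it should be enough to keep the result and call dim() or nrow() on it, since vroom only indexes the records on read; a small usage sketch, with the same placeholder file name as above:

```r
library(vroom)
df <- vroom("./data.csv.gz")
dim(df)   # rows and columns, known from vroom's index before the cell values are needed
nrow(df)  # the nrow() the question asked about
```

As far as I can tell from the vroom documentation, files ending in .gz, .bz2, .xz, or .zip are uncompressed automatically, so the same call should work on the zipped CSV directly.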

Andre Wildberg