
I have a very large species dataset from GBIF: 178 GB zipped, and approximately 800 GB (TSV) when unzipped. My Mac only has 512 GB of storage and 8 GB of RAM; however, I do not need all of this data.

Are there any approaches I can take to unzip the file without eating all of my memory and storage, extracting only a portion of the dataset by filtering rows on a column? For example, it has occurrence records going back to 1600, but I only need data for the last two years, which I believe my machine can more than handle. Perhaps there is a library with a function that can filter rows while loading the data?
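To make it concrete, below is the kind of thing I imagine, using `data.table::fread()` with its `cmd` argument so that decompression and filtering happen in a pipeline and only the filtered rows ever reach R. The archive name, the name of the TSV inside it, and the `year` column are guesses on my part, and it assumes the system `unzip` on macOS can stream Zip64 archives (entries over 4 GB):

```r
library(data.table)

# Names below are placeholders -- replace with the actual archive and TSV names.
zip_path  <- "gbif_download.zip"
inner_tsv <- "occurrence.tsv"

# `unzip -p` streams the member to stdout (nothing is written to disk), and awk
# keeps the header plus rows whose `year` column is >= 2019, so only the
# filtered rows are handed to fread().
cmd <- sprintf(
  "unzip -p %s %s | awk -F'\\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i == \"year\") c = i; print; next} $c >= 2019'",
  shQuote(zip_path), shQuote(inner_tsv)
)

recent <- fread(cmd = cmd, sep = "\t", quote = "")
```

The point of doing the filtering inside the pipe is that neither the 800 GB TSV nor the unfiltered rows ever touch the disk or R's memory.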

I am unsure how to unzip the file properly. I have looked into unzipping libraries, and according to this article, `unzip` truncates data over 4 GB. My other worry is where I could store 800 GB of data once it is unzipped.

Update: It seems that all the packages I have come across stop at 4 GB after decompression. I am wondering if it is possible to write a function that decompresses up to the 4 GB mark, records the point it has reached, and then resumes decompression from there, continuing until the whole .zip file has been decompressed. It could store the decompressed files in a folder, so that they can be accessed with something like `list.files()`. A rough sketch of what I have in mind is below. Any ideas whether this can be done?
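Here is a rough base-R sketch of that idea, streaming the TSV out of the zip through a `pipe()` connection instead of restarting at the 4 GB mark, filtering each batch of lines on the `year` column, and writing the survivors into a folder. The file names, batch size, and the `year` column are assumptions on my part, and the code is untested:

```r
# Placeholder names -- adjust to the actual download.
zip_path  <- "gbif_download.zip"
inner_tsv <- "occurrence.tsv"
out_dir   <- "gbif_filtered"
dir.create(out_dir, showWarnings = FALSE)

# Stream the TSV member out of the zip; nothing is extracted to disk.
con <- pipe(sprintf("unzip -p %s %s", shQuote(zip_path), shQuote(inner_tsv)), open = "r")

# Read the header once and locate the `year` column.
header   <- readLines(con, n = 1)
year_col <- match("year", strsplit(header, "\t", fixed = TRUE)[[1]])

batch_size <- 100000   # lines per batch; tune so a batch fits in 8 GB of RAM
i <- 0
repeat {
  lines <- readLines(con, n = batch_size)
  if (length(lines) == 0) break

  # Keep only rows from the last couple of years.
  fields <- strsplit(lines, "\t", fixed = TRUE)
  years  <- suppressWarnings(as.integer(vapply(fields, `[`, "", year_col)))
  keep   <- lines[!is.na(years) & years >= 2019]

  if (length(keep) > 0) {
    i <- i + 1
    writeLines(c(header, keep), file.path(out_dir, sprintf("chunk_%04d.tsv", i)))
  }
}
close(con)

# Afterwards the pieces can be picked up with list.files(out_dir, full.names = TRUE).
```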

  • Check this post: https://stackoverflow.com/questions/12460938/r-reading-in-a-zip-data-file-without-unzipping-it/12950811. – tacoman Jun 30 '21 at 10:06
  • @tacoman I have found a recent package named `disk.frame` that is specifically for my purpose (I think so, at least). My data is still downloading, so I hope to update this thread once I have tried that library (and perhaps others), noting any difficulties and posting an answer if it works well. – Stackbeans Jun 30 '21 at 11:22

0 Answers