The skip = and n_max = arguments in readr::read_tsv() control how much data is read from a tab-separated file into a data frame. To read just the first 10 observations so one can see the column names, run:
library(readr)
# download the zipped NSDUH data to a temporary file
file_url <- "http://samhda.s3-us-gov-west-1.amazonaws.com/s3fs-public/field-uploads-protected/studies/NSDUH-2002-2018/NSDUH-2002-2018-datasets/NSDUH-2002-2018-DS0001/NSDUH-2002-2018-DS0001-bundles-with-study-info/NSDUH-2002-2018-DS0001-bndl-data-tsv.zip"
zip <- tempfile(fileext = ".zip")
download.file(file_url, zip, mode = "wb")
# extract the tab-separated file and read the first 10 observations
unzip_f <- unzip(zip, exdir = "./data")
df <- read_tsv(unzip_f, col_names = TRUE, n_max = 10)
At this point we can retrieve the column names with the colnames() function.
c_names <- colnames(df)
Next, we'll measure how much RAM 100,000 rows of the data consume and how long they take to load.
# time the read of the first 100,000 observations and report the object size
system.time(df_100000 <- read_tsv("./data/NSDUH_2002_2018_tab.tsv",
                                  col_names = TRUE, n_max = 100000))
format(object.size(df_100000), units = "auto")
   user  system elapsed
 55.276   4.136  60.559
> format(object.size(df_100000), units = "auto")
[1] "2.7 Gb"
Since 100,000 observations consume about 2.7 GB once loaded, we can safely read about 200,000 observations at a time from the raw data file on a machine that has 8 GB of RAM.
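To make that estimate explicit rather than eyeballing it, one could derive a chunk size from the sample just read. The chunk_rows() helper below is a hypothetical sketch: the 8 GB figure and the two-thirds memory budget are assumptions, not part of the original workflow.
# hypothetical helper: estimate how many rows fit in a given RAM budget,
# based on the bytes per row observed in a sample data frame
chunk_rows <- function(sample_df, ram_bytes = 8 * 1024^3, budget = 2/3) {
  bytes_per_row <- as.numeric(object.size(sample_df)) / nrow(sample_df)
  floor(ram_bytes * budget / bytes_per_row)
}
chunk_rows(df_100000)   # on the order of 200,000 rows for this data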
Next, we'll figure out how many rows of data are in the raw data file. We can use the col_types = argument of read_tsv() and set every column except the first to "_" (skip), which tells read_tsv() not to read that column. We also calculate the size of the resulting one-column data frame containing all observations.
# column spec: "n" (number) for the first column, "_" (skip) for the remaining 3,661
theTypes <- paste0("n", strrep("_", 3661))
system.time(df_obs <- read_tsv("./data/NSDUH_2002_2018_tab.tsv",
                               col_types = theTypes, col_names = TRUE))
nrow(df_obs)
format(object.size(df_obs), units = "auto")
   user  system elapsed
175.208  27.694 210.948
> nrow(df_obs)
[1] 949285
> format(object.size(df_obs), units = "auto")
[1] "39.8 Mb"
It took about 3.5 minutes (211 seconds elapsed) to read all observations for a single column of data from the raw data file on a MacBook Pro 15 with an Intel i7-4870HQ processor at 2.5 GHz.
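If re-running a multi-minute read just to count rows is too slow, an alternative that the answer above doesn't use is to count lines over a plain text connection in base R; the 10,000-line buffer size here is an arbitrary choice.
# alternative (not used above): count lines without parsing any columns
con <- file("./data/NSDUH_2002_2018_tab.tsv", open = "r")
n_lines <- 0
while (length(chunk <- readLines(con, n = 10000)) > 0) {
  n_lines <- n_lines + length(chunk)
}
close(con)
n_lines - 1   # subtract the header row to get the number of observations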
The result of nrow() tells us that there are 949,285 rows in the raw data file. If we break the file up into chunks of 200,000 observations, we can read each chunk and save it as an RDS file with saveRDS() for subsequent processing; five chunks cover all 949,285 rows, with the last chunk holding the remaining 149,285.
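The chunk count used in the loop below can be derived rather than hard-coded; this small sketch just makes the arithmetic explicit.
# derive the number of 200,000-row chunks from the row count
n_rows   <- 949285
n_chunks <- ceiling(n_rows / 200000)   # 5 chunks of at most 200,000 rows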
# read the file iteratively in 200,000-row chunks and write each chunk as RDS
for (i in 1:5) {
  # skip the header row plus all rows in previously read chunks
  df <- read_tsv("./data/NSDUH_2002_2018_tab.tsv",
                 skip = (i - 1) * 200000 + 1,
                 n_max = 200000,
                 col_names = c_names)
  saveRDS(df, paste0("./data/usnuh_", i, ".RDS"))
}
At this point, usnuh_1.RDS through usnuh_5.RDS can be read individually into R and analyzed.
NOTE: the for() loop overwrites the data frame created during the previous iteration, so we can read and write all of the files without running out of RAM. Keep in mind that on a machine with 8 GB of RAM only one 200,000-observation file can be loaded and analyzed at a time. To work with a different portion of the data, remove the current data frame with rm() before loading another 200,000-observation RDS file into RAM.
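As an illustration of that workflow, here is a minimal sketch that loads one chunk at a time, records a per-chunk summary, and frees memory before moving on. The dim() call stands in for whatever analysis is actually needed, and chunk_summaries is a hypothetical name.
# illustrative sketch: process the saved chunks one at a time
chunk_summaries <- vector("list", 5)
for (i in 1:5) {
  df <- readRDS(paste0("./data/usnuh_", i, ".RDS"))
  chunk_summaries[[i]] <- dim(df)   # placeholder for the real analysis
  rm(df)                            # drop the chunk before loading the next one
  gc()                              # trigger garbage collection to release the memory
}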
Reading the last 200,000 rows
Per the comments, here is code that can be used to read the last 200K rows of the file.
# read last 200K rows: first read one row to obtain the column names
library(readr)
df <- read_tsv("./data/NSDUH_2002_2018_tab.tsv",
               col_names = TRUE, n_max = 1)
c_names <- colnames(df)
# next, set skip = relative to the end of the file (the extra 1 skips the header row)
df <- read_tsv("./data/NSDUH_2002_2018_tab.tsv",
               skip = (949285 - 200000) + 1,
               n_max = 200000,
               col_names = c_names)
When we view the data frame in the environment viewer, we can see that it contains 200,000 observations.
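For a check that doesn't rely on an IDE's environment pane, the row count can also be verified in the console; the expected value comes from the n_max requested above.
nrow(df)   # expect 200000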
