
I am trying to download the dataset at the link below. It is about 14,000,000 rows long. I ran the code chunk below and am stuck at `unzip()`: it has been running for a really long time and my computer is hot.

I have tried a few approaches that skip `unzip()`, but then I get stuck at the `read.csv`/`vroom`/`read_csv` step instead. Any ideas? This is a public dataset, so anyone can try it.

library(vroom)

# Download the zip archive to a temporary file
temp <- tempfile()
download.file("https://files.consumerfinance.gov/hmda-historic-loan-data/hmda_2017_nationwide_all-records_labels.zip", temp)

# Extract the CSV from the archive (this is the step that never finishes)
unzip(temp, "hmda_2017_nationwide_all-records_labels.csv")

# Read the extracted CSV
df2017 <- vroom("hmda_2017_nationwide_all-records_labels.csv")

unlink(temp)

KVHelpMe

2 Answers


I was able to download the file to my computer first and then use vroom (https://vroom.r-lib.org/) to read it without unzipping it:

library(vroom)
df2017 <- vroom("hmda_2017_nationwide_all-records_labels.zip")

I get a warning about possible truncation, but the object has these dimensions:

> dim(df2017)
[1] 5448288      78

One nice thing about vroom is that it doesn't load the data straight into memory.
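
If you only need a subset of the columns, vroom's `col_select` argument keeps the footprint down even further. Here is a minimal sketch; the column names are illustrative assumptions and should be checked against the file's header:

library(vroom)

# Read only a few columns; vroom indexes the file lazily, so unselected
# columns are never fully materialised in memory.
# NOTE: the column names below are assumptions for illustration only.
df2017_small <- vroom(
  "hmda_2017_nationwide_all-records_labels.zip",
  col_select = c(as_of_year, respondent_id, loan_amount_000s, action_taken)
)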

brian avery
  • Thanks! I guess I may have to do this, which kind of sucks, because I want to combine the 2007-2017 files (one for each year). It was loading as a 14,000,000 by 62 data frame in Python/pandas, but I want to use R (I kept running out of memory in Python). – KVHelpMe Dec 22 '20 at 01:47
  • One good thing about `vroom` is that it doesn't load it all into memory, but if you want to combine several files it might get ugly in R too. You might want to select only the columns you need before combining; that might save some memory. – brian avery Dec 22 '20 at 02:47
  • It sounds like this could be the old problem of R's `unzip()` with files greater than 4GB (see: https://stackoverflow.com/questions/42740206/r-possible-truncation-of-4gb-file). It seems like you would need to unzip first, then load with vroom. Automatic detection of col_classes seems a bit wonky, too, so something like `vroom("hmda_2017_nationwide_all-records_labels.csv", col_types="dcccdcdcdcdcddcdcdcdccdcdccdcdcdcdcdcdcdcdcdcdcdcdcdcddcdcdcdcdccdcdcddddddddl")` is faster. For me, it still seems to load 9.5GB into memory. – user12728748 Dec 22 '20 at 17:41
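
A minimal sketch of the workaround described in the last comment (unzipping outside of R, since R's `unzip()` silently truncates files larger than 4 GB, then reading with explicit column types); it assumes a command-line `unzip` tool is available:

# Unzip with the system tool instead of R's unzip() to avoid the 4 GB limit
system("unzip hmda_2017_nationwide_all-records_labels.zip")

library(vroom)

# Explicit col_types (taken from the comment above) skip type guessing
df2017 <- vroom(
  "hmda_2017_nationwide_all-records_labels.csv",
  col_types = "dcccdcdcdcdcddcdcdcdccdcdccdcdcdcdcdcdcdcdcdcdcdcdcdcddcdcdcdcdccdcdcddddddddl"
)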

Since the data set is quite large, here are two possible approaches:

With data.table (very fast, only feasible if the data fits into memory)

library(data.table)

# Download and extract with system tools (avoids R's unzip() 4 GB limit)
system('curl https://files.consumerfinance.gov/hmda-historic-loan-data/hmda_2017_nationwide_all-records_labels.zip > hmda_2017_nationwide_all-records_labels.zip && unzip hmda_2017_nationwide_all-records_labels.zip')

dat <- fread("hmda_2017_nationwide_all-records_labels.csv")
# System errno 22 unmapping file: Invalid argument
# Error in fread("hmda_2017_nationwide_all-records_labels.csv") : 
#   Opened 10.47GB (11237068086 bytes) file ok but could not memory map it.
# This is a 64bit process. There is probably not enough contiguous virtual memory available.

With readLines (read data step-wise)

# Open a read connection to the extracted CSV
f <- file("./hmda_2017_nationwide_all-records_labels.csv", "r")

# If the file has a header, read it and split it into column names
header <- unlist(strsplit(unlist(strsplit(readLines(f, n=1), "\",\"")), ","))

# Read the first 100 data rows, split each on the quoted-comma delimiter,
# and assemble them into a data frame
dd <- as.data.frame(t(data.frame(strsplit(readLines(f, n=100), "\",\""))))
colnames(dd) <- header
rownames(dd) <- 1:nrow(dd)

Repeat and add to the data frame if needed:

# Read the next 10 rows and append them to the existing data frame
de <- t(as.data.frame(strsplit(readLines(f, n=10), "\",\"")))
colnames(de) <- header
dd <- rbind(dd, de)
rownames(dd) <- 1:nrow(dd)

close(f)

Use `seek()` to jump to a specific position within the file.
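
For example, a minimal sketch of saving a position with `seek()` and jumping back to it later (offsets are byte positions; R's documentation cautions against relying on `seek()` on Windows):

f <- file("./hmda_2017_nationwide_all-records_labels.csv", "r")
invisible(readLines(f, n = 1))      # skip the header line
pos <- seek(f)                      # remember the current byte offset
invisible(readLines(f, n = 100))    # read 100 rows further into the file
seek(f, where = pos)                # jump back to the saved offset
chunk <- readLines(f, n = 100)      # re-reads the same 100 rows
close(f)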

Andre Wildberg