
I've been trying to load the Fannie Mae loan performance data in R, which is available in .txt format from their website (https://loanperformancedata.fanniemae.com/lppub/index.html#). I'm using the data import code they provide, but I run into the error "cannot allocate vector of size n Mb". For now I'm only trying to read and rbind 4 files (each roughly 600-700 MB), but I will need to do this for many more. I'm on a laptop with 8 GB of RAM and 64-bit RStudio. Any suggestions on what to do? The code uses fread along with doMC/doParallel, and as I understand it, that's about as efficient as it can be.

Here's the code:

# `numberofcores`, `Performance`, `Performance_ColClasses` and
# `Performance_Variables` are defined earlier in the Fannie Mae import script.
library(data.table)
library(doMC)

registerDoMC(30)

Performance_Data <- foreach(k = 1:numberofcores, .inorder = FALSE, .combine = rbind,
                            .packages = c("data.table")) %do% {
  Data_P <- fread(Performance[k], sep = "|", colClasses = Performance_ColClasses, showProgress = TRUE)
  setnames(Data_P, Performance_Variables)
  setkey(Data_P, "LOAN_ID")
}
ANP
  • Running stuff in parallel increases the amount of required memory, as R needs to run multiple instances and load the data in each one. It might help to run the code on one core. But if you have many more files, you might need a machine with more RAM at some point. – JBGruber Jul 06 '18 at 13:36
  • @JonGrub Thanks for the reply. I'd been looking at the memory usage, and just opening RStudio reduces available memory to 2.5 GB. If I don't use parallel processing, would it be able to read such large files without an issue? Each file is 700 MB. – ANP Jul 06 '18 at 13:39
  • @JonGrub I actually tried reading in the first 4 files without parallel processing, and I'm still facing the same issue. – ANP Jul 06 '18 at 13:51
  • Do you have objects in your environment when you open RStudio? Maybe there is an .RData file in your default working directory which is loaded at startup? The way you describe it, R would consume around 5 GB of RAM, which is excessive and not normal. – JBGruber Jul 06 '18 at 14:01
  • Run `ls()` and/or look at the Environment tab in RStudio to see the objects in your environment. Use `rm()` to clean up unused objects. If you continue to have trouble with RStudio, try using the R command line. – Gregor Thomas Jul 06 '18 at 14:11
  • Suggested dupe: [R memory management / cannot allocate vector of size n Mb](https://stackoverflow.com/q/5171593/903061) – Gregor Thomas Jul 06 '18 at 14:12
  • *"If I don't use parallel processing, would it be able..."*, perhaps I'm missing something, but if you can load it into memory with parallel procs, you can load it with single. The reverse is not guaranteed, since there is some memory overhead to having multiple R sessions on the same machine. Bottom line: I don't know of a memory cap on a single R session, so splitting into multiple will do nothing to enable loading more data. – r2evans Jul 06 '18 at 14:44
  • @JonGrub I did happen to have a .RData file loaded on startup by default, but even after correcting that I am still having trouble. Although the available RAM on starting R is now 5 GB. – ANP Jul 06 '18 at 17:56
  • @Gregor Tried cleaning up the environment, but no change! I'm new to R, so could you tell me how to run/use R scripts from the command line? I don't have Linux, it's Windows. – ANP Jul 06 '18 at 17:58
  • As long as R is on your path, open up the command prompt and type `R`. On Windows, you might try the RGui that comes with the R installation as a lighter-weight IDE. That said, it sounds like you did have a change, with 5 GB available instead of 2.5. – Gregor Thomas Jul 06 '18 at 18:01
  • Yes, after cleaning the environment and having no preloaded data on startup, I do get 5 GB, but it doesn't seem to make a difference. On running the code, used RAM shoots to almost 7 GB, and then I see the error. – ANP Jul 06 '18 at 18:04
  • You need to add the path to the R binary to your PATH environment variable. The path looks somewhat like this: `C:\Program Files\R\R-3.5.0\bin\`. Look up how to add something to PATH for your Windows version. – JBGruber Jul 06 '18 at 18:04

1 Answer


As has already been identified in the comments, your problem is a lack of available memory. This might seem surprising, as you have 8 GB of total RAM. However, with txt files of more than a couple of hundred MB, that might simply not be enough, even when you do not go parallel (which causes additional memory overhead).

Here is an example of something I tried recently. As the data you describe is not available without an account, it makes more sense to demonstrate with a public dataset:

library("data.table")
download.file(url = "http://download.geonames.org/export/dump/allCountries.zip",
              destfile = "allCountries.zip",
              mode = "wb",
              cacheOK = FALSE,
              extra = character())
unzip(zipfile = "allCountries.zip")

These first couple of lines simply download and unzip the .txt file with the data. Note that the unzipped .txt file is 1.4 GB.

geonames <- fread("allCountries.txt", 
                  quote = "",
                  sep = "\t",
                  col.names = c(
                    "geonameid",
                    "name",
                    "asciiname",
                    "alternatenames",
                    "latitude",
                    "longitude",
                    "feature class",
                    "feature code",
                    "country code",
                    "cc2",
                    "admin1 code",
                    "admin2 code",
                    "admin3 code",
                    "admin4 code",
                    "population",
                    "elevation",
                    "dem",
                    "timezone",
                    "modification date"
                  ))

format(object.size(geonames), units = "Gb")
#> [1] "2.9 Gb"

As you can see, the data more than doubled in size after it was read into R. To read in four 700 MB files, you would thus need around 5.6 GB of available RAM. On Windows that can be a challenge, depending on what else is running in the background. You could think about reading in the files one by one, saving each as an .RDS file, and then merging them together (see the sketch below). But that wouldn't change the fact that you can't have all the data open in R at once.
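Here is a minimal sketch of that one-file-at-a-time approach. It assumes the `Performance`, `Performance_ColClasses` and `Performance_Variables` objects from the Fannie Mae import script are already defined; the .rds file names are just placeholders:

library(data.table)

# Read each file, save it to disk as .rds, and free the memory again.
for (k in seq_along(Performance)) {
  Data_P <- fread(Performance[k], sep = "|",
                  colClasses = Performance_ColClasses, showProgress = TRUE)
  setnames(Data_P, Performance_Variables)
  saveRDS(Data_P, paste0("performance_", k, ".rds"))
  rm(Data_P)
  gc()
}

# Only if everything fits into RAM at once:
# files <- list.files(pattern = "^performance_.*\\.rds$")
# Performance_Data <- rbindlist(lapply(files, readRDS))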

What I would suggest is to look into the dbplyr package and save your data in, for example, an SQLite database. The way forward would be to read in the files one by one and write the data to the database (see the sketch below). This way you can query only the data you need, when you need it. Or get more RAM; that would also help.
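A rough sketch of filling such a database using DBI and RSQLite, again assuming the objects from the Fannie Mae import script are defined; the database file name and table name are made up:

library(DBI)
library(RSQLite)
library(data.table)

con <- dbConnect(RSQLite::SQLite(), "fannie_mae.sqlite")

for (k in seq_along(Performance)) {
  Data_P <- fread(Performance[k], sep = "|",
                  colClasses = Performance_ColClasses, showProgress = TRUE)
  setnames(Data_P, Performance_Variables)
  dbWriteTable(con, "performance", Data_P, append = TRUE)  # append each file to the same table
  rm(Data_P)
  gc()
}

dbDisconnect(con)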

JBGruber
  • Thank you so much for the suggestions!! I'll try it out and hopefully it'll work. If I am using an SQLite database, can I merge the files together and work on them as a whole? I need to run some statistical analysis, which won't work in pieces. – ANP Jul 07 '18 at 01:20
  • When you have an SQLite database, you can pull only the data you need from it without loading everything into your memory. If you want to run, e.g., a regression, you can use just the columns you need and even filter out values you are not interested in (e.g. outliers); a short sketch of this is shown after these comments. Using `dbplyr` you can use the same logic which is behind `dplyr` to directly work with the data. I learned it from this [vignette](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html) and was surprised how easy it is. – JBGruber Jul 07 '18 at 09:35
  • I set up a database using SQLite and tried to read 2 of the files into a table, but it won't work. Do you suggest more RAM as the only good option? Because I need to merge all the files together and run a predictive model. How can I use the data stored in the DB for that? – ANP Jul 09 '18 at 11:09
  • You get an out-of-memory error with just 2 of the files? Have you tried just one? Just one core as well? If so, I think getting more RAM is the only option. But I'm surprised, as it shouldn't take up that much. Have you looked into the dbplyr vignette? I think you should be able to run a predictive model, but I have no experience with that. Another option would be to use cloud computing. There is a nice package which covers the big services, I think: https://github.com/cloudyr – JBGruber Jul 09 '18 at 11:35
  • Yes, I even tried it with 1 file, and using just one core. Will look into cloud computing and meanwhile try to upgrade the RAM. – ANP Jul 10 '18 at 06:48
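As referenced in the comments above, a minimal sketch of querying such a database with dplyr/dbplyr once it is filled. `LOAN_ID` comes from the question's code; `LOAN_AGE` and `CURRENT_UPB` are placeholder column names to be swapped for whatever variables the model actually needs:

library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), "fannie_mae.sqlite")
performance <- tbl(con, "performance")  # lazy reference, nothing is loaded yet

# Select and filter on the database side, then pull only that subset into RAM.
model_data <- performance %>%
  select(LOAN_ID, LOAN_AGE, CURRENT_UPB) %>%
  filter(!is.na(CURRENT_UPB)) %>%
  collect()

fit <- lm(CURRENT_UPB ~ LOAN_AGE, data = model_data)

dbDisconnect(con)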