
I have a .txt file that is roughly 28 GB and 100 million rows. Originally we used Python with a mix of Pandas and SQLite3 to load this data into a .db file that we could query. However, my team is more familiar with R, so we want to work with the .db file in R instead, and that runs into a memory limit. Is there a workaround for this error? Is it possible to partially load some of the data into R?

library(RSQLite)

filename <- "DB NAME.db"
sqlite.driver <- dbDriver("SQLite")
db <- dbConnect(sqlite.driver,
                dbname = filename)

## Some operations
dbListTables(db)
mytable <- dbReadTable(db, "TABLE NAME")  # fails here: tries to read the entire 100M-row table into memory

Error: vector memory exhausted (limit reached?)
AENick
  • Perhaps this is relevant? https://stackoverflow.com/questions/51295402/r-on-macos-error-vector-memory-exhausted-limit-reached. Your system can use paging (pretending your disk is RAM) to fake more memory so this is probably an R problem. – nlta May 06 '21 at 20:47
  • To me, the definition of "Big Data" is heavily conditioned on the computer doing the work: if your computer can hold everything in memory, then it is not Big. One way to attack a challenge with Big Data is to get more RAM in the computer, so that you can load everything and do what you want with it; this approach has its limits, though, and as data multiplies, this may be a difficult race to maintain. A more future-proof approach involves using a database and doing as much as you can within the database. This "db" can be sqlite/duckdb, a larger dbms, or perhaps "disk-frames". – r2evans May 06 '21 at 20:53
  • I suggest that extending the memory-use of R so that it starts using paging in your OS is going to slow things down to a horrendous crawl, especially when your OS starts "thrashing". This can be especially problematic with R with its *copy-on-write* and *pass-by-value* semantics, where changes to a frame often result in part or all of it being duplicated in memory, not what you need here. One way to mitigate that is with `data.table`, which uses *by-reference semantics*. That assumes you can get it into memory, though, so is a discussion for another day. Good luck! – r2evans May 06 '21 at 20:54
  • Is it really necessary to hold the complete table in RAM? If not, then read records sequentially or access only the part of the records needed. This is the usual database approach. More can be found in the CRAN task view about [Databases](https://cran.r-project.org/web/views/Databases.html). If you prefer to access the data as a whole table, consult the section "Large memory and out-of-memory data" of [High-Performance and Parallel Computing with R](https://cran.r-project.org/web/views/HighPerformanceComputing.html), or buy more RAM, or book a suitable cloud server. – tpetzoldt May 06 '21 at 21:33
  • A special case of @tpetzoldt's comment: try the [dbplyr package](https://cran.r-project.org/web/packages/dbplyr/index.html) to query a DB file from within R. (It's included in the Databases task view). – Ben Bolker May 06 '21 at 21:44
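
Building on the two comments above, here is a minimal sketch of the dbplyr approach, assuming the table name from the question and the placeholder columns col_1 and col_2 from the answer below. The dplyr verbs are translated to SQL and executed inside SQLite, so only the (much smaller) query result is pulled into R:

library(DBI)
library(dplyr)
library(dbplyr)

db <- dbConnect(RSQLite::SQLite(), "DB NAME.db")

big_tbl <- tbl(db, "TABLE NAME")  # lazy reference to the table; nothing is loaded yet

result <- big_tbl %>%
  filter(col_1 > 0) %>%           # runs inside SQLite, not in R
  count(col_2) %>%
  collect()                       # only the aggregated result is brought into R

dbDisconnect(db)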

1 Answer


You might want to try the vroom package to load the .txt file you mention, which allows you to select which columns to load, thus conserving memory.

filename <- "DB_NAME.txt"
db <- vroom::vroom(filename, col_select = c(col_1, col_2))

Have a look at the vroom vignette for more information about how memory is conserved.
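
If you keep the data in the existing SQLite file instead, another way to limit memory use is to fetch rows in fixed-size chunks rather than reading the whole table with dbReadTable. A minimal sketch, assuming the table name from the question:

library(RSQLite)

db  <- dbConnect(RSQLite::SQLite(), "DB NAME.db")
res <- dbSendQuery(db, 'SELECT * FROM "TABLE NAME"')

while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 1e6)  # one million rows at a time; tune to available RAM
  ## process, aggregate, or write out each chunk here before fetching the next one
}

dbClearResult(res)
dbDisconnect(db)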

SEAnalyst