
I'm working with limited RAM (AWS free tier EC2 server - 1GB).

I have a relatively large txt file, "vector.txt" (800 MB), that I'm trying to read into R. Having tried various methods, I have failed to read this file into memory.

So, I was researching ways of reading it in chunks. I know the dimensions of the resulting data frame should be 300K x 300. If I could read the file in, say, 10K lines at a time and save each chunk as an RDS file, I would be able to loop over the results and get what I need, albeit a little slower and less conveniently than having the whole thing in memory.
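Roughly what I have in mind (just a sketch, assuming the file is space-separated, has a single header line holding the dimensions, and exactly 300K data rows; file names are illustrative):

# Read the file 10K rows at a time and save each chunk as an RDS file
n_total <- 300000
n_chunk <- 10000

for (i in seq_len(ceiling(n_total / n_chunk))) {
  chunk <- read.table("vector.txt",
                      skip  = 1 + (i - 1) * n_chunk,  # 1 header line + rows already read
                      nrows = n_chunk,
                      stringsAsFactors = FALSE)
  saveRDS(chunk, sprintf("word_vectors_%03d.rds", i))
  rm(chunk); gc()  # release the chunk before reading the next one
}

(Each read.table() call re-scans the skipped lines, so this gets slower towards the end of the file, but it only ever holds one 10K-row chunk in RAM.)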

To reproduce:

# Get data
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)

# word2vec r library
library(rword2vec)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")

So far so good. Here's where I struggle:

word_vectors = as.data.frame(read.table("vector.txt",skip = 1, nrows = 10))

Returns "cannot allocate a vector of size [size]" error message.

Tried alternatives:

word_vectors <- ff::read.table.ffdf(file = "vector.txt", header = TRUE)

Same problem: not enough memory.

word_vectors <- readr::read_tsv_chunked("vector.txt", 
                                        callback = function(x, i) saveRDS(x, i),
                                        chunk_size = 10000)

Resulted in:

Parsed with column specification:
cols(
  `299567 300` = col_character()
)
|=========================================================================================| 100%  817 MB
Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs,  : 
  Evaluation error: bad 'file' argument.
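I suspect the "bad 'file' argument" comes from the callback: the second argument `i` is the chunk's starting position (a number), and saveRDS() treats it as the file path. The single `299567 300` column in the spec also suggests the tab delimiter never split the space-separated lines, and that the dimensions header line was taken as a column name. A sketch of a corrected attempt (assumptions: space delimiter, one header line to skip; file names are illustrative):

readr::read_delim_chunked(
  "vector.txt",
  callback = function(x, pos) {
    # pos is the starting row of the chunk; use it to build a real file name
    saveRDS(x, sprintf("word_vectors_%d.rds", pos))
  },
  chunk_size = 10000,
  delim = " ",       # the file is space-separated, not tab-separated
  col_names = FALSE,
  skip = 1           # skip the "299567 300" dimensions line
)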

Is there any other way to turn vector.txt into a data frame? Maybe by breaking it into pieces, reading in each piece, saving each as a data frame and then as an RDS file? Or any other alternatives?

EDIT: From Jonathan's answer below, tried:

library(rword2vec)
library(RSQLite)

# Download pre trained Google News word2vec model (Slimmed down version)
# https://github.com/eyaler/word2vec-slim
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")


# from https://privefl.github.io/bigreadr/articles/csv2sqlite.html
csv2sqlite <- function(tsv,
                       every_nlines,
                       table_name,
                       dbname = sub("\\.txt$", ".sqlite", tsv),
                       ...) {

  # Prepare reading
  con <- RSQLite::dbConnect(RSQLite::SQLite(), dbname)
  init <- TRUE
  fill_sqlite <- function(df) {

    if (init) {
      RSQLite::dbCreateTable(con, table_name, df)
      init <<- FALSE
    }

    RSQLite::dbAppendTable(con, table_name, df)
    NULL
  }

  # Read and fill by parts
  bigreadr::big_fread1(tsv, every_nlines,
                       .transform = fill_sqlite,
                       .combine = unlist,
                       ... = ...)

  # Returns
  con
}

vectors_data <- csv2sqlite("vector.txt", every_nlines = 1e6, table_name = "vectors")

Resulted in:

Splitting: 12.4 seconds.

Error: nThread >= 1L is not TRUE

1 Answer


Another option would be to do the processing on-disk, e.g. using an SQLite file and dplyr's database functionality. Here's one option: https://stackoverflow.com/a/38651229/4168169

To get the CSV into SQLite you can also use the bigreadr package which has an article on doing just this: https://privefl.github.io/bigreadr/articles/csv2sqlite.html
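For example, once the vectors are in SQLite, dplyr's database backend (dbplyr) can query the table lazily so only the rows you ask for are pulled into RAM. A sketch (the table and column names depend on how the import step named them):

library(DBI)
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), "vector.sqlite")

vectors <- tbl(con, "vectors")   # lazy reference; nothing is read yet

# fetch only the rows for a couple of words (assuming V1 holds the word)
vectors %>%
  filter(V1 %in% c("king", "queen")) %>%
  collect()

DBI::dbDisconnect(con)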

Jonathan Carroll
  • Thanks for the suggestion. The problem here is, after reviewing SQLite, I would need to create a table with the right field names before adding data to a table. Since I'm unable to even read part of the file I would be just guessing how many fields – Doug Fir Oct 08 '18 at 14:03
  • You could read a small chunk of the file into R and create the (empty) SQLite table from that (via RSQLite) then update it with the full data. That's pretty much what `bigreadr` does... I'll update my answer. – Jonathan Carroll Oct 08 '18 at 23:04
  • Hi @Jonathan, I tried following the example in your link with the results ```Splitting: 12.4 seconds. Error: nThread >= 1L is not TRUE ```. If you paste the entire code block from below my edit you should (hopefully) be able to replicate. Downloading the file with line ```download.file(url, file)``` will take a few minutes though. I feel like I'm close. I suspect maybe the format of the txt file (the example uses a csv). – Doug Fir Oct 09 '18 at 15:56
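A sketch of the "create the table from a small sample, then append" idea from the comments above (assuming vector.txt is space-delimited with a one-line dimensions header; names are illustrative):

library(RSQLite)

con <- dbConnect(SQLite(), "vector.sqlite")

# 1. Read a handful of rows just to get the column layout
sample_rows <- read.table("vector.txt", skip = 1, nrows = 5, stringsAsFactors = FALSE)
dbCreateTable(con, "vectors", sample_rows)   # creates an empty table with matching fields

# 2. Append the full data in chunks
chunk_size <- 10000
skip <- 1
repeat {
  chunk <- tryCatch(
    read.table("vector.txt", skip = skip, nrows = chunk_size, stringsAsFactors = FALSE),
    error = function(e) NULL   # read.table errors once no lines are left
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  dbAppendTable(con, "vectors", chunk)
  skip <- skip + chunk_size
}

dbDisconnect(con)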