
fread cannot read a 300 MB .csv file even though 200 GB of RAM is free; it fails with an error:

Error: cannot allocate vector of size 5.6 Mb

Task manager screenshot:


The file contains 373522 rows and 401 columns, of which 1 column (the identifier) is character and 400 columns are numeric.

UPD: this issue does not seem to be related to a lack of RAM but to fread's allocation mechanism, because, as mentioned above, I have 200 GB of free RAM and want to read only a 300 MB csv file with numeric columns.

UPD2: VERBOSE output added

How I read the file:

# first pass: read a single row just to get the number of columns
data <- fread(
    file = fn,
    sep = ",",
    stringsAsFactors = FALSE,
    data.table = FALSE,
    nrows = 1
)

# the first column is a character identifier, the remaining columns are numeric
col_classes <- c(
    "character",
    rep("numeric", ncol(data) - 1)
)

# second pass: read the whole file with explicit column classes
data <- fread(
    file = fn,
    sep = ",",
    na.strings = c("NA", "na", "NULL", "null", ""),
    stringsAsFactors = FALSE,
    colClasses = col_classes,
    showProgress = TRUE,
    data.table = FALSE
)


File size:
> file.size(fn)
[1] 331201365

Session info:

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.3.1     purrr_0.2.5       dplyr_0.7.8       data.table_1.12.0 crayon_1.3.4     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       assertthat_0.2.0 R6_2.3.0         magrittr_1.5     pillar_1.3.1     stringi_1.2.4   
 [7] rlang_0.3.1      rstudioapi_0.9.0 bindrcpp_0.2.2   tools_3.5.2      glue_1.3.0       yaml_2.2.0      
[13] compiler_3.5.2   pkgconfig_2.0.2  tidyselect_0.2.5 bindr_0.1.1      tibble_2.0.1

Verbose output of fread (obtained by adding `verbose = TRUE` to the call above):

omp_get_max_threads() = 64
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 64 threads (omp_get_max_threads()=64, nth=64)
  NAstrings = [<<NA>>, <<na>>, <<NULL>>, <<null>>, <<>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file I:/secret_file_name.csv
  File opened, size = 315.9MB (331201365 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<id,column_1>>
[06] Detect separator, quoting rule, and ncolumns
  Using supplied sep ','
  sep=','  with 100 lines of 301 fields using quote rule 0
  Detected 301 columns on line 1. This line is either column names or first data row. Line starts as: <<id,column_1>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 301
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (331201363 bytes from row 1 to eof) / (2 * 163458 jump0size) == 1013
  Type codes (jump 000)    : A7777777777777777777777557777777557777775577777777755777777755777555777555777777...7777777777  Quote rule 0
  Type codes (jump 002)    : A7777777777777777777777557777777557777775577777777755777777755777777777557777777...7777777777  Quote rule 0
  Type codes (jump 020)    : A7777777777777777777777557777777557777775577777777755777777755777777777557777777...7777777777  Quote rule 0
  Type codes (jump 027)    : A7777777777777777777777777777777557777775577777777755777777777777777777557777777...7777777777  Quote rule 0
  Type codes (jump 058)    : A7777777777777777777777777777777777777777777777777777777777777777777777557777777...7777777777  Quote rule 0
  Type codes (jump 100)    : A7777777777777777777777777777777777777777777777777777777777777777777777557777777...7777777777  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (float64) in the rest of the 10059 sample rows
  =====
  Sampled 10059 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 331159811
  Line length: mean=903.71 sd=756.62 min=326 max=4068
  Estimated number of rows: 331159811 / 903.71 = 366444
  Initial alloc = 732888 rows (366444 + 100%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 6 type and 0 drop user overrides : A7777777777777777777777777777777777777777777777777777777777777777777777777777777...7777777777
[10] Allocate memory for the datatable
  Allocating 301 column slots (301 - 0 dropped) with 732888 rows
Error: cannot allocate vector of size 5.6 Mb
  • Possible duplicate of [R memory management / cannot allocate vector of size n Mb](https://stackoverflow.com/questions/5171593/r-memory-management-cannot-allocate-vector-of-size-n-mb) – NelsonGon Feb 05 '19 at 06:13
  • Check what is returned by `.Machine$sizeof.pointer`: 8 means 64-bit, which is fine; if it is 4 then that can be an issue as well – Jabir Feb 05 '19 at 06:48
  • .Machine$sizeof.pointer returns 8. I assume the issue is related to the allocation mechanism of fread, because my other R jobs can use all 512 GB of RAM. – cat_zeppelin Feb 05 '19 at 07:00
  • @cat_zeppelin can you please run memory.limit() and share the result – Jabir Feb 05 '19 at 07:06
  • @Jabir > memory.limit() [1] 523713 – cat_zeppelin Feb 05 '19 at 07:09
  • @cat_zeppelin Can you please check memory.size() ? – Jabir Feb 05 '19 at 07:11
  • Yes, show `verbose` output and also at least some lines from the file. – Roland Feb 05 '19 at 07:12
  • Also, why do you use `fread` twice for the same file? – Roland Feb 05 '19 at 07:18
  • @Roland I use fread the 1st time to read 1 row in order to get the number of columns, and the 2nd time to read the entire file with the specified colClasses. All columns of my file are numeric besides the 1st column, which is a character identifier. Verbose output added to the question body – cat_zeppelin Feb 05 '19 at 07:42
  • @Jabir > memory.size() [1] 523658.7 – cat_zeppelin Feb 05 '19 at 07:45
  • Do one thing: run gc() and then check memory.size() and memory.limit(). – Jabir Feb 05 '19 at 07:48
  • gc() changed nothing. I trust Hadley who says gc() never helps. – cat_zeppelin Feb 05 '19 at 08:06
  • Start a new R session, try to read a smaller file (fewer rows, fewer columns?), try to use fread once since we already know the number and types of the columns. – zx8754 Feb 05 '19 at 09:38
  • It seems the issue is related to a memory leak caused by fread's low-level C allocation mechanism (please see the screenshot in the body of the question, "committed memory" vs "in use"). I'll try to use readr or something and report if it works – cat_zeppelin Feb 05 '19 at 09:45
  • Restarting the R session helped. All objects were restored properly and committed memory became equal to "in use", i.e. around 290 GB. As I said, the problem is probably caused by fread's memory leak. – cat_zeppelin Feb 05 '19 at 10:03
  • If restarting helped, that is great, do you think this problem is reproducible? – zx8754 Feb 05 '19 at 10:51
  • If you can reproduce this problem, please [file an issue on the data.table Github development page](https://github.com/Rdatatable/data.table). – Jaap Feb 05 '19 at 11:01
  • @Jaap, yes, the issue is reproducible. I'll submit it in the nearest future. – cat_zeppelin Feb 05 '19 at 11:08

2 Answers


R processes your data in RAM, so the size of your global environment can be at most the amount of RAM allocated to R.

Here are some tricks:

1 - use gc() to force garbage collection

2 - delete unnecessary data

3 - use smaller data types like integer instead of numeric

Have a look at my previous answer here.
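
For illustration, a minimal sketch of the three tricks (the file name, column indices and object name below are placeholders, not taken from the question):

library(data.table)

# 3 - ask fread for smaller types where the values allow it:
#     an integer column needs 4 bytes per value instead of 8 for a numeric one
dt <- fread("some_file.csv",
            colClasses = list(character = 1, integer = 2:10))

# 2 - delete objects you no longer need
rm(dt)

# 1 - force a garbage collection so the freed objects are actually collected
gc()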


It seems the memory leak is not related to data.table::fread(). I had a memory leak when I was trying to read 74 csv files, 82 GB in total, and bind_cols all of them. I used the following (pseudo)code for this:


library(data.table)
library(dplyr)    # for %>% and bind_cols()

files <- list.files(some_dir)

result <- fread("the_main_file.csv", ...)

# to ensure the ids of all files are equal, to avoid a left_join
check_ids <- c()

for (fn in files) {
    data <- fread(fn, other_important_parameters)
    check_ids <- check_ids %>% c(sum(result$id == data$id))
    data$id <- NULL
    result <- result %>% bind_cols(data)
    # this probably is useless
    remove(data)
}

In this case I get a "cannot allocate..." error with 290 of 512 GB of RAM in use and 512 of 512 GB committed (much like the screenshot in the question body).

But when I bind a subset of the data (n-1 rows instead of n rows) I do not have any issues with memory! Task Manager shows 290 GB both in use and committed.

Here is the updated pseudocode, which works without the memory leak:


files <- list.files(some_dir)

result <- fread("the_main_file.csv", ...)

# drop one row, so that n - 1 rows are bound instead of n
ix <- rep(TRUE, nrow(result))
ix[1] <- FALSE

result <- result[ix, ]

# to ensure the ids of all files are equal, to avoid a left_join
check_ids <- c()

for (fn in files) {
    data <- fread(fn, other_important_parameters)
    # compare against the same subset of rows as result
    check_ids <- check_ids %>% c(sum(result$id == data$id[ix]))
    data$id <- NULL
    result <- result %>% bind_cols(data[ix, ])
    # this probably is useless
    remove(data)
}

So it seems fread works correctly, and something goes wrong when binding the data.frames...

Could someone explain why it happens?
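
For reference, an untested sketch (directory, file name and fread arguments are placeholders, as in the loops above) of an alternative that appends the new columns to result by reference with data.table instead of copying the whole frame with bind_cols on every iteration:

library(data.table)

files <- list.files(some_dir)

# keep result as a data.table (the fread default) so := can modify it in place
result <- fread("the_main_file.csv")

for (fn in files) {
    data <- fread(fn)                      # other fread parameters omitted
    stopifnot(all(result$id == data$id))   # same id check as in the loops above
    data[, id := NULL]                     # drop the id column by reference
    # append the new columns by reference, without copying result;
    # this assumes the column names of the individual files do not collide
    result[, (names(data)) := data]
}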

  • I see why you're doing it sequentially, but does `datalist <- lapply(files, fread, other_parameters); result <- bind_cols(datalist)` work? (I know this leaves out a few details of what you want to do ...) – Ben Bolker Feb 10 '19 at 15:56
  • @BenBolker Of course, there are other ways to get the same result. I use a regular for loop because I have a progress bar, colorful console output and some other auxiliary things inside the loop. My goal is to figure out why my original code causes a memory leak. – cat_zeppelin Feb 10 '19 at 16:10
  • My point wasn't about for vs lapply (I don't care :-) ) but about reading the individual elements into a list first, then applying bind_cols to the list, rather than iteratively binding. Naively it would seem that iteratively binding would take less memory overhead, but I wonder if that naive expectation is right ... – Ben Bolker Feb 10 '19 at 16:27
  • @BenBolker, I see your point. Thank you. But the main issue is not a lack of RAM but a memory leak. I cannot read all N rows due to this problem, but binding N-1 rows works perfectly. – cat_zeppelin Feb 11 '19 at 08:09
  • @BenBolker the lapply approach does not work either: 268 GB of RAM in use but 511 of 511 GB committed. – cat_zeppelin Feb 11 '19 at 08:38