
I have never had this problem before, but now it comes up constantly for any piece of code I write.

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
      

I have a dataset of 1,482,236 observations and 52 variables; 35 of those are factors and 19 are numeric. Some of my factors are huge, with lots of levels. These are the largest:

forename: 114942 levels
surname: 201988 levels
postcode: 793876 levels
partnername: 9164 levels
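
(For reference, a minimal sketch of how counts like these can be checked; DFOld is the raw data frame used in the code further down:)

## number of levels per factor column, largest first (sketch)
lvls <- sapply(Filter(is.factor, DFOld), nlevels)
sort(lvls, decreasing = TRUE)[1:4]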

I have tried different functions and they all fail for the same reason: Error: cannot allocate vector of size X

1. Correlation between categorical variables:
library(dplyr)
library(purrr)   # map2_df()
library(lsr)     # cramersV() (assuming the lsr package here)

## keep only the factor columns
df <- DFOld %>% dplyr::select_if(is.factor)

## function returning the chi-square p-value and Cramér's V for one pair of columns
f <- function(x, y) {
  tbl <- df %>% dplyr::select(all_of(c(x, y))) %>% table()
  chisq_pval <- round(chisq.test(tbl)$p.value, 4)
  cramV <- round(cramersV(tbl), 4)
  data.frame(x, y, chisq_pval, cramV)
}

## create unique combinations of column names
df_comb <- data.frame(t(combn(sort(names(df)), 2)), stringsAsFactors = FALSE)

## apply f() to every pair of columns
df_res <- map2_df(df_comb$X1, df_comb$X2, f)
Error: cannot allocate vector of size 7.9 Gb
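
To give a sense of scale, here is a rough back-of-envelope sketch of why a single pair of the large factors can already blow up (which pair actually triggers the error is my assumption; the arithmetic is only meant to show the order of magnitude):

## one dense double-precision matrix of dim nlevels(x) by nlevels(y),
## e.g. the expected-count matrix computed inside chisq.test() (sketch)
n_x <- 114942                        # forename levels
n_y <- 9164                          # partnername levels (assumed pairing)
as.numeric(n_x) * n_y * 8 / 1024^3   # roughly 7.9 GiB for that one matrix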

2. Imputation with mice

library(mice)
impute <- mice(df, m = 3, nnet.MaxNWts = 3000, seed = 123, meth = 'cart')
Error: cannot allocate vector of size 100.1 Gb
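
Before the mice() call it may help to see where the memory is actually going; a minimal inspection sketch using only base R:

## which columns need imputing at all, and which columns are heaviest (sketch)
colSums(is.na(df))
sort(sapply(df, object.size), decreasing = TRUE)[1:5]   # bytes per column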

3. XGBoost model via caret

library(caret)
library(xgboost)
set.seed(123)

## tune.grid and train.control are defined earlier (not shown)
caret.cv.Test <- train(HasProduct ~ .,
                       data = roseTest[, !(colnames(roseTest) %in% c("id"))],
                       method = "xgbTree",
                       tuneGrid = tune.grid,
                       trControl = train.control)

Error: cannot allocate vector of size 100.1 Gb
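
If I understand the formula interface correctly (this is an assumption on my part), train(HasProduct ~ ., ...) first builds a dense dummy-coded design matrix, so here is a rough sketch of its size, without actually building it:

## approximate size in GiB of a dense model matrix for HasProduct ~ . (sketch)
x <- roseTest[, !(colnames(roseTest) %in% c("id", "HasProduct"))]
n_dummy   <- sum(sapply(Filter(is.factor, x), nlevels) - 1)  # one column per non-reference level
n_numeric <- sum(sapply(x, is.numeric))
as.numeric(nrow(x)) * (n_dummy + n_numeric + 1) * 8 / 1024^3 # +1 for the intercept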

I have tried working on half of the dataset at a time to reduce the number of observations, but it makes no difference. I have also tried to expand the memory limit:

> memory.size()
[1] 29882.43
> memory.limit() 
[1] 32447
> memory.limit(size=500000)
[1] 5e+05

It did not have any effect.

I have also run the garbage collector:

> gc()
            used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   4570479  244.1   12169265   650.0   13298034   710.2
Vcells 144700496 1104.0 5750893687 43875.9 5988830217 45691.2

Nothing has made any difference; the only thing that has worked is excluding the four high-cardinality variables listed above (forename, surname, postcode and partnername).

Do you know of a way to keep those variables without exhausting the RAM?

Please help if you can,

Cheers

  • You might check this: https://stackoverflow.com/questions/5171593/r-memory-management-cannot-allocate-vector-of-size-n-mb They suggest some actions (some of which you have already performed), and it is Windows-specific. – Fabio Marroni Nov 05 '20 at 11:39
  • Hey Fabio, thanks. I found that post when I started facing the issue and tried all the suggested solutions, except the sparse matrix, because I need to build a tree model on these data. Also, many in that post point out that 64-bit R would sort the problem, but clearly it doesn't :( – VR88 Nov 05 '20 at 14:38
  • Ouch, that's a pity! Could it be that you simply run out of RAM? How much RAM do you have on your machine? In addition, maybe you can try what is suggested in the accepted answer of this post: https://stackoverflow.com/questions/1395229/increasing-or-decreasing-the-memory-available-to-r-processes. Basically it suggests increasing both --max-mem-size and --max-vsize, and apparently vsize affects the vector heap size, whatever the heap size is! And apparently the Windows implementation of R has limitations on memory management, sorry!! http://127.0.0.1:19763/library/base/html/Memory-limits.html – Fabio Marroni Nov 06 '20 at 08:33
