
I am running into memory problems and need help fixing them. I cannot publish the exact code or results here because of my company's confidentiality rules, so I have used dummy references below.

There are two data frames:

data frame A looks like
id  x_1  x_2   x_3  x_4
1  data   data  data data
2  data   data  data data
3  data   data  data data

data frame B looks like
id_1  x_1  x_2   x_3  x_4
1    data   data  data data
2    data   data  data data
3    data   data  data data

The goal is to get every combination of the first columns of A and B:

id  id_1
1    1
1    2
1    3
2    1
2    2
2    3
3    1
3    2
3    3

So I used expand.grid:

myLoadedData1 <- expand.grid(A$id,B$id)

expand.grid worked fine when A and B had 8000 records each.

Because of growth that cannot be avoided, both data frames now have 50000 records, and we see the following error:

 myLoadedData1 <- expand.grid(A$id,B$id)  
Error: cannot allocate vector of size 7.1 Gb
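
For scale, the size of the full cross join can be estimated with a little arithmetic. This is only a rough sketch: it assumes exactly 50,000 ids in each data frame and ids stored as either integer or double, and the variable names are illustrative. The figures are estimates, not a reproduction of the 7.1 Gb in the error.

```r
# Rough size estimate for the full cross join.
# Assumption: 50,000 ids in each data frame.
n_a <- 50000
n_b <- 50000

# Numeric literals are doubles in R, so this product does not overflow.
n_rows <- n_a * n_b                                # 2.5e9 rows

# The row count is larger than the largest R integer (2147483647).
exceeds_int_max <- n_rows > .Machine$integer.max

# Memory for ONE result column, before any temporary copies.
gib_per_int_col <- n_rows * 4 / 2^30               # about 9.3 GiB if ids are integer
gib_per_dbl_col <- n_rows * 8 / 2^30               # about 18.6 GiB if ids are double
```

Since the row count grows as the product of the two lengths, going from 8000 to 50000 records per data frame multiplies the number of pairs by roughly 39.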

Please help: the project is stuck, and I need ideas to move past this. My session info is below.

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plyr_1.8.4        dplyr_0.7.7       odbc_1.1.6        data.table_1.11.8

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17     assertthat_0.2.0 R6_2.2.2         DBI_1.0.0        magrittr_1.5     pillar_1.2.3     rlang_0.2.1      blob_1.1.1      
 [9] bindrcpp_0.2.2   tools_3.5.1      bit64_0.9-7      glue_1.2.0       purrr_0.2.5      bit_1.1-14       hms_0.4.2        yaml_2.1.19     
[17] compiler_3.5.1   pkgconfig_2.0.1  tidyselect_0.2.4 bindr_0.1.1      tibble_1.4.2    
  • 1
    If all you are doing is two numeric columns, can you use an external looping mechanism, either two `for` loops or two nested `lapply(1:8000, function(i) lapply(1:9000, function(j) { ...yourcodehere...}))`? (The only other option would be to use an analogy of a python's lazy generators. A prev answer of mine: https://stackoverflow.com/a/36144255/3358272, with a documented gist https://gist.github.com/r2evans/e5531cbab8cf421d14ed.) – r2evans Oct 31 '18 at 16:49
  • 1
    @r2evans: Thank you for helping, I tried the lapply approach but it failed stating maximum memory reached. I am now trying the for loop approach and will keep you posted of the results. – Shwetha Krishnan Nov 01 '18 at 14:28
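
One way to follow the looping suggestion without building all pairs at once is to expand one slice of A's ids at a time and keep only a small result per slice. The following is a minimal sketch, not a tested solution for the real data: `cross_in_chunks`, `fun`, and `chunk_size` are hypothetical names, and it assumes the downstream step can work on one chunk of pairs at a time and returns something small. If `fun` returns the pairs themselves, the memory problem comes back.

```r
# Hypothetical helper: apply `fun` to the cross join of a_ids and b_ids,
# one chunk of a_ids at a time, keeping only the per-chunk results.
cross_in_chunks <- function(a_ids, b_ids, fun, chunk_size = 1000L) {
  if (length(a_ids) == 0L) return(list())
  starts <- seq(1L, length(a_ids), by = chunk_size)
  lapply(starts, function(s) {
    idx <- s:min(s + chunk_size - 1L, length(a_ids))
    # expand.grid varies its first argument fastest, so b_ids goes first
    # to get id_1 cycling within each id; columns are then reordered.
    pairs <- expand.grid(id_1 = b_ids, id = a_ids[idx],
                         KEEP.OUT.ATTRS = FALSE)[, c("id", "id_1")]
    fun(pairs)  # only this (ideally small) result is kept
  })
}

# Usage with the dummy data from the question (3 ids each).
A <- data.frame(id = 1:3)
B <- data.frame(id_1 = 1:3)

# Count pairs per chunk instead of storing them.
res <- cross_in_chunks(A$id, B$id_1, nrow, chunk_size = 2L)
total_pairs <- sum(unlist(res))

# Inspect the first chunk to check the row order.
first_chunk <- cross_in_chunks(A$id, B$id_1, identity, chunk_size = 2L)[[1]]
```

With 50,000 ids on each side and the default `chunk_size` of 1000, each chunk would hold 50 million pairs, so the chunk size may need tuning to the available memory.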
