After running this code:

t1 <- Sys.time()
df.m <- left_join(df.h, daRta3, by = c("year", "month", "MA", "day"))
t2 <- Sys.time()
difftime(t2, t1)

I get this error:

Error: std::bad_alloc

The data frame that I am trying to create has 74495*2695 rows, i.e. about 200.10^6 rows.

The computer on which I run the code has 20 GB of RAM.

I tried memory.limit(), but it did not solve my issue.

Andrew
DoubleMD
  • The physical amount of RAM is not the only thing that's relevant. Are you compiling as a 64-bit application? – Algirdas Preidžius Jul 25 '16 at 14:24
  • Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap Jul 25 '16 at 14:32
  • 74495*2695=200,764,025. Besides the calculation also a suggestion: a possible solution is to 'cut' your data sets in multiple blocks and apply the left_join on the different sets. – StatMan Jul 25 '16 at 14:46
  • @Algirdas Preidžius Yes, I am compiling as a 64-bit application. @MarcelG That is how I finally managed to solve the problem: I divided my database into multiple datasets and did the left join on each of them. It is still slow, but it works. Thank you for your help. – DoubleMD Jul 27 '16 at 14:31
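
The block-wise approach described in the comments can be sketched roughly as follows. This is only an illustration under assumptions: `left` and `right` are hypothetical toy stand-ins for `df.h` and `daRta3`, the split is done on `month` (any key column would do), and dplyr 1.0 or later is assumed.

```r
library(dplyr)

keys <- c("year", "month", "MA", "day")

# Hypothetical toy tables standing in for df.h and daRta3
left  <- tibble(year = 2016, month = rep(1:3, each = 2), MA = "a",
                day = 1:6, h = seq(0.1, 0.6, by = 0.1))
right <- tibble(year = 2016, month = rep(1:3, each = 2), MA = "a",
                day = 1:6, v = 11:16)

# Split the left table into blocks by one key column (assumption: month)
blocks <- split(left, left$month)

# Join each block separately, then stack the partial results
joined <- bind_rows(
  lapply(blocks, function(block) left_join(block, right, by = keys))
)
```

Note that the combined result still has to fit in memory. If it does not, each joined block could instead be written to disk inside the loop rather than collected with `bind_rows`.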

1 Answer

  1. Examine cardinality of your join key

    • Is the combination c("year","month","MA","day") unique in both df.h and daRta3? If it is not, each duplicated key multiplies the number of rows in the result.
    • What are the most frequent values?
  2. Check for NA values in the key columns. left_join can treat NA values as equal or as different; by default NA matches NA, so many NA keys multiply the rows:

    > tibble(x = c(NA, NA, NA)) %>% left_join(., ., by = 'x')
    # A tibble: 9 x 1
      x    
      <lgl>
    1 NA   
    2 NA   
    3 NA   
    4 NA   
    5 NA   
    6 NA   
    7 NA   
    8 NA   
    9 NA   
    > tibble(x = c(NA, NA, NA)) %>% left_join(., ., by = 'x', na_matches = 'never')
    # A tibble: 3 x 1
      x    
      <lgl>
    1 NA   
    2 NA   
    3 NA
    
  3. If the order and values of c("year","month","MA","day") can be guaranteed to be the same in both data frames, then a simple cbind or bind_cols might be an efficient solution.
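
As a rough illustration of the first point, the size of the join result can be estimated from per-key counts before running the join itself. The tables below are hypothetical toy data, not the asker's `df.h` and `daRta3`, and the sketch assumes dplyr 1.0 or later (for `across` inside `group_by`).

```r
library(dplyr)

keys <- c("year", "month", "MA", "day")

# Hypothetical toy tables; the key with day == 1 is duplicated on both sides
left  <- tibble(year = 2016, month = 1, MA = "a", day = c(1, 1, 2))
right <- tibble(year = 2016, month = 1, MA = "a", day = c(1, 1, 1, 2), v = 1:4)

# Number of rows per key combination on each side
left_counts <- left %>%
  group_by(across(all_of(keys))) %>%
  summarise(n_left = n(), .groups = "drop")
right_counts <- right %>%
  group_by(across(all_of(keys))) %>%
  summarise(n_right = n(), .groups = "drop")

# A left join yields n_left * n_right rows per matched key and
# n_left rows for keys that have no match on the right
expected_rows <- left_counts %>%
  left_join(right_counts, by = keys) %>%
  summarise(rows = sum(as.numeric(n_left) * coalesce(n_right, 1L))) %>%
  pull(rows)

# Keys that occur more than once, most frequent first
duplicated_keys <- right_counts %>%
  filter(n_right > 1) %>%
  arrange(desc(n_right))
```

If `expected_rows` comes out far larger than `nrow(left)`, the keys are not unique on the right-hand side, and the join will blow up in the way the `std::bad_alloc` error suggests.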

mys