
First, this question is NOT about

Error: cannot allocate vector of size n

I take this error as a given; I am trying to structure my code so that it does not occur.

  • I have a dataset of 3000+ variables and 120000 cases

  • All columns are numeric

  • I need to replace NA values with zero

  • If I reassign the NA values to 0 across the entire dataset at once, I get the memory allocation error.

  • So I am reassigning the values to zero one column at a time:

    resetNA <- function(results)
    {
        for (i in 1:ncol(results))
        {
            # the first 10 columns are deliberately skipped
            if (i > 10)
            {
                results[, i][is.na(results[, i])] <- 0
            }
        }
        print(head(results))
        results  # return the modified copy; R does not modify the argument in place
    }
    

After about 1000 columns, I still get the memory allocation error.

Now, this seems strange to me. Somehow, memory allocation keeps growing after each loop iteration, and I don't see why that should be the case.

I also tried calling the garbage collection function (gc()) after each loop iteration, but I still got the memory allocation error.

Can someone explain how I can manage the variables to avoid this incremental increase in memory allocation (after all, the size of the data frame has not changed)?
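For reference, base R's tracemem() does show the data frame being duplicated on this kind of replacement. A small demonstration with hypothetical data (tracemem needs an R build with memory profiling, which the standard binaries have):

    # tracemem() prints a message each time R duplicates the data frame
    df <- data.frame(matrix(sample(c(1, NA), 1e5, replace = TRUE), ncol = 100))
    tracemem(df)
    df[, 1][is.na(df[, 1])] <- 0   # this replacement reports copies of df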

  • split your data into lists, apply the function to the lists, then recombine – B Williams May 11 '17 at 19:23
  • Thanks, I'll give it a try. But can you explain why this loop causes incremental memory allocation? – Jake May 11 '17 at 19:30
  • also, please feel free to post the answer and I will upvote. I hate posting answers to my own questions, but hate leaving a question unanswered even more – Jake May 11 '17 at 19:31
  • this may help http://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table – B Williams May 11 '17 at 19:46
  • Every assignment operation will create a minimum of 2 copies of the entire object. Sometimes garbage collection needs to be called explicitly: `?gc` – IRTFM May 11 '17 at 20:06

2 Answers


As noted in the comments above, the answer is here: Fastest way to replace NAs in a large data.table

I tried it and it works very well.
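The key idea in the linked answer is data.table's set(), which assigns by reference, so no copy of the table is made on each pass. A minimal sketch, assuming `results` is the data frame from the question:

    library(data.table)
    dt <- as.data.table(results)   # one-time conversion of the original frame
    for (j in seq_len(ncol(dt))) {
        # set() updates the column in place (by reference), so memory use stays flat
        set(dt, i = which(is.na(dt[[j]])), j = j, value = 0)
    }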


I have learned an important general principle about R memory usage.

See this discussion.

Wherever possible, avoid looping over a data frame with `[, i] <-` assignment, which copies the whole object. Use lapply instead: a data frame is a list of columns, so lapply applies the function to one column at a time and returns a list, which you then convert back to a data frame.

The following example recodes numeric frequencies to a categorical variable. It is fast and does not increase memory usage.

    # Recode each numeric column to "Yes"/"No" without copying the whole frame
    list1 <- lapply(mybigdataframe, function(x) ifelse(x > 0, "Yes", "No"))
    newdf1 <- as.data.frame(list1)
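Applied to the original NA-to-zero problem, the same pattern would look something like this (a sketch, reusing the hypothetical `mybigdataframe` name from above):

    # Replace NAs with 0 column by column via lapply, then rebuild the frame
    list2 <- lapply(mybigdataframe, function(x) { x[is.na(x)] <- 0; x })
    newdf2 <- as.data.frame(list2)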