
I need to update a large data frame (a) several times a day with a smaller data frame (b). Both data frames have the same variables and the same column classes; the only differences are the number of observations and the observations themselves. I've spent hours on this site and others trying to find a solution. I finally got merge() to work for small data frames (313 observations of 96 variables each).

merge(a, b, all = TRUE)

However, when I try to run the same operation on my larger data frame (~1.5 million observations of 96 variables), I get

Error: cannot allocate vector of size 1.6 Gb.

I have ~12 Gb of free physical memory according to the Windows Task Manager. I call gc() first to free as much memory as possible, but the merge still fails. Is there another function that will simply add observations to an existing data frame? I tried a few others, but the result wasn't a data frame with the same structure.

If you can't tell already, I'm new to R (and this forum). I started learning Stata, and someone convinced me to switch to R before I got in too deep. In Stata, this was an easy operation:

clear
use a
append using b

That got the job done in Stata without any issues, and it was quick (a few seconds at most).

Can someone please help? Thanks!

  • Try with package `dplyr`, using `dplyr::inner_join()`. It should be more efficient and faster for a big merge operation. – Alex Oct 21 '14 at 00:18
  • Also, `gc()` is not a good way of clearing memory. I recommend restarting your R session before trying the merge. While the merge runs, also watch your memory usage in the Windows Task Manager. – Alex Oct 21 '14 at 00:19

2 Answers


See the answer to this question: R: how to rbind two huge data-frames without running out of memory

If your data is in a SQLite database, you can use the sqldf package. Otherwise, you should look into the data.table package.
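For example, a minimal sketch of the data.table route, assuming a and b are plain data frames with identical columns:

library(data.table)

# rbindlist() stacks the data frames row-wise in a single pass, avoiding
# the copying overhead that makes merge() blow up at ~1.5 million rows
a <- rbindlist(list(a, b))

# rbindlist() returns a data.table; convert back to a plain data.frame
# if the rest of your code expects one
setDF(a)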

– Tom

I ended up going with a relatively simple solution: I broke the larger data frame into a few smaller data frames and just used rbind(a, b). While this took some extra code on my part to handle the new data structure, it let me work with much smaller data frames and actually sped up processing by removing extraneous data when it wasn't needed.
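A rough sketch of that approach, where a hypothetical grouping column year stands in for however you split your data:

# Hypothetical illustration: hold the data as smaller pieces (split on an
# assumed 'year' column) and append each update only where it belongs
pieces  <- split(a, a$year)
updates <- split(b, b$year)
for (k in names(updates)) {
  pieces[[k]] <- rbind(pieces[[k]], updates[[k]])  # rbind(NULL, df) is just df
}

Each rbind() then only touches one small piece instead of the full 1.5-million-row data frame.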

– JPG