
I'm reworking an existing process that works with two data frames.

DF1 - ~65k rows, 15 columns
DF2 - ~300k rows, 270 columns

We are merging by zip code as follows:

  newdf <- merge(df1, df2, by.x = "ZipA", by.y = "ZipB")

This is slow and, depending on what else is running on the EC2 instance at the time, may terminate. Important note: zips are NOT unique in either DF (this is by design). What other options would people recommend?

sqldf? data.table? sparklyr (we have a Spark back-end set up, but nobody uses it)?
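
For what it's worth, the data.table route would look roughly like the sketch below (assuming the same df1/df2 and column names as above; not benchmarked). The allow.cartesian flag is there because the duplicated zips make the join many-to-many:

  library(data.table)

  # setDT() converts the existing data frames in place, avoiding copies
  setDT(df1)
  setDT(df2)

  # With both inputs as data.tables, merge() dispatches to data.table's
  # faster merge method; allow.cartesian permits the many-to-many matches
  # that duplicated zips produce
  newdt <- merge(df1, df2, by.x = "ZipA", by.y = "ZipB",
                 allow.cartesian = TRUE)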

I'm really at a loss as to how to make this more efficient, but I'm afraid we might just be stuck due to the structure of the data.

DataDog
  • You'll find benchmarks in the link. – Henrik Jun 27 '18 at 11:32
  • Look into this thread: [How to join (merge) data frames (inner, outer, left, right)?](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right). There are multiple answers with different packages etc. – LAP Jun 27 '18 at 11:33
  • Don't know why this didn't come up in my search, this is perfect, thanks! – DataDog Jun 27 '18 at 12:07

0 Answers