
I'm reworking an existing process that works with two data frames.

DF1 - ~65k rows, 15 columns
DF2 - ~300k rows, 270 columns

We are merging by zip code as follows:

  newdf <- merge(df1, df2, by.x = "ZipA", by.y = "ZipB")

This is slow and, depending on what else is running on the EC2 instance at the time, may terminate. Important note: zips are NOT unique in either DF (this is by design). What other options would people recommend?

sqldf? data.table? sparklyr (we have a Spark back-end set up, but nobody uses it)?
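
For what it's worth, the data.table route would look roughly like the sketch below (assuming the same df1/df2 and column names as above; not benchmarked). The allow.cartesian flag is there because the duplicated zips make the join many-to-many:

  library(data.table)

  # setDT() converts the existing data frames in place, avoiding copies
  setDT(df1)
  setDT(df2)

  # With both inputs as data.tables, merge() dispatches to data.table's
  # faster merge method; allow.cartesian permits the many-to-many matches
  # that duplicated zips produce
  newdt <- merge(df1, df2, by.x = "ZipA", by.y = "ZipB",
                 allow.cartesian = TRUE)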

I'm really at a loss as to how to make this more efficient, but I'm afraid we might just be stuck due to the structure of the data.

DataDog
  • You'll find benchmarks in the link. – Henrik Jun 27 '18 at 11:32
  • Look into this thread: [How to join (merge) data frames (inner, outer, left, right)?](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right). There are multiple answers with different packages etc. – LAP Jun 27 '18 at 11:33
  • Don't know why this didn't come up in my search, this is perfect, thanks! – DataDog Jun 27 '18 at 12:07

0 Answers