
I have two dataframes:

id      dates
MUM-1  2015-07-10
MUM-1  2015-07-11
MUM-1  2015-07-12
MUM-2  2014-01-14
MUM-2  2014-01-15
MUM-2  2014-01-16
MUM-2  2014-01-17

and:

id      dates      field1  field2
MUM-1  2015-07-10     1       0
MUM-1  2015-07-12     2       1
MUM-2  2014-01-14     4       3
MUM-2  2014-01-17     0       1

merged data:

id      dates        field1   field2
MUM-1  2015-07-10      1         0
MUM-1  2015-07-11      na        na
MUM-1  2015-07-12      2         1
MUM-2  2014-01-14      4         3
MUM-2  2014-01-15      na        na
MUM-2  2014-01-16      na        na
MUM-2  2014-01-17      0         1   

code: merge(x= df1, y= df2, by= 'id', all.x= T)

I am using merge, but since both dataframes are very large, it is taking too long to process. Is there a faster alternative to the merge function, maybe in dplyr? Both dataframes have more than 900K rows.

– Dheeraj Singh

4 Answers


Instead of using merge with data.table, you can also simply join as follows:

setDT(df1)
setDT(df2)

df2[df1, on = c('id','dates')]

this gives:

> df2[df1, on = c('id','dates')]
      id      dates field1 field2
1: MUM-1 2015-07-10      1      0
2: MUM-1 2015-07-11     NA     NA
3: MUM-1 2015-07-12      2      1
4: MUM-2 2014-01-14      4      3
5: MUM-2 2014-01-15     NA     NA
6: MUM-2 2014-01-16     NA     NA
7: MUM-2 2014-01-17      0      1

Doing this with dplyr:

library(dplyr)
dplr <- left_join(df1, df2, by=c("id","dates"))

As mentioned by @Arun in the comments, a benchmark is not very meaningful on a small dataset with seven rows. So let's create some bigger datasets:

dt1 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
                          seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")))
dt2 <- data.table(id=gl(2, 730, labels = c("MUM-1", "MUM-2")),
                  dates=c(seq(as.Date("2010-01-01"), as.Date("2011-12-31"), by="days"),
                          seq(as.Date("2013-01-01"), as.Date("2014-12-31"), by="days")),
                  field1=sample(c(0,1,2,3,4), size=730, replace = TRUE),
                  field2=sample(c(0,1,2,3,4), size=730, replace = TRUE))
dt2 <- dt2[sample(nrow(dt2), 800)]

As can be seen, @Arun's approach is slightly faster:

library(rbenchmark)
benchmark(replications = 10, order = "elapsed", columns = c("test", "elapsed", "relative"),
          jaap = dt2[dt1, on = c('id','dates')],
          pavo = merge(dt1,dt2,by="id",allow.cartesian=T),
          dplr = left_join(dt1, dt2, by=c("id","dates")),
          arun = dt1[dt2, c("field1", "field2") := .(field1, field2), on=c("id", "dates")])

  test elapsed relative
4 arun   0.015    1.000
1 jaap   0.016    1.067
3 dplr   0.037    2.467
2 pavo   1.033   68.867

For a comparison on a large dataset, see the answer of @Arun.

– Jaap

I'd update df1 directly by reference as follows:

require(data.table) # v1.9.5+
setDT(df1)[df2, c("field1", "field2") := 
                .(field1, field2), on=c("id", "dates")]

> df1
#       id      dates field1 field2
# 1: MUM-1 2015-07-10      1      0
# 2: MUM-1 2015-07-11     NA     NA
# 3: MUM-1 2015-07-12      2      1
# 4: MUM-2 2014-01-14      4      3
# 5: MUM-2 2014-01-15     NA     NA
# 6: MUM-2 2014-01-16     NA     NA
# 7: MUM-2 2014-01-17      0      1

This would be very memory efficient (and faster), as it doesn't copy the entire object just to add two columns; instead it updates df1 in place.
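
If you want to convince yourself that no copy is made, here is a minimal sketch (assuming df1 and df2 are the small example tables from above): compare the object's memory address before and after the join-update.

library(data.table)

setDT(df1)
before <- address(df1)   # memory address of df1 before the update

df1[df2, c("field1", "field2") := .(field1, field2), on = c("id", "dates")]

identical(before, address(df1))   # TRUE: df1 was modified in place, not copied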


Updated with a slightly bigger dataset than @Jaap's updated benchmark:

set.seed(1L)
dt1 = CJ(id1 = paste("MUM", 1:1e4, sep = "-"), id2 = sample(1e3L))
dt2 = dt1[sample(nrow(dt1), 1e5L)][, c("field1", "field2") := lapply(c(1e3L, 1e4L), sample, 1e5L, TRUE)][]

# @Jaap's answers
system.time(ans1 <- setDT(dt2)[dt1, on = c('id1','id2')])
#    user  system elapsed 
#   0.209   0.067   0.277 
system.time(ans2 <- left_join(setDF(dt1), setDF(dt2), by = c("id1", "id2")))
#    user  system elapsed 
# 119.911   0.530 120.749 

# this answer    
system.time(ans3 <- setDT(dt1)[dt2, c("field1", "field2") := list(field1, field2), on = c("id1", "id2")])
#    user  system elapsed 
#   0.087   0.013   0.100 

sessionInfo()
# R version 3.2.1 (2015-06-18)
# Platform: x86_64-apple-darwin13.4.0 (64-bit)
# Running under: OS X 10.10.4 (Yosemite)

# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] data.table_1.9.5 dplyr_0.4.2     

# loaded via a namespace (and not attached):
# [1] magrittr_1.5   R6_2.1.0       assertthat_0.1 parallel_3.2.1 DBI_0.3.1     
# [6] tools_3.2.1    Rcpp_0.12.0    chron_2.3-45  

Didn't expect dplyr to be ~1200x slower though.

– Arun
  • That's indeed (slightly) faster. Added it to the benchmarks in my answer. – Jaap Aug 07 '15 at 13:28
  • @VeerendraGadekar, `dt` is an entirely new object, allocated separately in memory. It's unnecessary in most cases. The concept of *joins* in data.table is easy to understand as an extension of subsets. For example: `X[a %in% 1:5, b := val]` and `X[Y, b := val]` aren't that different, except that we are used to calling one a subset and the other a join. – Arun Aug 07 '15 at 13:38

You can convert both data frames to data tables, and then perform a merge:

library(data.table)
setDT(df1); setDT(df2)

merge(df1, df2, by = "id", allow.cartesian = TRUE)

The allow.cartesian argument permits the merge when there are repeated values in the key of either of the merged tables, allowing the result to have more rows than the larger of the original tables (see ?data.table).
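
To see why this matters, here is a toy sketch (hypothetical data, not the OP's): when the join key repeats on both sides, every matching pair of rows is returned, so the result can have more rows than either input, and merge() on data.tables refuses by default:

library(data.table)

a <- data.table(id = rep("MUM-1", 3), dates = as.Date("2015-07-10") + 0:2)
b <- data.table(id = rep("MUM-1", 2), field1 = 1:2)

# 3 x 2 matching rows = 6 result rows, more than nrow(a) + nrow(b) = 5,
# so this would error without allow.cartesian = TRUE
merge(a, b, by = "id", allow.cartesian = TRUE)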

– PavoDive

I think the quickest solution at the moment for these cases (huge datasets) is a data.table merge after setting the keys first.

You can also use dplyr's left_join with data.frames, as mentioned before, but it would be good to compare the same command after transforming your data.frames to data.tables. In other words, use dplyr with a data.table structure in the background.

As an example, I'll create two datasets and save each as a data.frame, a data.table with a key, and a data.table without a key. Then I'll perform various merges and time them:

library(data.table)
library(dplyr)

# create and save this dataset as a data.frame and as a data.table
list = seq(1,500000)
random_number = rnorm(500000,10,5)

dataT11 = data.table(list, random_number, key="list") # data.table with a key
dataT12 = data.table(list, random_number) # data.table without key
dataF1 = data.frame(list, random_number)

# create and save this dataset as a data.frame and as a data.table
list = seq(1,500000)
random_number = rnorm(500000,10,5)

dataT21 = data.table(list, random_number, key="list")
dataT22 = data.table(list, random_number)
dataF2 = data.frame(list, random_number)


# check your current data tables (note some have keys)
tables()


# merge the datasets as data.frames and count time
ptm <- proc.time()
dataF3 = merge(dataF1, dataF2, all.x=T)
proc.time() - ptm


# merge the datasets as data.tables by setting the key now and count time
ptm <- proc.time()
dataT3 = merge(dataT12, dataT22, all.x=T, by = "list")
proc.time() - ptm


# merge the datasets as data.tables on the key they have already and count time
ptm <- proc.time()
dataT3 = merge(dataT11, dataT21, all.x=T)
proc.time() - ptm

# merge the datasets as data.tables on the key they have already and count time (alternative)
ptm <- proc.time()
dataT3 = dataT11[dataT21]
proc.time() - ptm



# merge the datasets as data.frames using dplyr and count time
ptm <- proc.time()
dataT3 = dataF1 %>% left_join(dataF2, by="list")
proc.time() - ptm


# merge the datasets as data.tables using dplyr and count time
ptm <- proc.time()
dataT3 = dataT11 %>% left_join(dataT21, by="list")
proc.time() - ptm
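
Instead of wrapping every call in proc.time(), the same comparison can be written more compactly with rbenchmark (the package used in @Jaap's answer above); this is just a sketch assuming the dataF1/dataF2 and dataT11/dataT21 objects created above:

library(rbenchmark)

benchmark(replications = 10, order = "elapsed",
          columns = c("test", "elapsed", "relative"),
          df_merge = merge(dataF1, dataF2, all.x = TRUE),
          dt_keyed = merge(dataT11, dataT21, all.x = TRUE),
          dt_join  = dataT11[dataT21],
          dplyr_df = left_join(dataF1, dataF2, by = "list"),
          dplyr_dt = left_join(dataT11, dataT21, by = "list"))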
– AntoniosK
  • It's always good to post code showing what your approach is. Visitors (and OPs) to SO come looking for _code_! Could you please elaborate (in code) what you meant by "data.table merging" and "setting keys"? That will get the OP out of his problem. – PavoDive Aug 07 '15 at 11:51
  • I think that the type of data and the number of rows and columns are some of the factors that affect speed. So, if that merging process has to be automated (maybe within a bigger process), we should try various merging alternatives and decide which is best for our case. My example has only 2 columns with numeric values. I don't think I can generalise. – AntoniosK Aug 07 '15 at 12:42
  • Indeed it's hard to find out because I didn't show any results, and I apologise for that. I prefer showing code that someone can re-run on their own machine with their own data, without my results (my data, my machine). I've seen that, in my industry, it makes people believe that what they see in an example is the (universal) optimal case, and they don't bother running their own code (or variations of the example code). – AntoniosK Aug 07 '15 at 13:24