0

I have two data frames--one is huge (over 2 million rows) and one is smaller (around 300,000 rows). The smaller data frame is a subset of the larger one. The only difference is that the larger one has an additional attribute that I need to add to the smaller one.

Specifically, the attributes for the large data frame are (Date, Time, Address, Flag) and the attributes for the small data frame are (Date, Time, Address). I need to get the correct corresponding Flag value somehow into the smaller data frame for each row. The final size of the "merged" data frame should be the same as my smaller one, discarding the unused rows from the large data frame.

What is the best way to accomplish this?

Update: I tested the merge function with the following:

new<-merge(data12, data2, by.x = c("Date", "Time", "Address"), 
           by.y=c("Date", "Time", "Address"))

and

new<-merge(data12, data2, by = c("Date", "Time", "Address"))

both return an empty data frame (new) with the right number of attributes as well as the following warning message:

Warning message:In `[<-.factor`(`*tmp*`, ri, value = c(15640, 15843, 15843, 15161,  : invalid factor level, NAs generated
Dr. Mike
  • 2,451
  • 4
  • 24
  • 36
whistler
  • 876
  • 2
  • 15
  • 31

2 Answers2

5
    R> df1 = data.frame(a = 1:5, b = rnorm(5))
    R> df1
      a           b
    1 1 -0.09852819
    2 2 -0.47658118
    3 3 -2.14825893
    4 4  0.82216912
    5 5 -0.36285430
    R> df2 = data.frame(a = 1:10000, c = rpois(10000, 6))
    R> head(df2)
      a c
    1 1 2
    2 2 4
    3 3 5
    4 4 3
    5 5 3
    6 6 8
    R> merge(df1, df2)
      a           b c
    1 1 -0.09852819 2
    2 2 -0.47658118 4
    3 3 -2.14825893 5
    4 4  0.82216912 3
    5 5 -0.36285430 3
Jake Burkhead
  • 6,435
  • 2
  • 21
  • 32
  • I tried the simple merge example you posted, but on my data and got the same warning message I posted in my question. Again, the returned data frame was empty. I have no idea what is causing the problem with my data, but I get how to do the merge now. Have you seen such a warning error before? – whistler Jul 13 '13 at 02:49
  • Nevermind! I figured out why I'm getting that error. One of the columns has a different data type! – whistler Jul 13 '13 at 02:52
  • @whistler. Congrats! See this for future reference http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Jake Burkhead Jul 13 '13 at 02:53
3

Perhaps plyr is a more intuitive package for this operation. What you need is a SQL inner join. I believe this approach is clearer than merge().

Here is a simple example of how you would use join() with data sets of your size.

library(plyr)

id = c(1:2000000)
rnormal <- rnorm(id)
rbinom <- rbinom(2000000, 5,0.5)


df1 <- data.frame(id, rnormal, rbinom)
df2 <- data.frame(id = id[1:300000], rnormal = rnormal[1:300000])

You would like to add rbinom to df2

joined.df <- join(df1, df2, type = "inner")

Here is the performance of join() vs merge()

system.time(joined.df <- join(df1, df2, type = "inner"))
Joining by: id, rnormal
   user  system elapsed 
  22.44    0.53   22.80 
system.time(merged.df <- merge(df1, df2))
   user  system elapsed 
 26.212   0.605  30.201 
marbel
  • 7,560
  • 6
  • 49
  • 68