Idiomatic R method for "left joining" two data frames

Question

I have two data frames that both have a column containing a factor like the following:

> head(test.data)
      var0     var1       date  store
1 109.5678 109.5678 1990-03-30 Store1
2 109.3009 108.4261 1990-06-30 Store1
3 108.8262 106.2517 1990-09-30 Store1
4 108.2443 108.6417 1990-12-30 Store1
5 109.5678 109.5678 1991-03-30 Store1
6 109.3009 108.4261 1991-06-30 Store1
> summary(test.data)
      var0             var1              date                store   
 Min.   : -9.72   Min.   : -2.297   Min.   :1990-03-30   Store1 : 8  
 1st Qu.: 68.32   1st Qu.: 71.305   1st Qu.:1990-09-07   Store2 : 8  
 Median :102.19   Median :101.192   Median :1991-02-13   Store3 : 8  
 Mean   :101.09   Mean   :103.042   Mean   :1991-02-13   Store4 : 8  
 3rd Qu.:151.22   3rd Qu.:151.940   3rd Qu.:1991-07-23   Store5 : 8  
 Max.   :196.55   Max.   :201.099   Max.   :1991-12-30   Store6 : 8  
                                                         (Other):48  
>
> head(test.clusters)
   store cluster
1 Store1       A
2 Store2       C
3 Store3       A
4 Store4       B
5 Store5       D
6 Store6       A
>
> summary(test.clusters)
     store   cluster
 Store1 :1   A:5    
 Store2 :1   B:4    
 Store3 :1   C:2    
 Store4 :1   D:1    
 Store5 :1          
 Store6 :1          
 (Other):6          
>

I want to add a column to test.data containing each row's cluster, based on their shared store. I am currently doing this using a doubly nested loop:

new_col <- rep(test.clusters$cluster[1], length(test.data$store))
for (i in seq(test.data$store)){
  for (j in seq(test.clusters$store)){
    if (test.data$store[i] == test.clusters$store[j]){
      new_col[i] <- test.clusters$cluster[j]
      break
    }       
  }
}
test.data$cluster <- new_col

This is very verbose, grossly inefficient and frankly ugly. What is the idiomatic method for doing this in R?

score 8 · Answer 1 · answered Oct 06 '10 at 12:37

8

Use the join function from plyr. It uses match internally, so should be fast (4-10x faster than merge).

answered Oct 06 '10 at 12:37

hadley

102,019
32
183
245

Note that this only works with the current version of plyr. If you're stuck on an older version of R, for whatever reason, this won't be an option. – Harlan Oct 06 '10 at 15:26

score 7 · Answer 2 · answered Oct 06 '10 at 10:58

7

Use merge function.

answered Oct 06 '10 at 10:58

mbq

18,510
6
49
72

3

+1 See also http://stackoverflow.com/questions/3515611/merge-command-in-r and http://stackoverflow.com/questions/2232699/r-how-to-do-a-data-table-merge-operation for example, or do a search on "merge [r]" on the site. – Joris Meys Oct 06 '10 at 11:04
I'll look into it - didn't know what keyword to search for! – fmark Oct 06 '10 at 11:49

score 2 · Answer 3 · answered Oct 06 '10 at 12:18

2

I recommend match. It will be much faster then merge.

It should look like this:

test.data$cluster <- test.clusters$cluster[match(test.data$store, test.clusters$store)]

answered Oct 06 '10 at 12:18

Marek

49,472
15
99
121

Idiomatic R method for "left joining" two data frames

3 Answers3