Merging tables that are sorted differently with R

Question

I have 2 big tables. One with identifiers (unsorted), another with a list of identifiers (containing all which are in the first table) plus the associated values for one variable. I want to add a column to my first table containing the associated values that are in the second table. Is there an smart way to proceed using implemented functions of R ?

i.e.

table 1
id
8979786
62782
6268768
6776566

table 2
id        var
1          5
2          2
3          NA
…
9999999    6

and the result should be

table1
id       var
8979786   5
62782     NA
6268768   7
4776566   4

Thanks in advance

Chase · Answer 1 · 2011-04-22T14:44:33.307

So the id column is in both tables? You can merge them together: merge(table1, table2, sort = FALSE). There are lots of options to explore for merge that let you emulate different types of joins, similar to inner, left, right, and outer joins in SQL. I added the additional parameter sort here to preserve the original order of table1.

If there are ids in table1 but not table 2 and you want to show NAs for those, add all.x = TRUE as a parameter. This is equivalent to a left join. all.y is a right join, and all = TRUE is equivalent to a full outer join.

Reproducible example:

> set.seed(1)
> table1 <- data.frame( id = sample(1:5, 5, FALSE))
> table1
  id
1  2
2  5
3  4
4  3
5  1
> table2 <- data.frame( id = 1:5, var = rnorm(5))
> table2
  id        var
1  1  1.2724293
2  2  0.4146414
3  3 -1.5399500
4  4 -0.9285670
5  5 -0.2947204
> merge(table1, table2, sort = FALSE)
  id        var
1  2  0.4146414
2  5 -0.2947204
3  4 -0.9285670
4  3 -1.5399500
5  1  1.2724293

or `table2[match(table1$id,table2$id),]` or even just `table2[table1$id,]` if id really goes from 1 to n. — Joris Meys, Apr 22 '11 at 15:10
merge generalises to more id columns much better than match. — hadley, Apr 22 '11 at 15:56

score 3 · Answer 2 · answered Jan 17 '12 at 14:11

Here is a data.table way of doing this, in case the data is big and speed is an issue. For more information, refer to the help page of ?data.table:

When i is a data.table, x (that is the outer data.table) must have a key. i (that is the inner data.table) is joined to x using the key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key. The match is a binary search in compiled C in O(log n) time. If i has less columns than x's key then many rows of x may match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns and a binary merge
of the two tables is carried out.

Note that I adjusted the sample data provided by Chase a little to make certain points about the matching in data.table more obvious:

require(data.table)
#Version 1.7.7
set.seed(1)
table1 <- data.table(id = sample(3:7, 5, FALSE), var1 = rnorm(5), key="id")
table2 <- data.table(id = 5:10, var2 = rnorm(6), key="id")

#Default: If id in table 1 is not in table 2, return NA
table2[table1]
#      id         var2       var1
# [1,]  3           NA -0.2947204
# [2,]  4           NA  1.2724293
# [3,]  5 -0.005767173 -0.9285670
# [4,]  6  2.404653389 -1.5399500
# [5,]  7  0.763593461  0.4146414

#If one wants to get rid of the NAs
table2[table1, nomatch=0]
#      id         var2       var1
# [1,]  5 -0.005767173 -0.9285670
# [2,]  6  2.404653389 -1.5399500
# [3,]  7  0.763593461  0.4146414

#Or the other way around: get all ids of table 2
table1[table2]
#      id       var1         var2
# [1,]  5 -0.9285670 -0.005767173
# [2,]  6 -1.5399500  2.404653389
# [3,]  7  0.4146414  0.763593461
# [4,]  8         NA -0.799009249
# [5,]  9         NA -1.147657009
# [6,] 10         NA -0.289461574

The obligatory speed test:

set.seed(10)
df1 <- data.frame(id = sample(1:5e6, 5e6, FALSE))
df2 <- data.frame(id = sample(1:5e6, 5e6, FALSE), var = rnorm(5e6))
system.time(df_solution <- merge(df1, df2, sort = TRUE))
#    user  system elapsed 
#   33.10    0.32   33.54
merge_dt <- function(df1, df2) {
  dt1 <- setkey(as.data.table(df1), "id")
  dt2 <- setkey(as.data.table(df2), "id")
  return(dt1[dt2])
}
system.time(dt_solution <- merge_dt(df1, df2))
#    user  system elapsed 
#   12.94    0.01   12.95 
all.equal(df_solution, as.data.frame(dt_solution))
#[1] TRUE

And my usual disclaimer: I'm still learning a lot about this package as well, so you find better information at the package homepage.

score 2 · Answer 3 · answered Jan 15 '12 at 11:22

2

I just implemented a function that solves this issue (of merging two data.frame objects while keeping the order to be by one of the two objects), you can see the code and examples for its use here:

http://www.r-statistics.com/2012/01/merging-two-data-frame-objects-while-preserving-the-rows-order/

answered Jan 15 '12 at 11:22

Tal Galili

24,605
44
129
187

Hi Tal. Please don't forget data.table. – Matt Dowle Jan 16 '12 at 15:26
Sorry Matthew :) You are invited to leave a comment here (or on my post) on how to solve this problem with data.table! – Tal Galili Jan 16 '12 at 21:49

Merging tables that are sorted differently with R

3 Answers3