87

I want to merge two data frames keeping the original row order of one of them (df.2 in the example below).

Here are some sample data (all values from class column are defined in both data frames):

df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))

If I do:

merge(df.2, df.1)

Output is:

  class object prob
1     1      B  0.5
2     1      C  0.5
3     2      A  0.7
4     2      D  0.7
5     3      F  0.3

If I add sort = FALSE:

merge(df.2, df.1, sort = F)                                                        

Result is:

  class object prob
1     2      A  0.7
2     2      D  0.7
3     1      B  0.5
4     1      C  0.5
5     3      F  0.3

But what I would like is:

  class object prob
1     2      A  0.7
2     1      B  0.5
3     2      D  0.7
4     3      F  0.3    
5     1      C  0.5
double-beep
  • 5,031
  • 17
  • 33
  • 41
DJack
  • 4,850
  • 3
  • 21
  • 45
  • The problem is not well defined. What if you have, in the merged result, more or less rows than df.2? – agstudy Jul 26 '13 at 10:16
  • Impossible because I use df.2 as first variable in merge function... – DJack Jul 26 '13 at 10:23
  • No. it is wrong. still bad defined. take the case you have a class in df;2 that don't exist in df.1.For example : `df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,4,1))` – agstudy Jul 26 '13 at 10:36
  • Ok maybe with the edit is more precised. Anyway, the PAC answer does the job I was looking for. Thx help me to improve the question. – DJack Jul 26 '13 at 11:33
  • possible duplicate of [How can I merge and maintain the row order of one input?](http://stackoverflow.com/questions/11144551/how-can-i-merge-and-maintain-the-row-order-of-one-input) – Waldir Leoncio Mar 27 '15 at 20:41

13 Answers13

66

You just need to create a variable which gives the row number in df.2. Then, once you have merged your data, you sort the new data set according to this variable. Here is an example :

df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
df.2$id  <- 1:nrow(df.2)
out  <- merge(df.2,df.1, by = "class")
out[order(out$id), ]
PAC
  • 5,178
  • 8
  • 38
  • 62
53

Check out the join function in the plyr package. It's like merge, but it allows you to keep the row order of one of the data sets. Overall, it's more flexible than merge.

Using your example data, we would use join like this:

> join(df.2,df.1)
Joining by: class
  object class prob
1      A     2  0.7
2      B     1  0.5
3      D     2  0.7
4      F     3  0.3
5      C     1  0.5

Here are a couple of links describing fixes to the merge function for keeping the row order:

http://www.r-statistics.com/2012/01/merging-two-data-frame-objects-while-preserving-the-rows-order/

http://r.789695.n4.nabble.com/patching-merge-to-allow-the-user-to-keep-the-order-of-one-of-the-two-data-frame-objects-merged-td4296561.html

joran
  • 169,992
  • 32
  • 429
  • 468
user2635373
  • 646
  • 5
  • 5
  • 2
    Show the solutions here, don't just reference them. Links can become invalid fairly fast on the internet.. – stefan Jul 30 '13 at 18:54
  • Each of these links is an in-depth discussion of the problem with the merge function, so I'll just leave them as links. (But instead I will again put in a plug for the join function in the plyr package.) – user2635373 Jul 30 '13 at 20:11
  • 3
    I (and `join`s documentation) would disagree that it is "more flexible" than `merge`: "Join is often faster than merge, although it is somewhat less featureful - it currently offers no way to rename output or merge on different variables in the x and y data frames." – joran May 14 '14 at 22:23
  • *"it allows you to keep the row order of one of the data sets"* To be more precise, as per `?join`: "Unlike merge, preserves the order of x no matter what join type is used. If needed, rows from y will be added to the bottom." – landroni Sep 01 '15 at 20:17
17

You can also check out the inner_join function in Hadley's dplyr package (next iteration of plyr). It preserves the row order of the first data set. The minor difference to your desired solution is that it also preserves the original column order of the first data set. So it does not necessarily put the column we used for merging at the first position.

Using your example above, the inner_join result looks like this:

inner_join(df.2,df.1)
Joining by: "class"
  object class prob
1      A     2  0.7
2      B     1  0.5
3      D     2  0.7
4      F     3  0.3
5      C     1  0.5
alex23lemm
  • 5,475
  • 1
  • 21
  • 23
11

From data.table v1.9.5+, you can do:

require(data.table) # v1.9.5+
setDT(df.1)[df.2, on="class"]

The performs a join on column class by finding out matching rows in df.1 for each row in df.2 and extracting corresponding columns.

Uwe
  • 41,420
  • 11
  • 90
  • 134
Arun
  • 116,683
  • 26
  • 284
  • 387
5

For the sake of completeness, updating in a join preserves the original row order as well. This might be an alternative to Arun's data.table answer if there are only a few columns to append:

library(data.table)
setDT(df.2)[df.1, on = "class", prob := i.prob][]
   object class prob
1:      A     2  0.7
2:      B     1  0.5
3:      D     2  0.7
4:      F     3  0.3
5:      C     1  0.5

Here, df.2 is right joined to df.1 and gains a new column prob which is copied from the matching rows of df.1.

Uwe
  • 41,420
  • 11
  • 90
  • 134
3

The accepted answer proposes a manual way to keep order when using merge, which works most of the times but requires unnecessary manual work. This solution comes on the back of How to ddply() without sorting?, which deals with the issue of keeping order but in a split-apply-combine context:

This came up on the plyr mailing list a while back (raised by @kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:

#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) { 
  col <- ".sortColumn"
  data[,col] <- 1:nrow(data) 
  out <- fn(data, ...) 
  if (!col %in% colnames(out)) stop("Ordering column not preserved by function") 
  out <- out[order(out[,col]),] 
  out[,col] <- NULL 
  out 
} 

So now you can use this generic keeping.order function to keep the original row order of a merge call:

df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
keeping.order(df.2, merge, y=df.1, by = "class")

Which will yield, as requested:

> keeping.order(df.2, merge, y=df.1, by = "class")
  class object id prob
3     2      A  1  0.7
1     1      B  2  0.5
4     2      D  3  0.7
5     3      F  4  0.3
2     1      C  5  0.5

So keeping.order effectively automates the approach in the accepted answer.

Community
  • 1
  • 1
landroni
  • 2,902
  • 1
  • 32
  • 39
2

Thanks to @PAC , I came up with something like this:

merge_sameord = function(x, y, ...) {
    UseMethod('merge_sameord')
}

merge_sameord.data.frame = function(x, y, ...) {
    rstr = paste(sample(c(0:9, letters, LETTERS), 12, replace=TRUE), collapse='')
    x[, rstr] = 1:nrow(x)
    res = merge(x, y, all.x=TRUE, sort=FALSE, ...)
    res = res[order(res[, rstr]), ]
    res[, rstr] = NULL
    res
}

This assumes that you want to preserve the order the first data frame, and the merged data frame will have the same number of rows as the first data frame. It will give you the clean data frame without extra columns.

qed
  • 22,298
  • 21
  • 125
  • 196
2

For package developers

As a package developer, you want to be dependent on as few other packages as possible. Especially tidyverse functions, that change way too often for package developers IMHO.

To be able to make use of the join functions of the dplyr package without importing dplyr, below is a quick implementation. It keeps the original sorting (as requested by OP) and does not move the joining column to the front (which is another annoying thing of merge()).

left_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all.x = TRUE, ...)
}
right_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all.y = TRUE, ...)
}
inner_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all = TRUE, ...)
}
full_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, ...)
}

# workhorse:
merge_exec <- function(x, y, ...) {
  # set index
  x$join_id_ <- 1:nrow(x)
  # do the join
  joined <- merge(x = x, y = y, sort = FALSE, ...)
  # get suffices (yes, I prefer this over suffixes)
  if ("suffixes" %in% names(list(...))) {
    suffixes <- list(...)$suffixes
  } else {
    suffixes <- c("", "")
  }
  # get columns names in right order, so the 'by' column won't be forced first
  cols <- unique(c(colnames(x), 
                   paste0(colnames(x), suffixes[1]), 
                   colnames(y), 
                   paste0(colnames(y), suffixes[2])))
  # get the original row and column index
  joined[order(joined$join_id),
         cols[cols %in% colnames(joined) & cols != "join_id_"]]
}
MS Berends
  • 4,489
  • 1
  • 40
  • 53
1

In this specific case you could us factor for a compact base solution:

df.2$prob = factor(df.2$class,labels=df.1$prob)

df.2
#   object class prob
# 1      A     2  0.7
# 2      B     1  0.5
# 3      D     2  0.7
# 4      F     3  0.3
# 5      C     1  0.5

Not a general solution however, it works if:

  1. You have a lookup table containing unique values
  2. You want to update a table, not create a new one
  3. the lookup table is sorted by the merging column
  4. The lookup table doesn't have extra levels
  5. You want a left_join
  6. If you're fine with factors

1 is not negotiable, for the rest we can do:

df.3  <- df.2 # deal with 2.
df.1b <- df.1[order(df.1$class),] # deal with 3
df.1b <- df.1b[df.1$class %in% df.2$class,] # deal with 4.
df.3$prob = factor(df.3$class,labels=df.1b$prob)
df.3 <- df3[!is.na(df.3$prob),] # deal with 5. if you want an `inner join`
df.3$prob <- as.numeric(as.character(df.3$prob)) # deal with 6.
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
1

The highest rated answer does not produce what the Original Poster would like, i.e., "class" in column 1. If OP would allow switching column order in df.2, then here is a possible base R non-merge one-line answer:

df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))  
df.2 <- data.frame(class = c(2, 1, 2, 3, 1), object = c('A', 'B', 'D', 'F', 'C'))  
cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE])

I happen to like the information portrayed in the row.names. A complete one-liner that exactly duplicates the OP's desired outcome is

data.frame(cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE]),
           row.names = NULL)

I agree with https://stackoverflow.com/users/4575331/ms-berends that the fewer dependencies of a package developer on another package (or "verse") the better because development paths frequently diverge over time.

Note: The one-liner above does not work when there are duplicates in df.1$class. This can be overcome sans merge with 'outer' and a loop, or more generally with Ms Berend's clever post-merge rescrambling code.

ah bon
  • 9,293
  • 12
  • 65
  • 148
Dan Murphy
  • 51
  • 1
0

There are several uses cases in which a simple subset will do:

# Use the key variable as row.names
row.names(df.1) = df.1$key

# Sort df.1 so that it's rows match df.2
df.3 = df.1[df.2$key, ]

# Create a data.frame with cariables from df.1 and (the sorted) df.2
df.4 = cbind(df.1, df.3)

This code will preserve df.2 and it's order and add only matching data from df.1

If only one variable is to be added, the cbind() ist not required:

row.names(df.1) = df.1$key
df.2$data = df.1[df.2$key, "data"]
BurninLeo
  • 4,240
  • 4
  • 39
  • 56
0

I had the same problem with it but I simply used a dummy vector c(1:5) applied to a new column 'num'

df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))

df.2$num <- c(1:5) # This range you can order in the last step.

dfm <- merge(df.2, df.1) # merged

dfm <- dfm[order(dfm$num),] # ascending order
-1

There may be a more efficient way in base. This would be fairly simple to make into a function.

varorder <- names(mydata)  # --- Merge 
mydata <- merge(mydata, otherData, by="commonVar")
restOfvars <- names(mydata[!(names(mydata) %in% varorder)])

mydata[c(varorder,restOfvars)]
James Holland
  • 1,102
  • 10
  • 17