Merge two data frames while keeping the original row order

Question

I want to merge two data frames keeping the original row order of one of them (df.2 in the example below).

Here are some sample data (all values from class column are defined in both data frames):

df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))

If I do:

merge(df.2, df.1)

Output is:

  class object prob
1     1      B  0.5
2     1      C  0.5
3     2      A  0.7
4     2      D  0.7
5     3      F  0.3

If I add sort = FALSE:

merge(df.2, df.1, sort = F)

Result is:

  class object prob
1     2      A  0.7
2     2      D  0.7
3     1      B  0.5
4     1      C  0.5
5     3      F  0.3

But what I would like is:

  class object prob
1     2      A  0.7
2     1      B  0.5
3     2      D  0.7
4     3      F  0.3    
5     1      C  0.5

The problem is not well defined. What if you have, in the merged result, more or less rows than df.2? — agstudy, Jul 26 '13 at 10:16
Impossible because I use df.2 as first variable in merge function... — DJack, Jul 26 '13 at 10:23
No. it is wrong. still bad defined. take the case you have a class in df;2 that don't exist in df.1.For example : `df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,4,1))` — agstudy, Jul 26 '13 at 10:36
Ok maybe with the edit is more precised. Anyway, the PAC answer does the job I was looking for. Thx help me to improve the question. — DJack, Jul 26 '13 at 11:33
possible duplicate of [How can I merge and maintain the row order of one input?](http://stackoverflow.com/questions/11144551/how-can-i-merge-and-maintain-the-row-order-of-one-input) — Waldir Leoncio, Mar 27 '15 at 20:41

score 66 · Answer 1 · answered Jul 26 '13 at 09:54

You just need to create a variable which gives the row number in df.2. Then, once you have merged your data, you sort the new data set according to this variable. Here is an example :

df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
df.2$id  <- 1:nrow(df.2)
out  <- merge(df.2,df.1, by = "class")
out[order(out$id), ]

score 53 · Accepted Answer · edited May 14 '14 at 22:22

53

Check out the join function in the plyr package. It's like merge, but it allows you to keep the row order of one of the data sets. Overall, it's more flexible than merge.

Using your example data, we would use join like this:

> join(df.2,df.1)
Joining by: class
  object class prob
1      A     2  0.7
2      B     1  0.5
3      D     2  0.7
4      F     3  0.3
5      C     1  0.5

Here are a couple of links describing fixes to the merge function for keeping the row order:

http://www.r-statistics.com/2012/01/merging-two-data-frame-objects-while-preserving-the-rows-order/

http://r.789695.n4.nabble.com/patching-merge-to-allow-the-user-to-keep-the-order-of-one-of-the-two-data-frame-objects-merged-td4296561.html

edited May 14 '14 at 22:22

joran

169,992
32
429
468

answered Jul 30 '13 at 18:31

user2635373

646
5
5

2

Show the solutions here, don't just reference them. Links can become invalid fairly fast on the internet.. – stefan Jul 30 '13 at 18:54
Each of these links is an in-depth discussion of the problem with the merge function, so I'll just leave them as links. (But instead I will again put in a plug for the join function in the plyr package.) – user2635373 Jul 30 '13 at 20:11
3

I (and `join`s documentation) would disagree that it is "more flexible" than `merge`: "Join is often faster than merge, although it is somewhat less featureful - it currently offers no way to rename output or merge on different variables in the x and y data frames." – joran May 14 '14 at 22:23
*"it allows you to keep the row order of one of the data sets"* To be more precise, as per `?join`: "Unlike merge, preserves the order of x no matter what join type is used. If needed, rows from y will be added to the bottom." – landroni Sep 01 '15 at 20:17

score 17 · Answer 3 · answered Jul 18 '14 at 07:51

You can also check out the inner_join function in Hadley's dplyr package (next iteration of plyr). It preserves the row order of the first data set. The minor difference to your desired solution is that it also preserves the original column order of the first data set. So it does not necessarily put the column we used for merging at the first position.

Using your example above, the inner_join result looks like this:

inner_join(df.2,df.1)
Joining by: "class"
  object class prob
1      A     2  0.7
2      B     1  0.5
3      D     2  0.7
4      F     3  0.3
5      C     1  0.5

score 11 · Answer 4 · edited Feb 01 '18 at 08:17

11

From data.table v1.9.5+, you can do:

require(data.table) # v1.9.5+
setDT(df.1)[df.2, on="class"]

The performs a join on column class by finding out matching rows in df.1 for each row in df.2 and extracting corresponding columns.

edited Feb 01 '18 at 08:17

Uwe

41,420
11
90
134

answered Aug 20 '15 at 17:33

Arun

116,683
26
284
387

1

Is it stated explicitly in the documentation anywhere that row order is preserved? – logworthy Jun 16 '18 at 11:09
1

It doesn't seem to be with some data I have. – JerryN Dec 08 '18 at 18:23

score 5 · Answer 5 · answered Feb 01 '18 at 08:14

For the sake of completeness, updating in a join preserves the original row order as well. This might be an alternative to Arun's data.table answer if there are only a few columns to append:

library(data.table)
setDT(df.2)[df.1, on = "class", prob := i.prob][]

   object class prob
1:      A     2  0.7
2:      B     1  0.5
3:      D     2  0.7
4:      F     3  0.3
5:      C     1  0.5

Here, df.2 is right joined to df.1 and gains a new column prob which is copied from the matching rows of df.1.

For multiple columns you can do `setDT(df.2)[df.1, on = "class", names(df.1) := mget(paste0("i.", names(df.1)))]` — David Arenburg, Jul 16 '20 at 10:37

score 3 · Answer 6 · edited May 23 '17 at 12:18

The accepted answer proposes a manual way to keep order when using merge, which works most of the times but requires unnecessary manual work. This solution comes on the back of How to ddply() without sorting?, which deals with the issue of keeping order but in a split-apply-combine context:

This came up on the plyr mailing list a while back (raised by @kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:

#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) { 
  col <- ".sortColumn"
  data[,col] <- 1:nrow(data) 
  out <- fn(data, ...) 
  if (!col %in% colnames(out)) stop("Ordering column not preserved by function") 
  out <- out[order(out[,col]),] 
  out[,col] <- NULL 
  out 
}

So now you can use this generic keeping.order function to keep the original row order of a merge call:

df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
keeping.order(df.2, merge, y=df.1, by = "class")

Which will yield, as requested:

> keeping.order(df.2, merge, y=df.1, by = "class")
  class object id prob
3     2      A  1  0.7
1     1      B  2  0.5
4     2      D  3  0.7
5     3      F  4  0.3
2     1      C  5  0.5

So keeping.order effectively automates the approach in the accepted answer.

qed · Answer 7 · 2013-08-09T13:41:44.787

Thanks to @PAC , I came up with something like this:

merge_sameord = function(x, y, ...) {
    UseMethod('merge_sameord')
}

merge_sameord.data.frame = function(x, y, ...) {
    rstr = paste(sample(c(0:9, letters, LETTERS), 12, replace=TRUE), collapse='')
    x[, rstr] = 1:nrow(x)
    res = merge(x, y, all.x=TRUE, sort=FALSE, ...)
    res = res[order(res[, rstr]), ]
    res[, rstr] = NULL
    res
}

This assumes that you want to preserve the order the first data frame, and the merged data frame will have the same number of rows as the first data frame. It will give you the clean data frame without extra columns.

score 2 · Answer 8 · answered May 02 '20 at 13:48

For package developers

As a package developer, you want to be dependent on as few other packages as possible. Especially tidyverse functions, that change way too often for package developers IMHO.

To be able to make use of the join functions of the dplyr package without importing dplyr, below is a quick implementation. It keeps the original sorting (as requested by OP) and does not move the joining column to the front (which is another annoying thing of merge()).

left_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all.x = TRUE, ...)
}
right_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all.y = TRUE, ...)
}
inner_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all = TRUE, ...)
}
full_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, ...)
}

# workhorse:
merge_exec <- function(x, y, ...) {
  # set index
  x$join_id_ <- 1:nrow(x)
  # do the join
  joined <- merge(x = x, y = y, sort = FALSE, ...)
  # get suffices (yes, I prefer this over suffixes)
  if ("suffixes" %in% names(list(...))) {
    suffixes <- list(...)$suffixes
  } else {
    suffixes <- c("", "")
  }
  # get columns names in right order, so the 'by' column won't be forced first
  cols <- unique(c(colnames(x), 
                   paste0(colnames(x), suffixes[1]), 
                   colnames(y), 
                   paste0(colnames(y), suffixes[2])))
  # get the original row and column index
  joined[order(joined$join_id),
         cols[cols %in% colnames(joined) & cols != "join_id_"]]
}

No prob! Though I later found out that `match()` is much, much faster than `merge()`. But both work :) — MS Berends, Feb 24 '22 at 19:53

score 1 · Answer 9 · answered Feb 26 '18 at 19:23

In this specific case you could us factor for a compact base solution:

df.2$prob = factor(df.2$class,labels=df.1$prob)

df.2
#   object class prob
# 1      A     2  0.7
# 2      B     1  0.5
# 3      D     2  0.7
# 4      F     3  0.3
# 5      C     1  0.5

Not a general solution however, it works if:

You have a lookup table containing unique values
You want to update a table, not create a new one
the lookup table is sorted by the merging column
The lookup table doesn't have extra levels
You want a left_join
If you're fine with factors

1 is not negotiable, for the rest we can do:

df.3  <- df.2 # deal with 2.
df.1b <- df.1[order(df.1$class),] # deal with 3
df.1b <- df.1b[df.1$class %in% df.2$class,] # deal with 4.
df.3$prob = factor(df.3$class,labels=df.1b$prob)
df.3 <- df3[!is.na(df.3$prob),] # deal with 5. if you want an `inner join`
df.3$prob <- as.numeric(as.character(df.3$prob)) # deal with 6.

score 1 · Answer 10 · edited Oct 16 '20 at 10:20

The highest rated answer does not produce what the Original Poster would like, i.e., "class" in column 1. If OP would allow switching column order in df.2, then here is a possible base R non-merge one-line answer:

df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))  
df.2 <- data.frame(class = c(2, 1, 2, 3, 1), object = c('A', 'B', 'D', 'F', 'C'))  
cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE])

I happen to like the information portrayed in the row.names. A complete one-liner that exactly duplicates the OP's desired outcome is

data.frame(cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE]),
           row.names = NULL)

I agree with https://stackoverflow.com/users/4575331/ms-berends that the fewer dependencies of a package developer on another package (or "verse") the better because development paths frequently diverge over time.

Note: The one-liner above does not work when there are duplicates in df.1$class. This can be overcome sans merge with 'outer' and a loop, or more generally with Ms Berend's clever post-merge rescrambling code.

score 0 · Answer 11 · answered Feb 20 '19 at 11:49

There are several uses cases in which a simple subset will do:

# Use the key variable as row.names
row.names(df.1) = df.1$key

# Sort df.1 so that it's rows match df.2
df.3 = df.1[df.2$key, ]

# Create a data.frame with cariables from df.1 and (the sorted) df.2
df.4 = cbind(df.1, df.3)

This code will preserve df.2 and it's order and add only matching data from df.1

If only one variable is to be added, the cbind() ist not required:

row.names(df.1) = df.1$key
df.2$data = df.1[df.2$key, "data"]

score 0 · Answer 12 · answered Feb 12 '23 at 07:17

I had the same problem with it but I simply used a dummy vector c(1:5) applied to a new column 'num'

df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))

df.2$num <- c(1:5) # This range you can order in the last step.

dfm <- merge(df.2, df.1) # merged

dfm <- dfm[order(dfm$num),] # ascending order

score -1 · Answer 13 · answered Aug 20 '15 at 17:25

There may be a more efficient way in base. This would be fairly simple to make into a function.

varorder <- names(mydata)  # --- Merge 
mydata <- merge(mydata, otherData, by="commonVar")
restOfvars <- names(mydata[!(names(mydata) %in% varorder)])

mydata[c(varorder,restOfvars)]

Merge two data frames while keeping the original row order

13 Answers13

For package developers

Linked

Related