31

I have two data frames:

df1
x1  x2
1   a
2   b
3   c
4   d

and

df2
x1  x2
2   zz
3   qq

I want to replace some of the values in df1$x2 with values in df2$x2 based on the conditional match between df1$x1 and df2$x2 to produce:

df1
x1  x2
1   a
2   zz
3   qq
4   d
C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
Mike
  • 761
  • 2
  • 7
  • 5

8 Answers8

28

use match(), assuming values in df1 are unique.

df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3,x2=c("zz","qq"),stringsAsFactors=FALSE)

df1$x2[match(df2$x1,df1$x1)] <- df2$x2
> df1
  x1 x2
1  1  a
2  2 zz
3  3 qq
4  4  d

If the values aren't unique, use :

for(id in 1:nrow(df2)){
  df1$x2[df1$x1 %in% df2$x1[id]] <- df2$x2[id]
}
Joris Meys
  • 106,551
  • 31
  • 221
  • 263
  • Nice. I wrote the match with reversed arguments and couldn't figure out why it was more complicated than I thought it should be. I'll add my answer as well because it may help others to think about how the changing the order of arguments in match can make things easier or harder. – Aaron left Stack Overflow May 24 '11 at 14:56
  • Thanks Joris. I was working with 'match' but couldn't get it to work. – Mike May 24 '11 at 15:13
  • I have added a solution that will perform better in the case of non-unique values in df1. – C8H10N4O2 Jul 21 '17 at 15:18
8

We can use {powerjoin}, and handle the conflicting columns with coalesce_yx

library(powerjoin)
df1 <- data.frame(x1 = 1:4, x2 = letters[1:4], stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)

power_left_join(df1, df2, by = "x1", conflict = coalesce_yx)
#>   x1 x2
#> 1  1  a
#> 2  2 zz
#> 3  3 qq
#> 4  4  d
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
7

The first part of Joris' answer is good, but in the case of non-unique values in df1, the row-wise for-loop will not scale well on large data.frames.

You could use a data.table "update join" to modify in place, which will be quite fast:

library(data.table)
setDT(df1); setDT(df2)
df1[df2, on = .(x1), x2 := i.x2]

Or, assuming you don't care about maintaining row order, you could use SQL-inspired dplyr:

library(dplyr)
union_all(
  inner_join( df1["x1"], df2 ), # x1 from df1 with matches in df2, x2 from df2
  anti_join(  df1, df2["x1"] )  # rows of df1 with no match in df2
) # %>% arrange(x1) # optional, won't maintain an arbitrary row order

Either of these will scale much better than the row-wise for-loop.

Frank
  • 66,179
  • 8
  • 96
  • 180
C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
  • The data.table idiom is `df1[df2, on=.(x1), x2 := i.x2 ]` -- modifies in place ("replace some of the values in df1$x2" as the OP asked) and doesn't require setting keys. It's similar to an update join from SQL. – Frank Jul 21 '17 at 15:23
  • @Frank yep you beat me to it. – C8H10N4O2 Jul 21 '17 at 15:25
  • 1
    Ok. `df1[df2, x2 := df2[,x2]]` is not the same thing, fyi. – Frank Jul 21 '17 at 15:25
  • @Frank True, yours is faster – C8H10N4O2 Jul 21 '17 at 15:37
  • 1
    @Frank looks like Hadley [decided not](https://github.com/tidyverse/dplyr/issues/1275) to implement update join in dplyr, which seems like a weakness in the package to me. – C8H10N4O2 Jul 21 '17 at 15:40
  • 1
    Yeah, I saw that. Hadley's reason for excluding them is pretty weak (saying he's sticking to pure SQL), since update joins exist in some flavors of SQL. It just comes down to the "grammar" he came up with not being flexible enough. – Frank Jul 21 '17 at 15:42
  • Another suggested edit: `bind_rows(semi_join(df2, df1, by="x1"), anti_join(df1, df2, by="x1"))` as maybe more idiomatic for dplyr. – Frank Sep 04 '18 at 15:51
  • FYI I wrote a package that supports update joins. For instance using function `safe_left_join` a `conflict` argument can be set to `"patch"` or to `dplyr::coalesce` depending on the desired effect. See an example in my answer here with another function (but with equivalent effect in this case). – moodymudskipper Mar 22 '19 at 17:42
5

I see that Joris and Aaron have both chosen to build examples without factors. I can certainly understand that choice. For the reader with columns that are already factors there would also be to option of coercion to "character". There is a strategy that avoids that constraint and which also allows for the possibility that there may be indices in df2 that are not in df1 which I believe would invalidate Joris Meys' but not Aaron's solutions posted so far:

df1 <- data.frame(x1=1:4,x2=letters[1:4])
df2 <- data.frame(x1=c(2,3,5), x2=c("zz", "qq", "xx") )

It requires that the levels be expanded to include the intersection of both factor variables and then also the need to drop non-matching columns (= NA values) in match(df1$x1, df2$x1)

 df1$x2 <- factor(df1$x2 , levels=c(levels(df1$x2), levels(df2$x2)) )
 df1$x2[na.omit(match(df2$x1,df1$x1))] <- df2$x2[which(df2$x1 %in% df1$x1)]
 df1
#-----------
  x1 x2
1  1  a
2  2 zz
3  3 qq
4  4  d

(Note that recent versions of R do not have stringsAsFactors set to TRUE in the data.frame function defaults, unlike it was for most of the history of R.)

IRTFM
  • 258,963
  • 21
  • 364
  • 487
4

It can be done with dplyr.

library(dplyr)

full_join(df1,df2,by = c("x1" = "x1")) %>% 
  transmute(x1 = x1,x2 = coalesce(x2.y,x2.x))

  x1 x2
1  1  a
2  2 zz
3  3 qq
4  4  d
Wil
  • 3,076
  • 2
  • 12
  • 31
4

You can do it by matching the other way too but it's more complicated. Joris's solution is better but I'm putting this here also as a reminder to think about which way you want to match.

df1 <- data.frame(x1=1:4, x2=letters[1:4], stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3, x2=c("zz", "qq"), stringsAsFactors=FALSE)
swap <- df2$x2[match(df1$x1, df2$x1)]
ok <- !is.na(swap)
df1$x2[ok] <- swap[ok]

> df1
  x1 x2
1  1  a
2  2 zz
3  3 qq
4  4  d
Aaron left Stack Overflow
  • 36,704
  • 7
  • 77
  • 142
3

new here, but using the following dplyr approach seems to work as well
similar but slightly different to one of the answers above

df3 <- anti_join(df1, df2, by = "x1")
df3 <- rbind(df3, df2)
df3
mdb_ftl
  • 423
  • 2
  • 14
0

As of dplyr 1.0.0 there is a function specifically for this:

library(dplyr)
df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3,x2=c("zz","qq"),stringsAsFactors=FALSE)


rows_update(df1, df2, by = "x1")

See https://stackoverflow.com/a/65254214/2738526

Sarah
  • 3,022
  • 1
  • 19
  • 40