Merge dataframes of different sizes

Question

I have two data frames d1 and d2 respectively as:

I want something like:

I am really sorry for the trivial question, I could not get the answer.

just remove the first, third, and last rows of d1 and rbind(d1, d2) — MLavoie, Dec 23 '15 at 15:01
Could you elaborate what happened with 10,10,7? Do first two columns make a "key" and third one is a "value"? — mlt, Dec 23 '15 at 15:02
Dear MLavoie, It is just an example. The real data is bigger. — Pankaj, Dec 23 '15 at 15:03
Dear mlt, x and y are kind of longitude and latitude and z is value — Pankaj, Dec 23 '15 at 15:04
I mean can it be multivalued for a given x&y? And consider [dput and this faq](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — mlt, Dec 23 '15 at 15:05
One approach (not most efficient) is to have a list with "x;y" (converted to charater and separated by semicolon) as a key and use [modifyList](https://stat.ethz.ch/R-manual/R-devel/library/utils/html/modifyList.html). — mlt, Dec 23 '15 at 15:10
Dear mlt, please provide a reproducible example. It will be really helpful. — Pankaj, Dec 23 '15 at 15:12

Jaap · Accepted Answer · 2020-05-04T13:26:40.217

9

From your description I understand that you want to replace the z values in d1 with the z values in d2 when x & y match.

Using base R:

d3 <- merge(d1, d2, by = c("x","y"), all.x = TRUE)
d3[is.na(d3$z.y),"z.y"] <- d3[is.na(d3$z.y),"z.x"]
d3 <- d3[,-3]
names(d3)[3] <- "z"

which gives:

> d3
   x  y   z
1 10 10 100
2 10 12   6
3 11 10 200
4 11 12   2
5 12 10   1
6 12 12 400

Using the data.table-package:

library(data.table)

setDT(d1) # convert the data.frame to a data.table
setDT(d2) # idem

# join the two data.table's and replace the values
d1[d2, on = .(x, y), z := i.z]

or in one go:

setDT(d1)[setDT(d2), on = .(x, y), z := i.z]

which gives:

> d1
    x  y   z
1: 10 10 100
2: 10 12   6
3: 11 10 200
4: 11 12   2
5: 12 10   1
6: 12 12 400

Using the dplyr package:

d3 <- left_join(d1, d2, by = c("x","y")) %>%
  mutate(z.y = ifelse(is.na(z.y), z.x, z.y)) %>%
  select(-z.x) %>%
  rename(z = z.y)

Since release 0.5.0 you can also use the coalesce-function for this (thx to Laurent Hostert for bringing it to my attention):

d3 <- left_join(d1, d2, by = c("x","y")) %>% 
  mutate(z = coalesce(z.y, z.x)) %>% 
  select(-c(z.x, z.y))

edited May 04 '20 at 13:26

answered Dec 23 '15 at 15:12

Jaap

81,064
34
182
193

@Pankaj thanx; I added also a `dplyr` alternative – Jaap Dec 23 '15 at 16:04
@Jaap I've added my answer expanding your example, see below – Severin Pappadeux Dec 23 '15 at 16:15
@Jaap In dplyr release v0.5.0 (https://github.com/tidyverse/dplyr/releases/tag/v0.5.0) the `coalesce` function was added. This can simplify your dplyr solution: `d3 <- left_join(d1, d2, by = c("x","y")) %>% mutate(z = coalesce(z.y, z.x)) %>% select(-c(z.x, z.y))` – Lauren Hostert Apr 26 '20 at 20:06
@LaurenHostert Thx for notifying! I've included it in my answer now. – Jaap May 04 '20 at 13:27

score 3 · Answer 2 · edited Jan 20 '18 at 17:05

3

On top of excellent @Jaap answer wrt data.table

In data.table, you could join using keys, and with data.table anything done via using keys is the fastest option around. You could even have different column names, see modified example below.

Along the lines, data:

and code:

library(data.table)

d1 <- fread("d1.csv", sep=" ")
d2 <- fread("d2.csv", sep=" ")

# here is data.table keys magic
# note different column names
setkey(d1, x, q)
setkey(d2, x, y)

q <- d2[d1][is.na(z), z := i.z][, i.z := NULL]

print(q)

result:

    x  y   z
1: 10 10 100
2: 10 12   6
3: 11 10 200
4: 11 12   2
5: 12 10   1
6: 12 12 400

edited Jan 20 '18 at 17:05

h3rm4n

4,126
15
21

answered Dec 23 '15 at 16:13

Severin Pappadeux

18,636
3
38
64

Dear Severin Pappadeux, thanks for improving suggestions. – Pankaj Dec 23 '15 at 16:22
You can also join with different column names when using `on`: `d2[d1, on=c("x"="x","y"="q")]`. With the [development version of `data.table` (1.9.7)](https://github.com/Rdatatable/data.table/wiki/Installation) you can shorten this to `d2[d1, on=c("x","y"="q")]`. Not sure about whether setting keys is faster than using `on`. Did you test it? – Jaap Dec 23 '15 at 16:53
Nice, but this returns an entirely new data.table first and then updates that -- all just to add a column (= memory inefficient). See Jaap's updated answer. – Arun Dec 24 '15 at 11:17

user295691 · Answer 3 · 2015-12-23T15:26:28.600

It sounds like you want to ensure that there is only one z value for each x and y value. The main question is how you choose which z value to associate with it. From the description, I'm guessing you either want the second data frame to always override, or you want the maximum value to be taken.

Start with the raw data:

df1 <- structure(list(x = c(10L, 10L, 11L, 11L, 12L, 12L), y = c(10L, 12L, 10L, 12L, 10L, 12L), z = c(7L, 6L, 8L, 2L, 1L, 5L)), .Names = c("x", "y", "z"), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(x = 10:12, y = c(10L, 10L, 12L), z = c(100L, 200L,400L)), .Names = c("x", "y", "z"), class = "data.frame", row.names = c(NA,-3L))

If it's the max you want, then you probably want to simply combine the two frames, and then extract the max for each x and y:

merged.df <- aggregate(z ~ x + y, data = rbind(df1, df2), max)

If, instead, you want the second data frame to override the first, then you would aggregate using the last value to match

merged.df <- aggregate(z ~ x+ y, data=rbind(df1, df2), function(d) tail(d, n=1))

If you have many columns besides z, then I can only assume that you want the latter behavior. For this, you're better off using a library like data.table or dplyr. In dplyr, it would look like this

require(dplyr)
merged.df <- rbind(df1, df2) %>% group_by(x, y) %>% summarise_each(funs(last))

With data.table it would look like

require(data.table)
merged.df <- setDT(rbind(df1, df2))[, lapply(.SD, last), .(x,y)]

Dear user295691, Thanks a lot for sharing useful information. — Pankaj, Dec 23 '15 at 15:58

GKi · Answer 4 · 2019-07-15T11:34:08.567

Alternatively to use merge this update-join can be done in base by using match and which to find the indexes used for sub-setting the tables and interaction to make one key vector out of two.

With this way, neither the order nor the size of d1 will be changed. In case a key in d2 is represented twice, the first occurrence will be used to update d1.

d1 <- read.table(header=TRUE, text="x   y  z
10  10 7
10  12 6
11  10 8
11  12 2
12  10 1
12  12 5")
d2 <- read.table(header=TRUE, text="x  y  z
10 10 100
11 10 200
12 12 400")

key <- c("x", "y") #define which columns are used as matching key
idx <- match(interaction(d2[key]), interaction(d1[key])) #find where it matches
d1$z[idx] <- d2$z #make the update

d1 #show result
#   x  y   z
#1 10 10 100
#2 10 12   6
#3 11 10 200
#4 11 12   2
#5 12 10   1
#6 12 12 400

Alternative you can check if there was a match between d1 and d2 like:

idx <- match(interaction(d1[key]), interaction(d2[key]))
idxn <- which(!is.na(idx)) #find where it does not match
d1$z[idxn] <- d2$z[idx[idxn]]

or

idx <- match(interaction(d2[key]), interaction(d1[key]))
idxn <- which(!is.na(idx))
d1$z[idx[idxn]] <- d2$z[idxn]

Merge dataframes of different sizes

4 Answers4

Linked

Related