Is there a multivariate version of the function %in% in R? That is can one use multiple columns when comparing and overwriting dataframes?

Question

I am working with customer sales data which consists of two datasets. The first set is an overview of the customerID and the type of customer (A or B) with the corresponding turnover. I also have an external set which has the same characteristics as the first one. I need to overwrite the turnover of the first dataset by the turnover of the second dataset based on the combination of customerID and type of customer in the second set. I have the following as an example code:

ID <- c(1,2,3,3,4,5,6,7,7,8,9,10,11,11,12,12,13,14,15)
Type <- c("A","A","A","B","A","A","A","A","B","A","A","A","A","B","A","B","A","A","A")
Turnover <- seq(100,1900,100)

data1 <- as.data.frame(cbind(ID,Type,Turnover))

ID2 <- c(3,7,11,12)
Type2 <- c("B","A","A","A")
Turnover2 <- c(150,450,600,750)

data2 <- as.data.frame(cbind(ID2,Type2,Turnover2))

My first idea was to make use of the %in% function in the following manner:

data1[data1$ID %in% data2$ID2 & data1$Type %in% data2$Type2, "Turnover"] <- data2[data1$ID %in% data2$ID2 & data1$Type %in% data2$Type2, "Turnover2"]

But then I only obtain NAs and if I get a turnover, then it tends to ignore the combination of ID and Type and only focuses on ID. Is there a smart and clever way to overcome this and make use of %in% function in a multidimensional way, i.e., based on more columns? So I want to keep the first dataset as a whole, but only overwrite the turnover for the ID and Type of customer combination that are also present in the second set.

Welcome to SO! Could you provide the expected output as well? — Maël, Jan 03 '22 at 20:06

AndrewGB · Answer 1 · 2022-01-03T20:17:23.717

One way is to use rows_update (though is currently in the "experimental" phase) from dplyr. We can use by to specify the columns to match, then update the other column(s). However, the column names must be the same between the dataframes. So, here I just remove the 2 from the end of the column names in data2, so that they are the same.

library(tidyverse)

colnames(data2) <- gsub('.{1}$', '', colnames(data2))

data1 %>% 
    rows_update(data2, by = c("ID", "Type"))

Output

   ID Type Turnover
1   1    A      100
2   2    A      200
3   3    A      300
4   3    B      150
5   4    A      500
6   5    A      600
7   6    A      700
8   7    A      450
9   7    B      900
10  8    A     1000
11  9    A     1100
12 10    A     1200
13 11    A      600
14 11    B     1400
15 12    A      750
16 12    B     1600
17 13    A     1700
18 14    A     1800
19 15    A     1900

Data

data1 <- structure(list(ID = c("1", "2", "3", "3", "4", "5", "6", "7",
                               "7", "8", "9", "10", "11", "11", "12", "12", "13", "14", "15"),
                        Type = c("A", "A", "A", "B", "A", "A", "A", "A", "B", "A",
                                 "A", "A", "A", "B", "A", "B", "A", "A", "A"),
                        Turnover = c("100", "200", "300", "400", "500", "600", "700", "800", "900", "1000",
                                    "1100", "1200", "1300", "1400", "1500", "1600", "1700", "1800",
                                    "1900")), class = "data.frame", row.names = c(NA, -19L))

data2 <- structure(list(ID2 = c("3", "7", "11", "12"), 
                        Type2 = c("B", "A", "A", "A"), 
                        Turnover2 = c("150", "450", "600", "750")), 
                   class = "data.frame", row.names = c(NA, -4L))

score 0 · Answer 2 · answered Jan 03 '22 at 20:16

The rows_update function looks like a good way to do this, but is (as of Jan 2022) in "experimental" phase. You could use a join + coalesce from dplyr to link the tables and to preferentially use Turnover2, with Turnover as a backup.

library(dplyr)
data1 %>%
  left_join(data2, by = c("ID" = "ID2", "Type" = "Type2")) %>%
  mutate(Turnover_update = coalesce(Turnover2, Turnover))

score 0 · Answer 3 · answered Jan 03 '22 at 20:16

This seems to be a merge/join operation:

base R

newdata <- merge(data1, data2, by.x = c("ID", "Type"), by.y = c("ID2", "Type2"), all.x = TRUE)
newdata
#    ID Type Turnover Turnover2
# 1   1    A      100      <NA>
# 2  10    A     1200      <NA>
# 3  11    A     1300       600
# 4  11    B     1400      <NA>
# 5  12    A     1500       750
# 6  12    B     1600      <NA>
# 7  13    A     1700      <NA>
# 8  14    A     1800      <NA>
# 9  15    A     1900      <NA>
# 10  2    A      200      <NA>
# 11  3    A      300      <NA>
# 12  3    B      400       150
# 13  4    A      500      <NA>
# 14  5    A      600      <NA>
# 15  6    A      700      <NA>
# 16  7    A      800       450
# 17  7    B      900      <NA>
# 18  8    A     1000      <NA>
# 19  9    A     1100      <NA>
newdata$Turnover <- with(newdata, ifelse(is.na(Turnover2), Turnover, Turnover2))
newdata$Turnover2 <- NULL
newdata
#    ID Type Turnover
# 1   1    A      100
# 2  10    A     1200
# 3  11    A      600
# 4  11    B     1400
# 5  12    A      750
# 6  12    B     1600
# 7  13    A     1700
# 8  14    A     1800
# 9  15    A     1900
# 10  2    A      200
# 11  3    A      300
# 12  3    B      150
# 13  4    A      500
# 14  5    A      600
# 15  6    A      700
# 16  7    A      450
# 17  7    B      900
# 18  8    A     1000
# 19  9    A     1100

dplyr

library(dplyr)
newdata <- left_join(data1, data2, by = c(ID = "ID2", Type = "Type2")) %>%
  mutate(Turnover = coalesce(Turnover2, Turnover)) %>%
  select(-Turnover2)

data.table

library(data.table)
DT1[DT2, Turnover := Turnover2, on = .(ID = ID2, Type = Type2)]

Is there a multivariate version of the function %in% in R? That is can one use multiple columns when comparing and overwriting dataframes?

3 Answers3

base R

dplyr

data.table