Create a "flag" column in a dataset based on a another table in R

Question

I have two datasets: dataset1 and dataset2.

zz <- "id_customer id_order order_date
1 1 2018-10
1 2 2018-11
2 3 2019-05
3 4 2019-06"

dataset1 <- read.table(text=zz, header=TRUE)

yy <- "id_customer order_date
1 2018-10
3 2019-06"

dataset2 <- read.table(text=yy, header=TRUE)

dataset2 is the result of a query where I have two columns: id_customer and date (format YYYY-mm). Those correspond to customers which have a different status than the others in the source dataset (dataset1), for a specified month.

dataset1 is a list of transactions where I have id_customer, id_order and date (format YYYY-mm as well). I want to enrich dataset1 with a "flag" column for each line set to 1 if the customer id appears in dataset2, during the corresponding month.

I have tried something as follows:

dataset$flag <- ifelse(dataset1$id_customer %in% dataset2$id_customer &
                         dataset1$date == dataset2$date,
                       "1", "0")

But I get a warning message that says 'longer object length is not a multiple of shorter object length'. I understand that but cannot come up with a solution. Could someone please help?

Welcome to Stack Overflow! As it stands, your question is hard to answer, since we don't have your data. Please create a [minimal reproducible example](https://stackoverflow.com/a/5963610/4303162) and add it to your question such that we can reproduce the problem and test possible solutions. — Stibu, Mar 09 '20 at 17:05
But I guess that `dataset1` and `dateset2` don't have the same number of rows. — Stibu, Mar 09 '20 at 17:09

score 1 · Accepted Answer · answered Mar 11 '20 at 06:24

You can add a flag to dataset2 then use merge(), keeping all rows from dataset1. Borrowing Chris' data:

dataset2$flag <- 1
merge(dataset1, dataset2, all.x = TRUE)

   ID    Date flag
1   1 2018-12   NA
2   1 2019-11   NA
3   2 2018-13   NA
4   2 2019-10   NA
5   2 2019-11    1
6   2 2019-12   NA
7   2 2019-12   NA
8   3 2018-12    1
9   3 2018-12    1
10  4 2018-13    1

Chris Ruehlemann · Answer 2 · 2020-03-11T06:01:43.453

0

EDIT: This seems to work:

Illustrative data:

set.seed(100)
dt1 <- data.frame(
  ID = sample(1:4, 10, replace = T),
  Date = paste0(sample(2018:2019, 10, replace = T),"-", sample(10:13, 10, replace = T))
)
dt1
   ID    Date
1   2 2019-12
2   2 2019-12
3   3 2018-12
4   1 2018-12
5   2 2019-11
6   2 2019-10
7   4 2018-13
8   2 2018-13
9   3 2018-12
10  1 2019-11

dt2 <- data.frame(
  ID = sample(1:4, 5, replace = T),
  Date = paste0(sample(2018:2019, 5, replace = T),"-", sample(10:13, 5, replace = T))
)
dt2
  ID    Date
1  2 2019-11
2  4 2018-13
3  2 2019-13
4  4 2019-13
5  3 2018-12

SOLUTION:

The solution uses ifelse to define a condition upon which to set the 'flag' 1(as specified in the OP). That condition implies a match between dt1and dt2; thus we're using match. A complicating factor is that the condition requires a double match between two columns in each dataframe. Therefore, we use apply to paste the rows in the two columns together using paste0 and search for matches in these compound strings:

dt1$flag <- ifelse(match(apply(dt1[,1:2], 1, paste0, collapse = " "), 
                         apply(dt2[,1:2], 1, paste0, collapse = " ")), 1, "NA")

RESULT:

dt1
   ID    Date flag
1   2 2019-12   NA
2   2 2019-12   NA
3   3 2018-12    1
4   1 2018-12   NA
5   2 2019-11    1
6   2 2019-10   NA
7   4 2018-13    1
8   2 2018-13   NA
9   3 2018-12    1
10  1 2019-11   NA

To check the results we can compare them with the results obtained from merge:

flagged_only <- merge(dt1, dt2)

flagged_only
  ID    Date
1  2 2019-11
2  3 2018-12
3  3 2018-12
4  4 2018-13

The dataframe flagged_onlycontains exactly the same four rows as the ones flagged 1 in dt1-- voilà!

edited Mar 11 '20 at 06:01

answered Mar 09 '20 at 17:34

Chris Ruehlemann

20,321
4
12
34

1

This seems to work and solve the fact that my two datasets don't have the same number of rows. Cheers! – Gregoire Maltès Mar 09 '20 at 17:41
Just after checking because I had weird outcomes, I realized that there is a problem: some rows are marked whereas they should not. It flags rows where IDs are actually in dataset2, but the dates are not – Gregoire Maltès Mar 09 '20 at 19:04
On my end it does work, even on larger datasets. You can also try reduplicating the `Date`value where the two dataframes match on the two conditions: `dataset1$flag2 <- dataset1$Date[match(dataset1$ID,dataset2$ID) & match(dataset1$Date, dataset2$Date)]`--this flags exactly the same rows. – Chris Ruehlemann Mar 09 '20 at 19:18
If you can provide a somewhat larger sample of your data I could perhaps help more efficiently. Also, have you made sure you have adapted your code correctly so that it matches mine? – Chris Ruehlemann Mar 09 '20 at 19:19
I really cannot figure out what goes wrong and continue to get the same issue... I could send you the two datasets so that you could see the problem by yourself maybe.. I sadly do not have enough reputation points to start a chat – Gregoire Maltès Mar 09 '20 at 20:17
Yeah, why not. I think it is not customary to share email addresses on SO but you can easily google me via my name plus Freiburg University. I have time tomorrow so I could have a peek at your data. – Chris Ruehlemann Mar 09 '20 at 21:46
Really glad it does! Took me some time to figure it out... Please do feel free to accept the answer! – Chris Ruehlemann Mar 11 '20 at 13:42

score 0 · Answer 3 · answered Sep 20 '22 at 10:46

It is very is to add a corresponding flag in a data.table way:

# Load library
library(data.table)

# Convert created tables to data.table object
setDT(dataset1)
setDT(dataset2)

# Add {0, 1} to dataset1 if the row can be found in dataset2
dataset1[, flag := 0][dataset2, flag := 1, on = .(id_customer, order_date)]

The result looks as follows:

> dataset1
   id_customer id_order order_date flag
1:           1        1    2018-10    1
2:           1        2    2018-11    0
3:           2        3    2019-05    0
4:           3        4    2019-06    1

A bit more manipulations would be needed if you would have the full date/time in the datasets.

Create a "flag" column in a dataset based on a another table in R

3 Answers3