0

I got an R data frame and tried to find rows where, simultaneously, the entries of the column data_recalls$Attributed_Brand match and the time periods overlap. The time periods can be found in the columns Before_Date_Recall and After_Date_Recall. Possible matches would look like this:

(1) data_recalls$Attributed_Brand: Nike; Before_Date_Recall: 2018-09-22; After_Date_Recall: 2018-10-21

(2) data_recalls$Attributed_Brand: Nike; Before_Date_Recall: 2018-10-19; After_Date_Recall: 2018-10-24

Unfortunately, I can only come up with a solution where the code looks for matches in the column data_recalls$Attributed_Brand and, after that, for overlapping dates, which does not create the results needed. My data frame looks like this

data_recalls <- data.frame(Attributed_Brand = c("Nike", "Adidas", "Nike", "Puma"),
                       Before_Date_Recall = c("2018-09-22", "2018-09-20", "2018-10-19", "2018-11-01"),
                       After_Date_Recall = c("2018-10-21", "2018-10-24", "2018-10-31", "2018-11-10"),
                       stringsAsFactors = FALSE)

Thanks for any suggestions!

  • 2
    It sounds like a non-equi merge, using one of `data.table`'s merge (always), `dplyr::join_by` (dplyr >= 1.1.0), `fuzzyjoin::` or `sqldf::`. Lacking any sample data or anything else, I think that's the most I think I can suggest. See https://stackoverflow.com/q/64362881/3358272, https://stackoverflow.com/q/64539945/3358272. If you want more help _here_, I suggest a reproducible question with sample data and code attempted, please read https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Mar 22 '23 at 21:49
  • Thanks for your reply! I provided some sample data. The code attempted is: `library(dplyr) data_recalls %>% group_by(Attributed_Brand) %>% filter( any(Before_Date_Recall <= max(Before_Date_Recall)) & any(After_Date_Recall >= min(After_Date_Recall))` Anyway, I couldn't find any overlapping dates, where the brands of data_recalls$Attributed_Brand matched. – Niklas Schröder Mar 22 '23 at 23:44

2 Answers2

0

Sounds like the ivs package would work for you. The groups column allows you to identify overlapping rows within each brand name.

library(ivs)
library(dplyr, warn.conflicts = FALSE)

data_recalls <- tibble(
  Attributed_Brand = c(
    "Nike", "Nike", "Nike", 
    "Adidas", "Adidas", "Adidas", "Adidas"
  ),
  Before_Date_Recall = as.Date(c(
    "2018-09-22", "2018-10-19", "2018-10-26",
    "2018-09-01", "2018-10-05", "2018-09-03", "2018-10-02"
  )),
  After_Date_Recall = as.Date(c(
    "2018-10-21", "2018-10-24", "2018-10-30",
    "2018-09-05", "2018-10-15", "2018-09-07", "2018-10-10"
  ))
)

data_recalls %>%
  mutate(Date_Recall = iv(Before_Date_Recall, After_Date_Recall), .keep = "unused") %>%
  mutate(group = iv_identify_group(Date_Recall), .by = Attributed_Brand) %>%
  filter(n() > 1, .by = c(Attributed_Brand, group))
#> # A tibble: 6 × 3
#>   Attributed_Brand              Date_Recall                    group
#>   <chr>                          <iv<date>>               <iv<date>>
#> 1 Nike             [2018-09-22, 2018-10-21) [2018-09-22, 2018-10-24)
#> 2 Nike             [2018-10-19, 2018-10-24) [2018-09-22, 2018-10-24)
#> 3 Adidas           [2018-09-01, 2018-09-05) [2018-09-01, 2018-09-07)
#> 4 Adidas           [2018-10-05, 2018-10-15) [2018-10-02, 2018-10-15)
#> 5 Adidas           [2018-09-03, 2018-09-07) [2018-09-01, 2018-09-07)
#> 6 Adidas           [2018-10-02, 2018-10-10) [2018-10-02, 2018-10-15)

And here is one other alternate way that may be faster depending on the structure of your actual dataset

data_recalls %>%
  mutate(Date_Recall = iv(Before_Date_Recall, After_Date_Recall), .keep = "unused") %>%
  mutate(count = iv_count_overlaps(Date_Recall, Date_Recall), .by = Attributed_Brand) %>%
  filter(count > 1)
#> # A tibble: 6 × 3
#>   Attributed_Brand              Date_Recall count
#>   <chr>                          <iv<date>> <int>
#> 1 Nike             [2018-09-22, 2018-10-21)     2
#> 2 Nike             [2018-10-19, 2018-10-24)     2
#> 3 Adidas           [2018-09-01, 2018-09-05)     2
#> 4 Adidas           [2018-10-05, 2018-10-15)     2
#> 5 Adidas           [2018-09-03, 2018-09-07)     2
#> 6 Adidas           [2018-10-02, 2018-10-10)     2
Davis Vaughan
  • 2,780
  • 9
  • 19
  • Thanks for your reply! According to your exemplary data set, my preferred data set with matches would look like this: `df_matches <- data.frame( Attributed_Brand = c("Adidas", "Adidas", "Adidas", "Adidas", "Nike", "Nike"), Before_Date_Recall = as.Date(c("2018-10-05", "2018-10-02", "2018-09-03", "2018-09-01", "2018-10-19", "2018-09-22")), After_Date_Recall = as.Date(c("2018-10-15", "2018-10-10", "2018-09-07", "2018-09-05", "2018-10-24", "2018-10-21")))` The third line of your data wouldn't be a match, because the interval from 2018-10-26 to 2018-10-30 didn't overlap with other Nike dates – Niklas Schröder Mar 22 '23 at 23:45
  • I've updated with an extra `filter()` that should result in what you're looking for – Davis Vaughan Mar 23 '23 at 12:55
0

Try this:

sqldf::sqldf("
  select distinct t1.*
  from data_recalls t1
    inner join data_recalls t2 on t1.Attributed_Brand = t2.Attributed_Brand
      and ((t1.Before_Date_Recall > t2.Before_Date_Recall and t1.Before_Date_Recall < t2.After_Date_Recall)
           or (t1.After_Date_Recall > t2.Before_Date_Recall and t1.After_Date_Recall < t2.After_Date_Recall))")
#   Attributed_Brand Before_Date_Recall After_Date_Recall
# 1             Nike         2018-09-22        2018-10-21
# 2             Nike         2018-10-19        2018-10-31

I tried first using sql's between, but that's inclusive, which causes every row to match itself. We could likely safeguard against that with some added portions, but this is a start.

In order to allow overlap, your data needs to have a unique id for each row:

data_recalls$id <- seq_len(nrow(data_recalls))

From there, we can use between and then add inequality of the id field.

sqldf::sqldf("
  select distinct t1.*
  from data_recalls t1
    inner join data_recalls t2 on t1.Attributed_Brand = t2.Attributed_Brand
      and (t1.Before_Date_Recall between t2.Before_Date_Recall and t2.After_Date_Recall
           or t1.After_Date_Recall between t2.Before_Date_Recall and t2.After_Date_Recall)
      and t1.id <> t2.id")
#   Attributed_Brand Before_Date_Recall After_Date_Recall id
# 1             Nike         2018-09-22        2018-10-21  1
# 2             Nike         2018-10-19        2018-10-31  3
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • The code worked for the provided sample data. However, if I got a data frame like the following one, the code does not work: `data_recalls <- data.frame(Attributed_Brand = c("Adidas", "Adidas", "Nike", "Nike"), Before_Date_Recall = c("2018-09-22", "2018-09-20", "2018-11-02", "2018-11-03"), After_Date_Recall = c("2018-10-21", "2018-10-24", "2018-11-10", "2018-11-09"), stringsAsFactors = FALSE)` In this case, the results should like the data frame, since the brand names match and the dates overlap. – Niklas Schröder Mar 23 '23 at 12:16
  • Perhaps https://stackoverflow.com/a/74679562/3358272 – r2evans Mar 23 '23 at 12:27