1

I have two dataframes:

d1 <- data.frame(id=1,a=1,start_d1=10, end_d1=19)
d2 <- data.frame(id=1,b=2, start_d2=15, end_d2=24)

Actually, start and end are dates and the data frames are huge, but I want to keep it simple here. Now I want to join on id and then get new rows depending on where the respective start and end dates overlap. The result should be a dataframe d:

     id   | a  | b  | start | end    
    ----------------------------
        1 | 1  | NA | 10 | 14
        1 | 1  | 2  | 15 | 19  
        1 | NA | 2  | 20 | 24


What I am doing right now: I join d1and d2 on id to get d. Then I apply an explicit function I wrote to each row in d that splits the row into new ones with the new variables startand end depending on the overlap.

This is somewhat messy. Is there a simpler approach to my problem?

Claudio Moneo
  • 489
  • 1
  • 4
  • 10
  • Hi, What is the expected output .please ```dput()``` your expected output will be good. – Tushar Lad Aug 14 '20 at 09:53
  • Something like this ```merge```? ```d <- merge(d1,d2,by=c('id','start','end'),all=T)``` – Tushar Lad Aug 14 '20 at 10:09
  • @TusharLad No, this doesn´t give us a seperate row for the overlap of start and date – Claudio Moneo Aug 14 '20 at 10:12
  • Related: [identify consecutively overlapping segments in R](https://stackoverflow.com/questions/52091907/identify-consecutively-overlapping-segments-in-r) – Henrik Aug 14 '20 at 10:37
  • You could use `rbind` after removing all after `_` in names of your dataframes! – Duck Aug 14 '20 at 12:09
  • @Duck No, same problem as before, the overlap doesn´t get a seperate row – Claudio Moneo Aug 14 '20 at 12:28
  • @MF14 I did some sketchs but not sure if it is what you want: `a1 <- d1 %>% pivot_longer(cols = -c(1,2)) %>% separate(col = name,into = c('key','date')) %>% mutate(val=1:n()) %>% pivot_wider(names_from = key,values_from = value) %>% select(-c(date,val))` Then `a2 <- d2 %>% pivot_longer(cols = -c(1,2)) %>% separate(col = name,into = c('key','date')) %>% mutate(val=1:n()) %>% pivot_wider(names_from = key,values_from = value) %>% select(-c(date,val))` last `a1 %>% bind_rows(a2)` and loaded `tidyverse` – Duck Aug 14 '20 at 12:53

1 Answers1

0

I think, joining the two datasets and then applying a function row rise and then rbinding them with do.call seems like a reasonable approach.

I tried to come up with a dplyr approach, but it feels more like a abomination, than anything else.

d1 <- data.frame(id=c(1,2),a=1, start_d1=c(10,20), end_d1=c(19,29))
d2 <- data.frame(id=c(1,2),b=2, start_d2=c(15,25), end_d2=c(24,34))

full_join(d1,d2) %>% 
  rowwise() %>% 
  do(data.frame(id = .$id,
                all_seq = seq(.$start_d1, .$end_d2, by = 1),
                a = c(rep(.$a, length(seq(.$start_d1, .$end_d1, by =1))),
                      rep(NA, length(seq(.$end_d1+1, .$end_d2, by = 1)))),
                b = c(rep(NA, length(seq(.$start_d1+1, .$start_d2, by =1))),
                      rep(.$b, length(seq(.$start_d2, .$end_d2, by = 1)))))) %>% 
  mutate(classifier = case_when(!is.na(a) && is.na(b) ~ 1,
                                !is.na(a) && !is.na(b) ~ 2,
                                is.na(a) && !is.na(b) ~ 3
                                )) %>% 
  ungroup() %>% 
  group_by(id, classifier) %>% 
  summarise(start = min(all_seq),
            end = max(all_seq),
            a = unique(a),
            b = unique(b)) %>% 
  select(-classifier)

Joining, by = "id"
# A tibble: 6 x 5
# Groups:   id [2]
     id start   end     a     b
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1    10    14     1    NA
2     1    15    19     1     2
3     1    20    24    NA     2
4     2    20    24     1    NA
5     2    25    29     1     2
6     2    30    34    NA     2

The idea in the do-function is to produce a sequence of the whole period per id, and a and b where they have to be.

SebSta
  • 476
  • 2
  • 12