Search string sequentially in R

Question

id1   id2    date

101   NA      01.1.2021     
102   101     12.1.2021
103   102     17.1.2021
104   103     18.1.2021
105   NA      25.1.2021
106   NA      03.1.2021
107   NA      10.1.2021
108   107     11.1.2021
109   NA      09.1.2021

I have two id variables in a data. I need to search for string sequentially.

Search 101 (id1[1]) in id2, if 101 present in id2, then continue and return 102 (id1). Again it will search for 102 in id2 and if 102 present in id2, then continue and return 103. The process will continue and will stop when id1 is not present in id2.

So output will be:

[[1]] 01.1.2021, 12.1.2021, 17.1.2021, 18.1.2021

Similarly 2nd output for id 107 will be:

[[2]] 10.1.2021, 11.1.2021

Could you add a table with your desired output? I am not exactly sure if you are asking. — koolmees, Jul 09 '21 at 08:32
The output will be a vector of dates. Check output [[1]] and [[2]] — Williams86, Jul 09 '21 at 08:39
You may check `igraph::components`. See e.g. [identify groups of linked episodes which chain together](https://stackoverflow.com/a/12170710) and Linked therein. — Henrik, Jul 09 '21 at 09:01
Are you looking for a function which given an id returns a date string or to add a column of such date strings to the input or both? Please be specific and edit the question to clarify what it is asking. — G. Grothendieck, Jul 09 '21 at 18:11

score 1 · Accepted Answer · answered Jul 09 '21 at 09:25

You can do this with a simple recursive function that jumps to the next id2for each entry. But you have to be cautious not to include circular references in id1 and id2. Otherwise, you get endless recursion:

dscan = function(df,init=101){
  ni = (1:dim(df)[1])[df$id2==init & !is.na(df$id2)][1] ## Get the next line of df that fulfills the condition that id2 is the current id1
  nv = c(df$date[df$id1==init]) ## Current date
  if(!is.na(ni)>0){
    nx = df$id1[ni[1]]          ## Next index
    return(c(nv,dscan(df,nx)))  ## Recursion step
  } else {return(c(nv))}        ## Abort recursion if there is no next ni
}

The output would be:

> dscan(df,101)
[1] "01.1.2021" "12.1.2021" "17.1.2021" "18.1.2021"
> dscan(df,107)
[1] "10.1.2021" "11.1.2021"
> dscan(df,108)
[1] "11.1.2021"

score 0 · Answer 2 · answered Jul 09 '21 at 09:25

It's not perfect but it gets the job done:

library(zoo)

df %>%
  subset(id1 %in% id2 | id2 %in% id1) %>%
  mutate(id1 = na.locf(ifelse(is.na(id2), id1, NA))) %>%
  group_by(id1) %>%
  summarise_all(funs(toString(unique(.)))) %>%
  select(date)

Which returns:

  date                                      
  <chr>                                     
1 01.1.2021, 12.1.2021, 17.1.2021, 18.1.2021
2 10.1.2021, 11.1.2021

G. Grothendieck · Answer 3 · 2021-07-09T19:49:02.090

1) We will assume that what is wanted is an additional column which gives a comma separated string of dates. To do that we form a recursive common table expression (CTE) in SQL.

library(sqldf)
sqldf("with recursive R(id1, id2, date) as (
    select * from DF a where a.id1 = id1
    union all
    select a.id1, a.id2, R.date from DF a join R on a.id1 = R.id2
)
  select a.*, group_concat(r.date) dates 
  from DF a 
  left join R on a.id1 = R.id1
  group by a.rowid")

giving:

  id1 id2      date                                   dates
1 101  NA 01.1.2021 01.1.2021,12.1.2021,17.1.2021,18.1.2021
2 102 101 12.1.2021           12.1.2021,17.1.2021,18.1.2021
3 103 102 17.1.2021                     17.1.2021,18.1.2021
4 104 103 18.1.2021                               18.1.2021
5 105  NA 25.1.2021                               25.1.2021
6 106  NA 03.1.2021                               03.1.2021
7 107  NA 10.1.2021                     10.1.2021,11.1.2021
8 108 107 11.1.2021                               11.1.2021
9 109  NA 09.1.2021                               09.1.2021

2) If instead what is wanted is a function that takes the data frame and id and produces the date string then use the following. (Note that if id were character rather than numeric then $id at the end of the first select line should be replaced with '$id' .)

library(sqldf)

get_dates <- function(data, id) {
  fn$sqldf("with recursive R(id1, id2, date) as (
    select * from DF where id1 = $id
    union all
    select a.id1, a.id2, a.date from DF a join R on a.id2 = R.id1
)
  select group_concat(date) dates from R")$dates
}

get_dates(DF, 101)
## [1] "01.1.2021,12.1.2021,17.1.2021,18.1.2021"

get_dates(DF, 107)
## [1] "10.1.2021,11.1.2021"

We could use this function to generate the output in (1):

transform(DF, dates = sapply(id1, get_dates, data = DF))

Note

DF <- structure(list(id1 = 101:109, id2 = c(NA, 101L, 102L, 103L, NA, 
NA, NA, 107L, NA), date = c("01.1.2021", "12.1.2021", "17.1.2021", 
"18.1.2021", "25.1.2021", "03.1.2021", "10.1.2021", "11.1.2021", 
"09.1.2021")), class = "data.frame", row.names = c(NA, -9L))

Search string sequentially in R

3 Answers3

Note