Data management: flatten data with R

Question

I have the following data frame, which tracks how policies evolve:

Df <- data.frame(Id_policy = c("A_001", "A_002", "A_003","B_001","B_002"),
                 date_new = c("20200101","20200115","20200304","20200110","20200215"),
                 date_end = c("20200503","20200608","20210101","20200403","20200503"),
                 expend = c("","A_001","A_002","",""))

which looks like this:

  Id_policy date_new date_end expend
     A_001 20200101 20200503       
     A_002 20200115 20200608  A_001
     A_003 20200304 20210101  A_002
     B_001 20200110 20200403       
     B_002 20200215 20200503

"Id_policy" identifies a specific policy, "date_new" is the date the policy was issued, and "date_end" is the date it ends. Sometimes, however, a policy is extended. In that case a new policy is created, and the variable "expend" gives the name of the previous policy that it extends.

The idea is to flatten the dataset so that each chain of extensions collapses into a single row: the original policy, with its issue date and the end date of its last extension. The output would look something like this:

  Id_policy date_new date_end expend
     A_001 20200101 20210101       
     B_001 20200110 20200403       
     B_002 20200215 20200503

Has anyone faced a similar problem?

Henrik · Answer 1 · 2021-02-08T18:40:21.203

One way is to treat this as a network problem and use igraph functions (related posts include "Make a group_indices based on several columns" and "Fast way to group variables based on direct and indirect similarities in multiple columns").

Set the missing 'expend' to 'Id_policy'.
Use graph_from_data_frame to create a graph, where 'expend' and 'Id_policy' columns are treated as an edge list.
Use components to get connected components of the graph, i.e. which 'Id_policy' are connected, directly or indirectly.
Select the membership element to get "the cluster id to which each vertex belongs".
Join membership to original data.
Grab relevant data grouped by membership.

I use data.table for the data wrangling steps, but this can of course also be done in base or dplyr.

library(data.table)
library(igraph)

setDT(Df)
Df[expend ==  "", expend := Id_policy]

g = graph_from_data_frame(Df[ , .(expend, Id_policy)])
mem = components(g)$membership

Df[.(names(mem)), on = .(Id_policy), mem := mem]    

Df[ , .(Id_policy = Id_policy[1],
        date_new = first(date_new),
        date_end = last(date_end)), by = mem]
#    mem Id_policy date_new date_end
# 1:   1     A_001 20200101 20210101
# 2:   2     B_001 20200110 20200403
# 3:   3     B_002 20200215 20200503
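
For readers who prefer to avoid data.table, the same steps can be sketched with base R for the wrangling, keeping igraph only for the component lookup. This is a minimal sketch rather than a tested drop-in: the names edges and flat are illustrative, and taking the first and last row per group assumes the rows of each chain are already ordered by date, as they are in the sample data.

```r
library(igraph)

Df <- data.frame(
  Id_policy = c("A_001", "A_002", "A_003", "B_001", "B_002"),
  date_new  = c("20200101", "20200115", "20200304", "20200110", "20200215"),
  date_end  = c("20200503", "20200608", "20210101", "20200403", "20200503"),
  expend    = c("", "A_001", "A_002", "", ""),
  stringsAsFactors = FALSE
)

# Edge list: a policy without a predecessor points to itself, so it still
# appears as a vertex in the graph.
edges <- data.frame(
  from = ifelse(Df$expend == "", Df$Id_policy, Df$expend),
  to   = Df$Id_policy,
  stringsAsFactors = FALSE
)

g   <- graph_from_data_frame(edges)
mem <- components(g)$membership   # named vector: policy id -> component id

# Look up each row's component by policy id.
Df$mem <- unname(mem[Df$Id_policy])

# One row per component. Assumption: rows within a chain are in date order.
flat <- do.call(rbind, lapply(split(Df, Df$mem), function(d) {
  data.frame(
    Id_policy = d$Id_policy[1],
    date_new  = d$date_new[1],
    date_end  = d$date_end[nrow(d)],
    stringsAsFactors = FALSE
  )
}))
rownames(flat) <- NULL
flat
```

If the rows are not guaranteed to be in date order, sorting Df by date_new before splitting would be a reasonable precaution.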

score 2 · Accepted Answer · answered Feb 08 '21 at 18:03

Here is a solution using igraph to create a directed network of IDs, and data.table to do some binding and joining. I kept the intermediate results to show what each step does.

library( data.table )
library( igraph )
setDT(Df)
#create nodes and links
nodes <- Df[,1:3]
links <- Df[ !expend == "", .(from = expend, to = Id_policy) ]
g = graph_from_data_frame( links, vertices = nodes, directed = TRUE )
plot(g)

#find nodes without incoming edges (these are the start points of paths)
in.nodes <- V(g)[degree(g, mode = 'in') == 0]
#define subcomponents of the graph by looping over the in.nodes
L <- lapply( in.nodes, function(x) names( subcomponent(g, x) ) )
# $A_001
# [1] "A_001" "A_002" "A_003"
# $B_001
# [1] "B_001"
# $B_002
# [1] "B_002"
L2 <- lapply( L, function(x) {
  #get first and last element
  dt <- data.table( start = x[1], end = x[ length(x) ] )
})
#bind list together to a single data.table
ans <- rbindlist( L2, use.names = TRUE, fill = TRUE, idcol = "Id_policy" )
#    Id_policy start   end
# 1:     A_001 A_001 A_003
# 2:     B_001 B_001 B_001
# 3:     B_002 B_002 B_002

#update ans with values from original Df for start and end
ans[ Df, `:=`( start = i.date_new ), on = .(start = Id_policy) ][]
ans[ Df, `:=`( end   = i.date_end ), on = .(end = Id_policy) ][]
#    Id_policy    start      end
# 1:     A_001 20200101 20210101
# 2:     B_001 20200110 20200403
# 3:     B_002 20200215 20200503
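
One detail worth knowing: subcomponent() defaults to mode = "all", which ignores edge direction, and the code above takes the last element of its result as the end of the chain. A slightly more defensive sketch follows only outgoing edges and picks the vertex with no outgoing edge instead of relying on the order of the result. The names roots, leaves and chain_end are illustrative, and the sketch assumes each policy is extended at most once, so every chain has exactly one end.

```r
library(igraph)

# Same structure as the sample data: A_001 -> A_002 -> A_003, B_* isolated.
links <- data.frame(from = c("A_001", "A_002"),
                    to   = c("A_002", "A_003"),
                    stringsAsFactors = FALSE)
nodes <- data.frame(name = c("A_001", "A_002", "A_003", "B_001", "B_002"),
                    stringsAsFactors = FALSE)
g <- graph_from_data_frame(links, vertices = nodes, directed = TRUE)

# Chain starts have no incoming edge; chain ends have no outgoing edge.
roots  <- names(which(degree(g, mode = "in")  == 0))
leaves <- names(which(degree(g, mode = "out") == 0))

# For each start, keep the reachable vertex that is also a chain end.
chain_end <- vapply(roots, function(r) {
  reach <- names(subcomponent(g, r, mode = "out"))
  intersect(reach, leaves)[1]
}, character(1))
chain_end
```

The result is a named character vector mapping each original policy to its last extension, which can then be joined back to Df for the dates in the same way as above.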

score 0 · Answer 3 · answered Feb 08 '21 at 17:25

An outer for loop over each policy ID in Df, with an inner while loop that finds the last extension of an original policy, should work:

library(dplyr)  # provides bind_rows()

Df <- data.frame(Id_policy = c("A_001", "A_002", "A_003","B_001","B_002"),
                 date_new = c("20200101","20200115","20200304","20200110","20200215"),
                 date_end = c("20200503","20200608","20210101","20200403","20200503"),
                 expend = c("","A_001","A_002","",""),
                 stringsAsFactors = FALSE)
final_df <- data.frame(matrix(nrow = 0, ncol = 0), stringsAsFactors = FALSE)

for (i in seq_len(nrow(Df))) {
  # Check to see if the current policy ID is in the column expend
  if (Df$Id_policy[i] %in% Df$expend || !Df$expend[i] == "") {
    # Loop through expend policy until last one is found
    found_last <- FALSE
    j <- i
    end_date <- ""

    c_policy_id <- Df$Id_policy[j]
    expended_id <- Df$Id_policy[which(Df$expend == c_policy_id)]

    if (length(expended_id) > 0) {
      if (expended_id %in% Df$expend) {
        while(!found_last) {
          c_policy_id <- Df$Id_policy[j]
          expended_id <- Df$Id_policy[which(Df$expend == c_policy_id)]

          if (length(expended_id) > 0) {
            if (expended_id %in% Df$expend) {

              j <- which(Df$expend == expended_id)
            }
          }else{
            end_date <- Df$date_end[j]
            found_last <- TRUE
          }
        }
        if (!end_date == "") {
          # Add to final df when found the last one
          final_df <- bind_rows(final_df, data.frame(Id_policy = Df$Id_policy[i],
                                                     date_new = Df$date_new[i],
                                                     date_end = end_date,
                                                     expend = ""))
        }
      }
    }
  }else{
    final_df <- bind_rows(final_df, Df[i, ])

  }
}

final_df

  Id_policy date_new date_end expend
1     A_001 20200101 20210101       
2     B_001 20200110 20200403       
3     B_002 20200215 20200503
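
The same chain-following idea can be written more compactly in base R by precomputing, for each row, the row that extends it. This is a sketch under two assumptions that hold in the sample data: each policy is extended at most once, and there are no circular references (a cycle would make the while loop run forever). The names successor and roots are illustrative.

```r
Df <- data.frame(
  Id_policy = c("A_001", "A_002", "A_003", "B_001", "B_002"),
  date_new  = c("20200101", "20200115", "20200304", "20200110", "20200215"),
  date_end  = c("20200503", "20200608", "20210101", "20200403", "20200503"),
  expend    = c("", "A_001", "A_002", "", ""),
  stringsAsFactors = FALSE
)

# successor[k] is the row index of the policy that extends row k, or NA
# if row k is never extended.
successor <- match(Df$Id_policy, Df$expend)

# Original policies are the rows that do not extend anything.
roots    <- which(Df$expend == "")
final_df <- Df[roots, ]

for (k in seq_along(roots)) {
  j <- roots[k]
  # Walk forward until a policy with no extension is reached.
  while (!is.na(successor[j])) {
    j <- successor[j]
  }
  final_df$date_end[k] <- Df$date_end[j]
}
rownames(final_df) <- NULL
final_df
```

Because the walk simply continues until no successor is found, this version should handle chains of any length, including a policy that is extended only once.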
