R: for loop creating new columns populated by conditional statement based on the previous column

Question

My [simplified] data looks like this:

id = sample(1:20, 5)
first_active = c(1,1,1,2,3)
week1 = c(1,1,1,0,0)
week2 = c(1,0,0,1,0)
week3 = c(1,0,1,0,1)
week4 = c(1,0,0,0,1)
week5 = c(0,0,0,0,1)

df = data.frame(cbind(id, first_active, week1, week2, week3, week4, week5))

I want to create a for loop that would, in the same data.frame, create columns p1, p2, ... corresponding to the week1, week2, ... columns and populate them as follows:

i) if the corresponding week value is not 0, then "active"

ii) if the value for a given week is 0, check the previous p-column's status: if p[i-1] == "active", then "lapsed1"

iii) if the value for a given week is 0, check the previous p-column's status: if p[i-1] == "lapsed[j]", then "lapsed[j+1]"

iv) otherwise, return NA

This would be the solution to the above example (using mutate in dplyr):

df %>%
mutate( p1 = ifelse(week1 != 0, "active", NA),
      p2 = ifelse(week2 !=0, "active", 
                  ifelse(p1 == "active", "lapsed1", NA)),
      p3 = ifelse(week3 !=0, "active", 
                  ifelse(p2 == "lapsed1", "lapsed2",
                  ifelse(p2 == "active", "lapsed1", NA))),
      p4 = ifelse(week4 !=0, "active", 
                  ifelse(p3 == "lapsed2", "lapsed3",
                  ifelse(p3 == "lapsed1", "lapsed2",
                         ifelse(p3 == "active", "lapsed1", NA)))),
      p5 = ifelse(week5 !=0, "active", 
                  ifelse(p4 == "lapsed3", "lapsed4",
                  ifelse(p4 == "lapsed2", "lapsed3",
                         ifelse(p4 == "lapsed1", "lapsed2",
                                ifelse(p4 == "active", "lapsed1", NA)))))
  )


 id first_active week1 week2 week3 week4 week5     p1      p2      p3      p4      p5
  9            1     1     1     1     1     0 active  active  active  active lapsed1
  5            1     1     0     0     0     0 active lapsed1 lapsed2 lapsed3 lapsed4
 14            1     1     0     1     0     0 active lapsed1  active lapsed1 lapsed2
  3            2     0     1     0     0     0   <NA>  active lapsed1 lapsed2 lapsed3
  8            3     0     0     1     1     1   <NA>    <NA>  active  active  active

I want to create a function/for loop that would do it automatically, as my original data has tens of 'week' columns to refer to.

What I managed to get so far is:

df$p1 = ifelse(df$week1 > 0, "active", NA) # initiating the first p-column

for(i in 2:(ncol(df)-2)) { # defining dynamically number of periods

column_to_write = paste0("p", i, sep="") # column to be populated 
prev_column = paste0("p", i-1, sep="") #previous p-column to the one that's being populated
orig_column = paste0("week", i, sep="") #reference 'week' column
j = 1 #initiating 'lapsed' number

df[column_to_write] = ifelse(df[orig_column]> 0, "active", 
                                  ifelse(df[prev_column] == "active", paste("lapsed", j, sep=""), 
                                  ifelse(df[prev_column] == paste0("lapsed", j, sep=""), paste0("lapsed", j=j+1, sep=""), NA)))

}

but this only gives me max values of "lapsed2" and creates new columns called week[i] rather than p[i].

 id first_active week1 week2 week3 week4 week5     p1   week2   week3   week4   week5
  9            1     1     1     1     1     0 active  active  active  active lapsed1
  5            1     1     0     0     0     0 active lapsed1 lapsed2    <NA>    <NA>
 14            1     1     0     1     0     0 active lapsed1  active lapsed1 lapsed2
  3            2     0     1     0     0     0   <NA>  active lapsed1 lapsed2    <NA>
  8            3     0     0     1     1     1   <NA>    <NA>  active  active  active

How do I change the code so that numbers in "lapsed" values continue to rise beyond 2?

Thanks for your help! Kasia

This is very manual as you have it. You should convert your data to long format (`reshape2::melt` or `tidyr::gather`) so that you have a single `week_num` column with values 1 to 5 and a `week_val` column with the 1s and 0s from your data. Then you can add a single `p` column. When finished, you can get your data back to wide format if necessary (`reshape2::dcast` or `tidyr::spread`). This will scale nicely: whether you have 5 weeks or 500 weeks, the code will be the same. — Gregor Thomas, Sep 20 '16 at 19:56
Hm, how would I be able to refer to the previous period/week for the same id if the data is in the long format? — Kasia Kulma, Sep 20 '16 at 20:00
Easiest would be to use `dplyr` or `data.table` and their grouping functions. Grouping by `id`, sorting by `week_num`, then something like `p = ifelse(week_val != 0, 'active', NA)` to start, then you could do something with `paste` and `rle` for your `"lapsed[i]"` variables. Something along the lines of [R: count consecutive occurrences of values](http://stackoverflow.com/q/19998836/903061). I don't have time to write up a full solution now - I'll try to find time later on if no one beats me to it. — Gregor Thomas, Sep 20 '16 at 20:10
cool, thanks for clarifying. Will try to play with `rle` and will post the answer here if I'm successful, thanks! — Kasia Kulma, Sep 20 '16 at 20:13
Brilliant, your suggestion worked like a charm, thank you! If you post your answer, I'll be happy to upvote it, but if you're too busy, I can post my answer. Thanks again! — Kasia Kulma, Sep 20 '16 at 22:59

score 12 · Accepted Answer · answered Sep 21 '16 at 08:30

In the end I gave up on the for loop and instead followed the suggestions posted by @Gregor. Here's what I did:

library(reshape2) # melt() and dcast()
library(dplyr)    # %>%, mutate(), arrange(), group_by(), select(), inner_join()

df_long = melt(df, id.vars = c("id", "first_active")) # transformed my wide data to the long format
colnames(df_long) = c("id", "first_active", "week_num", "week_orders")


df_long = 
df_long %>%
mutate(p_var = paste("p", sub("week", "", week_num), sep="")) %>% # created p-columns that correspond to respective weeks; sub() instead of substr(week_num, 5, 5) so that two-digit weeks (week10, ...) also work
arrange(id, week_num) %>% # sorted by id and week so that runs are counted in week order
group_by(id) %>%
mutate(active_var = ifelse(week_orders != 0, "active", 
                  ifelse(first_active < as.numeric(sub("week", "", week_num)),
                         "lapsed", NA))) %>% # created a column that returns "active", "lapsed" or NA depending on user activity
     mutate(lapsed_num =  sequence(rle(active_var)[["lengths"]]), # counts the position within each run of identical values for a given id; the count restarts at 1 after every "active"
            final = ifelse(active_var == "active", active_var, 
                           ifelse(active_var == "lapsed", paste(active_var, lapsed_num, sep=""), NA))) %>% # finally, the column takes the "active" status or pastes "lapsed" together with the sequence number
select(id, first_active, week_num, week_orders, p_var, final) %>%
                           data.frame()
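
The run-counting step is the least obvious part of the pipeline, so here is a minimal sketch of it in isolation. The `status` vector is made up for illustration (it stands in for one id's `active_var` values in week order); `rle()` and `sequence()` are base R.

```r
# One user's weekly status in week order (hypothetical values for illustration)
status <- c(NA, "active", "lapsed", "lapsed", "active", "lapsed")

runs <- rle(status)                # consecutive runs of identical values (each NA is its own run)
run_pos <- sequence(runs$lengths)  # position of each element within its run

# Append the run position to "lapsed" entries only; "active" and NA pass through
final <- ifelse(status == "lapsed", paste0(status, run_pos), status)
final
# NA "active" "lapsed1" "lapsed2" "active" "lapsed1"
```

Because the position restarts with every new run, a "lapsed" streak that follows an "active" week always begins at 1 again.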

In the end, my data looked like this (first rows shown):

head(df_long, 25)
active_var id first_active week_num week_orders p_var   final
     <NA>  3            2    week1           0    p1    <NA>
   active  3            2    week2           1    p2  active
   lapsed  3            2    week3           0    p3 lapsed1
   lapsed  3            2    week4           0    p4 lapsed2
   lapsed  3            2    week5           0    p5 lapsed3
   active  5            1    week1           1    p1  active

So all I needed to do was to cast the data.frame (in two steps):

df_weeks = dcast(df_long[, 1:4], id + first_active ~ week_num,  value.var = "week_orders")

df_p = dcast(df_long[, c(1:2, 5:6)], id + first_active ~ p_var,  value.var = "final")

And join them:

df_solution = inner_join(df_weeks, df_p, by = c("id", "first_active"))

Voila!

df_solution
id first_active week1 week2 week3 week4 week5     p1      p2      p3      p4      p5
 3            2     0     1     0     0     0   <NA>  active lapsed1 lapsed2 lapsed3
 5            1     1     0     0     0     0 active lapsed1 lapsed2 lapsed3 lapsed4
 8            3     0     0     1     1     1   <NA>    <NA>  active  active  active
 9            1     1     1     1     1     0 active  active  active  active lapsed1
14            1     1     0     1     0     0 active lapsed1  active lapsed1 lapsed2
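
For anyone who still wants the loop from the question, here is a minimal base-R sketch of an alternative (it is not the approach I used above). Instead of reading the previous p-column's text, it keeps a running counter per row; `lapsed` and `week_cols` are names introduced just for this illustration, and it assumes the columns are named week1, week2, ... with no gaps. The ids are hard-coded to the sampled values shown in the question.

```r
# Same toy data as in the question, with fixed ids for reproducibility
df <- data.frame(
  id = c(9, 5, 14, 3, 8),
  first_active = c(1, 1, 1, 2, 3),
  week1 = c(1, 1, 1, 0, 0),
  week2 = c(1, 0, 0, 1, 0),
  week3 = c(1, 0, 1, 0, 1),
  week4 = c(1, 0, 0, 0, 1),
  week5 = c(0, 0, 0, 0, 1)
)

# Assumption: week columns are named week1..weekN without gaps
week_cols <- grep("^week[0-9]+$", names(df), value = TRUE)

# Per-row counter: weeks since the last active week; NA = not active yet
lapsed <- rep(NA_integer_, nrow(df))

for (i in seq_along(week_cols)) {
  active <- df[[paste0("week", i)]] != 0
  lapsed <- ifelse(active, 0L, lapsed + 1L)  # NA + 1 stays NA
  df[[paste0("p", i)]] <- ifelse(lapsed == 0L, "active", paste0("lapsed", lapsed))
}

df
```

Indexing with `[[` returns plain vectors rather than one-column data frames, which is probably what caused the `week[i]` column names in the question's attempt. The counter is unbounded, so the "lapsed" numbers keep rising for as many weeks as there are columns.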
