In R, use mutate() to create a new column based on conditions by group

Question

For each person, there are two types of visits and for each visits, there are date records. The dataset looks like below.

p <-c(1,1,1,2,2,2,2,3,3,3,4)
type <- c(15,20,20,15,20,15,20,20,15,15,15)
date <- as.Date.factor(c("2014-02-03","2014-02-04","2014-02-06","2014-01-28","2014-02-03","2014-03-03","2014-03-13","2014-04-03","2014-04-09","2014-12-03","2014-04-05"))
d <- data.frame(p,type,date)

So now the dataset looks like this.

> d
   p type       date
1  1   15 2014-02-03
2  1   20 2014-02-04
3  1   20 2014-02-06
4  2   15 2014-01-28
5  2   20 2014-02-03
6  2   15 2014-03-03
7  2   20 2014-03-13
8  3   20 2014-04-03
9  3   15 2014-04-09
10 3   15 2014-12-03

Now, I'd like to create three new columns.

indicating whether a type 20 visit happens in 7 days after the type 15 visit, if yes then the indicator is 1, otherwise 0.(for example, for p2, in the line 4, this value should be 1, and in the line 6, this value should be 0)
What is the first date of type 20 visit happened in 7 days after the type 15 visit. If there is no type 20 visit in 7 days after the type 15, then keep it blank. (for example, for p1, the value should be 2014-02-04 instead of 2014-02-06)
How many days is between the type 15 visit and type 20 visit happened in 7 days from it. If there is no type 20 visit in 7 days after the type 15, then keep it blank.(for example, the value in line 1 should be 1)

I'm a super newbie in R, and basically have no idea of what to do. I tried a for loop within group, but it never works.

group_by(p)%>%
for(i in i:length(date)){
  *if(type[i]== 15 && date[i]+7 >= date[i+1:length(date)]){
  indicator = 1
  first_date = 
  days =* #Have no idea how to check in this part
} else {
  indicator = 0
  first_date = NA
  days = NA
}

The expected output is as below.

   p type       date ind first_date days
1  1   15 2014-02-03   1 2014-02-04    1 # = 2014-02-04 - 2014-02-03  
2  1   20 2014-02-04  NA       <NA>   NA
3  1   20 2014-02-06  NA       <NA>   NA
4  2   15 2014-01-28   1 2014-02-03    6 # = 2014-02-03 - 2014-01-28   
5  2   20 2014-02-03  NA       <NA>   NA
6  2   15 2014-03-03   0       <NA>   NA # since (2014-03-13 - 2014-03-03) > 7   
7  2   20 2014-03-13  NA       <NA>   NA
8  3   20 2014-04-03  NA       <NA>   NA #I don't care about the value for type 20 lines
9  3   15 2014-04-09   0       <NA>   NA
10 3   15 2014-12-03   0       <NA>   NA

So I come up with a new idea. What if we group records by p and type == 15.Then we can use subtraction within groups as days, and the rest will be easy.

I found one way in doing this:

 d[,group:= cumsum(type ==15)]

However, this will count group when encountering a new type 15 record. How to add p as another grouping condition?

You should ask one question at a time here. See http://stackoverflow.com/help/how-to-ask — Frank, Mar 20 '17 at 16:51
@Axeman Thanks! I struggled a little bit about it. Another hint showing that I know to little about R. sigh — Megan Wenjie Song, Mar 20 '17 at 17:00
@rawr yes, that is the case for p2. He has multiple 15 dates. and we need to check for each type 15 visit. Let me try your method. Thanks! Why R is so hard ; ( — Megan Wenjie Song, Mar 20 '17 at 17:03
@rawr It doesn't work, since we need to calculate the date - min(date[type == 15]) for each type == 15. I got the date difference for p2 is 0, 26,54,64, meaning it didn't calculate for the second type == 15 visit — Megan Wenjie Song, Mar 20 '17 at 17:16
@yeedle because there is no type 20 records in 7 days after these type 15 visit happened — Megan Wenjie Song, Mar 20 '17 at 19:00
What happens if a person has 2 type 15 visits and then a type 20 visits within 7 days of both type 15 visits, do both type 15 visits get a 1 in the ind column? — yeedle, Mar 20 '17 at 19:08
@yeedle then get a 1 for the latest type 15 visit. So I think grouping is a method. Make p and type == 15 as two conditions for grouping. But I don't know how to add a new column of group number into the dataset. — Megan Wenjie Song, Mar 20 '17 at 19:29
@MeganWenjieSong and yes r is hard in that there is a bit of a learning curve, more so than for other languages, but imo it can be exceptionally powerful yet elegant after you get over the hump — rawr, Mar 21 '17 at 03:09

yeedle · Accepted Answer · 2017-03-20T20:38:47.723

I took a stab at this. There's one caveat though: My answer assumes that after a type 15 visit occurs, the next visit within 7 days will be a type_20 visit. If that's not the case, i.e. there's another type 15 visit within 7 days, the first type 15 visit won't be considered, and only the second type 15 visit matters:

library(dplyr)
library(tidyr)
library(lubridate)

d %>% 
  mutate(rownum = 1:n()) %>%
  spread(type, date, sep="_")  %>% 
  group_by(p) %>%
  mutate(ind = ifelse(lead(type_20) - type_15 <= 7, 1, 0)) %>%
  mutate(ind = ifelse(is.na(ind), 0, ind)) %>%
  mutate(ind = ifelse(is.na(type_15), NA, ind)) %>%
  mutate(first_date = ifelse(ind == 1, lead(type_20), NA)) %>%
  mutate(first_date = as.Date(first_date, origin = lubridate::origin)) %>%
  mutate(days = first_date - type_15) %>%
  gather("type", "date", type_15, type_20) %>% 
  filter(!is.na(date)) %>% 
  arrange(p, date) %>%
  select(p, type, date, ind, first_date, days)

#       p    type       date   ind first_date    days
#   <dbl>   <chr>     <date> <dbl>     <date>  <time>
#1      1 type_15 2014-02-03     1 2014-02-04  1 days
#2      1 type_20 2014-02-04    NA       <NA> NA days
#3      1 type_20 2014-02-06    NA       <NA> NA days
#4      2 type_15 2014-01-28     1 2014-02-03  6 days
#5      2 type_20 2014-02-03    NA       <NA> NA days
#6      2 type_15 2014-03-03     0       <NA> NA days
#7      2 type_20 2014-03-13    NA       <NA> NA days
#8      3 type_20 2014-04-03    NA       <NA> NA days
#9      3 type_15 2014-04-09     0       <NA> NA days
#10     3 type_15 2014-12-03     0       <NA> NA days

Let me try to explain what I'm doing:

First the type and date columns are spread so that the type and date appear in separate columns (this makes it easier to compare dates of the two different type). Next, a couple of mutates. The first three apply the conditions outlined in the questions, as follows: if lead(type_20) - type_15 <= 7) that means there was a type 20 visit within 7 days of a type 15 visit, so we mark that as 1, else we mark as 0. After this, if ind is NA, we assume no type 20 visit was found so we also mark it as 0. In the third mutate we mark the type 15 NA lines as NA.

The next three mutate lines add the columns outlined in 2 and 3 in the question.

Finally, the columns are gathered back up to their previous format, redundant rows are filtered out, the dataframe is arranged by p and date, and the needed columns are selected.

I hope this is clear enough. It might be helpful to run the code line by line, stopping to view the transformed data frame after each line to see how the transformations act on the dataframe.

Wow, that' really cool. I think your assumption works. But when I try your codes, I got a error message: Error: Duplicate identifiers for rows (4, 6), (9, 10), (2, 3), (5, 7). — Megan Wenjie Song, Mar 20 '17 at 20:06
Ah, right. You need to add a column of the row numbers. See http://stackoverflow.com/questions/25960394/unexpected-behavior-with-tidyr#comment40693047_25960394 — yeedle, Mar 20 '17 at 20:10
@MeganWenjieSong I added a `mutate` to my code to fix that duplicate identifiers problem and have also added a missing `%>%` in one of the rows. Try to rerun the code to see if it works for you now. — yeedle, Mar 20 '17 at 20:39
thanks! when I run this code in a larger data, I found a problem. If there is only one type 15 record of a person, meaning this person visits in type 15 and never comes back, the ind should be 0, but the output is 1. — Megan Wenjie Song, Mar 21 '17 at 02:31
@yeedel sure. I made an adjustment for the dataset. You can copy it now. — Megan Wenjie Song, Mar 21 '17 at 02:55
you are right. In this small dataset it is right, and then in the large dataset, it is not. It's very weird. In my output for larger dataset, the first_day becomes the type_15 date. Have no clue. — Megan Wenjie Song, Mar 21 '17 at 03:39

rawr · Answer 2 · 2017-03-21T03:05:47.570

Here is a base R way. Generally, I prefer to create a function that does your task which can then be repeated on other pieces and debugged on test cases where it doesn't seem to work.

The first step is to define the pieces:

d <- structure(list(p = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
                    type = c(15, 20, 20, 15, 20, 15, 20, 20, 15, 15),
                    date = structure(c(16104, 16105, 16107, 16098, 16104, 16132, 16142, 16163, 16169, 16407), class = "Date")),
               .Names = c("p", "type", "date"),
               row.names = c(NA, -10L), class = "data.frame")

id <- with(d, {
  id <- ave(type, p, FUN = function(x) cumsum(x == 15))
  factor(paste0(p, id), unique(paste0(p, id)))
})

sp <- split(d, id)

So, sp creates a list of data frames to which we will apply a function. Each piece is a single unique p with at most one type == 15 (plus however many type == 20s follow.

The first two pieces are

sp[1:2]

# $`11`
#   p type       date
# 1 1   15 2014-02-03
# 2 1   20 2014-02-04
# 3 1   20 2014-02-06
# 
# $`21`
#   p type       date
# 4 2   15 2014-01-28
# 5 2   20 2014-02-03

And we can apply the function below on each one

first_date(sp[[1]])

#   p type       date ind first_date days
# 1 1   15 2014-02-03   1 2014-02-04    1
# 2 1   20 2014-02-04  NA       <NA>   NA
# 3 1   20 2014-02-06  NA       <NA>   NA

first_date(sp[[2]])

#   p type       date ind first_date days
# 4 2   15 2014-01-28   1 2014-02-03    6
# 5 2   20 2014-02-03  NA       <NA>   NA

Or all at once with a loop

(sp1 <- lapply(sp, first_date))
`rownames<-`(do.call('rbind', sp1), NULL)

#    p type       date ind first_date days
# 1  1   15 2014-02-03   1 2014-02-04    1
# 2  1   20 2014-02-04  NA       <NA>   NA
# 3  1   20 2014-02-06  NA       <NA>   NA
# 4  2   15 2014-01-28   1 2014-02-03    6
# 5  2   20 2014-02-03  NA       <NA>   NA
# 6  2   15 2014-03-03   0       <NA>   NA
# 7  2   20 2014-03-13  NA       <NA>   NA
# 8  3   20 2014-04-03  NA       <NA>   NA
# 9  3   15 2014-04-09   0       <NA>   NA
# 10 3   15 2014-12-03   0       <NA>   NA

You can take advantage of the arguments, like window, or any others you add without changing much of the function, for example, to change the window

(sp2 <- lapply(sp1, first_date, window = 14))
`rownames<-`(do.call('rbind', sp2), NULL)

#    p type       date ind first_date days ind first_date days
# 1  1   15 2014-02-03   1 2014-02-04    1   1 2014-02-04    1
# 2  1   20 2014-02-04  NA       <NA>   NA  NA       <NA>   NA
# 3  1   20 2014-02-06  NA       <NA>   NA  NA       <NA>   NA
# 4  2   15 2014-01-28   1 2014-02-03    6   1 2014-02-03    6
# 5  2   20 2014-02-03  NA       <NA>   NA  NA       <NA>   NA
# 6  2   15 2014-03-03   0       <NA>   NA   1 2014-03-13   10
# 7  2   20 2014-03-13  NA       <NA>   NA  NA       <NA>   NA
# 8  3   20 2014-04-03  NA       <NA>   NA  NA       <NA>   NA
# 9  3   15 2014-04-09   0       <NA>   NA   0       <NA>   NA
# 10 3   15 2014-12-03   0       <NA>   NA   0       <NA>   NA

first_date <- function(data, window = 7) {
  nr <- nrow(data)

  ## check at least one type 15 and > 1 row
  ty15 <- data$type == 15
  dt15 <- data$date[ty15]

  if (!any(ty15) | nr == 1L)
    return(cbind(data, ind = ifelse(any(ty15), 0, NA),
                 first_date = NA, days = NA))

  ## first date vector
  dts <- rep(min(data$date[!ty15]), nr)
  dts[!ty15] <- NA

  ## days from the type 15 date
  days <- as.numeric(data$date[!ty15] - min(dt15))
  days <- c(days, rep(NA, nr - length(days)))

  ## convert to NA if criteria not met
  to_na <- days > window | is.na(dts)
  days[to_na] <- dts[to_na] <- NA

  ## ind vector -- 1 or 0 if type 15, NA otherwise
  ind <- rep(NA, nr)
  ind[ty15] <- as.integer(!is.na(dts[ty15]))

  ## combine
  cbind(data, ind = ind, first_date = dts, days = days)
}

Hi rawr, I really like your codes, because it considered all possibilities. But the first_day is not the date of type 15 visit, but the date of the following type 20 visit if it happens in 7 days after the type 15. — Megan Wenjie Song, Mar 21 '17 at 02:52
@MeganWenjieSong sorry about that, made a small change, see edit — rawr, Mar 21 '17 at 02:59
Hi rawr, thanks very much for helping me. I already solved my problem using your coding, since it's very easy to understand and realize the function I need. Thanks for your time! But I will accept the answer of another guy, since he spent the most time with me and also solved this problem. I really appreciate and learned a lot! — Megan Wenjie Song, Mar 23 '17 at 01:27
@MeganWenjieSong no hard feelings :} happy to help, good luck with this R thing — rawr, Mar 23 '17 at 02:28

score 1 · Answer 3 · answered Mar 20 '17 at 21:26

If you're willing to use some functions from the purrr package and to use some custom functions, here is another option...

Packages you'll need

library(dplyr)
library(purrr)

Set up data (as per question)

p <-c(1,1,1,2,2,2,2,3,3,3)
type <- c(15,20,20,15,20,15,20,20,15,15)
date <- as.Date.factor(c("2014-02-03","2014-02-04","2014-02-06","2014-01-28","2014-02-03","2014-03-03","2014-03-13","2014-04-03","2014-04-09","2014-12-03"))
d <- data.frame(cbind(p,type,date))
d$date = as.Date(date)

Create custom functions that will work with the purrr map_* functions to iterate through your data frame and create ind and first_date.

# Function to manage ind
ind_manager <- function(type, date, dates_20) {
  if (type == 20)
    return (NA_integer_)

  checks <- map_lgl(dates_20, between, date, date + 7)
  return (as.integer(any(checks)))
}

# Function to manage first_date
first_date_manager <- function(ind, date, dates_20) {
  if (is.na(ind) || ind != 1)
    return (NA_character_)

  dates_20 <- dates_20[order(dates_20)]
  as.character(dates_20[which.max(date < dates_20)])
}

Save a vector of dates where type == 20 to be used as comparisons

dates_20 <- d$date[d$type == 20]

The final mutate() call

# mutate() call to create variables
d %>% 
  mutate(
    ind = map2_int(type, date, ind_manager, dates_20),
    first_date = as.Date(map2_chr(ind, date, first_date_manager, dates_20)),
    days = as.integer(first_date - date)
  )
#>    p type       date ind first_date days
#> 1  1   15 2014-02-03   1 2014-02-04    1
#> 2  1   20 2014-02-04  NA       <NA>   NA
#> 3  1   20 2014-02-06  NA       <NA>   NA
#> 4  2   15 2014-01-28   1 2014-02-03    6
#> 5  2   20 2014-02-03  NA       <NA>   NA
#> 6  2   15 2014-03-03   0       <NA>   NA
#> 7  2   20 2014-03-13  NA       <NA>   NA
#> 8  3   20 2014-04-03  NA       <NA>   NA
#> 9  3   15 2014-04-09   0       <NA>   NA
#> 10 3   15 2014-12-03   0       <NA>   NA

In R, use mutate() to create a new column based on conditions by group

3 Answers3

Linked