Group by id a dataset then create a new column subtracting different columns from current and subsequent row

Question

I have found similar problem like this, but not exactly the same. Here is my problem, I have the following data set:

> ds
  id      begin        end
1  1 2017-01-15 2017-01-17
2  1 2017-01-01 2017-01-03
3  2 2017-02-01 2017-02-28
4  4 2017-04-11 2017-05-11
5  3 2017-02-05 2017-02-10
6  4 2017-03-10 2017-03-20
7  1 2017-01-30 2017-02-03
8  3 2017-02-28 2017-03-09
9  4 2017-02-26 2017-03-05

I want to create the following column: check that verifies the following condition for each rows with the same id value:

ds[i,]$begin - ds[i-1,]$end < 30 => 1 # for each row i

otherwise is 0. When it is the first element of the group, then there is no previous information, so in such case the value of the new column check will be always zero too.

The difference with other questions (solved via: ave, dplyr) I have seen, is that I need to make a calculation that involves subsequent row but also not the same column.

Here is the code:

Defining the data set

id <- c("1", "1", "2", "4", "3", "4", "1", "3")
begin <- c("20170115", "20170101", "20170201",
    "20170411",
    "20170205", "20170310",
    "20170130", "20170228"
    )

end <- c("20170117", "20170103", "20170228",
    "20170511",
    "20170210", "20170320",
    "20170203", "20170309"
    )

ds <- data.frame(id = id, begin = as.Date(begin, "%Y%m%d"), end = as.Date(end, "%Y%m%d"))

Sorting the information (we need it for the current solution using a for-loop)

idx = order(rank(ds$id), ds$begin, decreasing = FALSE)
ds <- ds[idx,]

Now using the for-loop for assigning the control variable: check:

ds$check <- numeric(nrow(ds))
ds$check <- NA_integer_

nrep <- -1
for (i in 1:nrow(ds)) {
    rowi <- ds[i,]
    if (nrep == -1) {# Setting the first element of ds
        end.prev <- rowi$end
        id.prev <- rowi$id
        ds[i,]$check <- 0
        nrep = 1
    } else {
        id.current <- rowi$id
        if(id.prev == id.current) {
            ds[i,]$check <- ifelse(rowi$begin - end.prev < 30, 1, 0)
        } else {
             ds[i,]$check <- 0
        }
        end.prev <- rowi$end
        id.prev <- id.current
    }
}

Finally the expected output:

> ds
  id      begin        end check
2  1 2017-01-01 2017-01-03     0
1  1 2017-01-15 2017-01-17     1
7  1 2017-01-30 2017-02-03     1
3  2 2017-02-01 2017-02-28     0
5  3 2017-02-05 2017-02-10     0
8  3 2017-02-28 2017-03-09     1
6  4 2017-03-10 2017-03-20     0
4  4 2017-04-11 2017-05-11     1
>

Thanks for any hint.

Yes, it works! I don't know even how. Please can you explain a little bit how it works, there no so much documentation about `dplyr`, at least I found always very basic examples. Thanks. — David Leal, Feb 27 '17 at 16:01

Axeman · Accepted Answer · 2017-02-27T16:05:41.713

1

With dplyr, using lag makes this easy:

ds %>% 
  group_by(id) %>% 
  arrange(id, begin) %>% 
  mutate(check = c(0, as.numeric(begin - lag(end) < 30)[-1]))

Gives:

Source: local data frame [8 x 4]
Groups: id [4]

      id      begin        end check
  <fctr>     <date>     <date> <dbl>
1      1 2017-01-01 2017-01-03     0
2      1 2017-01-15 2017-01-17     1
3      1 2017-01-30 2017-02-03     1
4      2 2017-02-01 2017-02-28     0
5      3 2017-02-05 2017-02-10     0
6      3 2017-02-28 2017-03-09     1
7      4 2017-03-10 2017-03-20     0
8      4 2017-04-11 2017-05-11     1

Explanation:

First we group_by, so calculations are done by id.
We arrange to make sure we start with the earliest date.
For all checks, the first value is 0,
and the other values are simply whether the current begin minus the previous end is smaller than 30. We use as.numeric to convert from logical to numeric.

Also see: vignette('window-functions').

Note: I think the first value per group can simply be left at NA, since it is indeed unknown or undefined, and this will simplify it to:

ds %>% 
  group_by(id) %>%
  mutate(check = as.numeric(begin - lag(end, order_by = begin) < 30))

edited Feb 27 '17 at 16:05

answered Feb 27 '17 at 15:57

Axeman

32,068
8
81
94

In my real case I need to put a zero in the first row of the group. I understand for this second approach (with `NA`) in order to have the solution sorted we need to add: `arrange(id, begin) %>%`. What do you mean in your solution by: `[-1]` in this context? Good tip for vignette documentation. Can you share more reference documentation about this topic. It is still for me difficult to understand the syntax. Thanks. – David Leal Feb 27 '17 at 17:27
1

The code `[-1]` excludes the first element of the vector, so we can have 0 instead. That is all the docs there is, as far as I know. – Axeman Feb 27 '17 at 17:30
I found this web [page](https://github.com/hadley/dplyr) with several resources about `dplyr`package. Here is the [data camp](https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial) training. I shared an equivalent one information for `data.table` (comment on @ShenglinChen answer. – David Leal Feb 28 '17 at 03:20

score 1 · Answer 2 · answered Feb 27 '17 at 17:07

1

Use data.table:

setDT(ds)[,New:=as.numeric(begin-shift(end,fill=0)<30),id]

Use dplyr:

ds%>%group_by(id)%>%mutate(new=as.numeric(begin-lag(end,default=0)<30))%>%arrange(id)

answered Feb 27 '17 at 17:07

Shenglin Chen

4,504
11
11

1

You are assuming the dates are sorted, but they are not. Good call on the use of `default` though. – Axeman Feb 27 '17 at 17:16
@ShenglinChen very nice solution with `setDT` and `data.table`, and by the way this package has a very nice [tutorial](https://www.datacamp.com/courses/data-table-data-manipulation-r-tutorial). In both cases the result is not the expected. See for example the result for: `[1, "200170101", "20170103"]` you get `1` and it should be `0`. We need to specify a grouping first (by `id` then by `begin` in ascending order) otherwise it gets a wrong result I guess. – David Leal Feb 28 '17 at 00:10
@ShenglinChen I edited your solution adding the arrange clause, now it get the expected result. Once I will figure it out how to apply the same for `data.table`, and I will edit again the solution. – David Leal Feb 28 '17 at 18:26
@ShenglinChen, the edit was rejected, here is the change in order to get the expected result: `ds %>% group_by(id) %>% arrange(id, begin) %>% mutate(new = as.numeric(begin - lag(end, default = 0) < 30))` – David Leal Mar 01 '17 at 04:15

Group by id a dataset then create a new column subtracting different columns from current and subsequent row

2 Answers2