I have a data frame called "diff2" containing two different time point columns ("original" and "time_point"), the differences (in hours) between those time points in the same row, and an ID corresponding to "original". Below is an example of a snippet of the data frame:

 diff            original          time_point ID
32  130 2012-12-16 04:59:32 2012-12-21 14:57:04  5
41  106 2012-12-16 06:01:02 2012-12-20 15:57:14  6
42  107 2012-12-16 06:01:02 2012-12-20 16:56:59  6
43  108 2012-12-16 06:01:02 2012-12-20 17:56:49  6
44  129 2012-12-16 06:01:02 2012-12-21 14:57:04  6
45  130 2012-12-16 06:01:02 2012-12-21 15:56:54  6
49  104 2012-12-16 06:59:52 2012-12-20 14:59:29  7
50  105 2012-12-16 06:59:52 2012-12-20 15:57:14  7
51  106 2012-12-16 06:59:52 2012-12-20 16:56:59  7
52  107 2012-12-16 06:59:52 2012-12-20 17:56:49  7
53  108 2012-12-16 06:59:52 2012-12-20 18:57:24  7
54  109 2012-12-16 06:59:52 2012-12-20 19:56:59  7

Many of the dates in "original" have dates in "time_point" in common. For example, the date 2012-12-20 15:57:14 in "time_point" is shared by dates 2012-12-16 06:01:02 (ID #6) and 2012-12-16 06:59:52 (ID #7) in "original". I first need to find the dates in "time_point" that are common to more than one "original". Then, for each common "time_point" date, I need to determine the earliest "original" date with which it is associated. That common "time_point" date then needs to be removed from all the other "originals" it is associated with. The resulting data frame I expect is the following:

 diff            original          time_point ID
32  130 2012-12-16 04:59:32 2012-12-21 14:57:04  5
41  106 2012-12-16 06:01:02 2012-12-20 15:57:14  6
42  107 2012-12-16 06:01:02 2012-12-20 16:56:59  6
43  108 2012-12-16 06:01:02 2012-12-20 17:56:49  6
44  129 2012-12-16 06:01:02 2012-12-21 14:57:04  6
45  130 2012-12-16 06:01:02 2012-12-21 15:56:54  6
49  104 2012-12-16 06:59:52 2012-12-20 14:59:29  7
53  108 2012-12-16 06:59:52 2012-12-20 18:57:24  7
54  109 2012-12-16 06:59:52 2012-12-20 19:56:59  7

I have no idea how to go about this other than maybe a loop comparing IDs pair-wise and determining whether there are "time_point" dates in common.
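In lieu of `dput` output, the snippet above can be reconstructed as follows (a minimal sketch; the column classes are assumed, with the date columns parsed as `POSIXct`):

```r
## Reconstruction of the sample data shown above (assumed structure)
diff2 <- data.frame(
  diff = c(130, 106, 107, 108, 129, 130, 104, 105, 106, 107, 108, 109),
  original = as.POSIXct(c("2012-12-16 04:59:32",
                          rep("2012-12-16 06:01:02", 5),
                          rep("2012-12-16 06:59:52", 6))),
  time_point = as.POSIXct(c("2012-12-21 14:57:04", "2012-12-20 15:57:14",
                            "2012-12-20 16:56:59", "2012-12-20 17:56:49",
                            "2012-12-21 14:57:04", "2012-12-21 15:56:54",
                            "2012-12-20 14:59:29", "2012-12-20 15:57:14",
                            "2012-12-20 16:56:59", "2012-12-20 17:56:49",
                            "2012-12-20 18:57:24", "2012-12-20 19:56:59")),
  ID = c(5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7)
)
```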

P_Grace
  • Please read on [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). In particular, use `dput` to provide your data. – kangaroo_cliff Feb 01 '18 at 03:25

2 Answers

```r
library(dplyr)

diff2 %>%
  group_by(time_point) %>%
  mutate(counts = n()) %>%          # count the occurrences of each time_point
  filter(counts > 1) %>%            # remove rows for singular time_points
  arrange(time_point, original) %>% # put the earliest original first within each time_point
  slice(1) %>%                      # take only the top row of each time_point group
  ungroup()
```
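Note that this keeps only one row per duplicated `time_point` and drops the singletons, which differs from the expected output in the question. A hedged alternative sketch, assuming `diff2` has the structure shown in the question, keeps all rows belonging to the earliest `original` for each `time_point` (singletons are trivially kept, since the minimum of a single value is itself):

```r
library(dplyr)

## For each time_point, keep only the rows whose "original" is the
## earliest; rows with a unique time_point pass the filter unchanged.
result <- diff2 %>%
  group_by(time_point) %>%
  filter(original == min(original)) %>%
  ungroup()
```

Under this reading, the shared date 2012-12-21 14:57:04 is kept only for ID 5 (the earliest `original`), so row 44 is also removed, matching the interpretation in the other answer.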
David Klotz
  • When I run this code, I get the warning "Column `original` must be a 1d atomic vector or a list" – P_Grace Feb 01 '18 at 02:22
  • Works for me on this small sample -- maybe remove the commented sections and try again. – David Klotz Feb 01 '18 at 02:32
  • It appears my "original" column is actually a data frame and this may be why the code is not working, so I changed it to a character class. When I run the code once more, the following error shows: "Error in arrange_impl(.data, dots) : incorrect size (934) at position 1, expecting : 925" – P_Grace Feb 01 '18 at 03:08

A bit of a function-based approach (assuming your data is of class data.frame):

```r
## Finding the duplicated time points (indices of the repeats,
## excluding each first occurrence)
duplicated_time_points <- which(duplicated(data$time_point))

## Finding the earliest "original" for a duplicated "time_point"
find.earliest.original <- function(time.point.duplicate, data) {

    ## Extract the originals sharing this time_point
    originals <- data$original[which(data$time_point == data$time_point[time.point.duplicate])]

    ## Return the earliest original
    return(min(format(originals, format = "%Y-%m-%d %H:%M:%S")))
}

## Applying this function to each duplicated date
early_originals <- sapply(duplicated_time_points, find.earliest.original, data)

## Removing the time points that do not correspond to the earliest original from the data
remove.not.earliest.original <- function(time.point.duplicate, data) {
    ## Selecting the rows that share the duplicated time_point
    sub_data <- which(data$time_point == data$time_point[time.point.duplicate])

    ## Selecting the rows in that subset that are not the earliest original
    return(sub_data[which(data$original[sub_data] != find.earliest.original(time.point.duplicate, data))])
}

## Applying this function to each duplicated date
columns_to_remove <- sapply(duplicated_time_points, remove.not.earliest.original, data)

## Removing the rows
data <- data[-columns_to_remove, ]
```

Note that the early_originals variable is not used but can be useful to check what's going on.
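One caveat worth hedging: when the duplicated groups have different sizes, `sapply` cannot simplify its result to a vector and returns a list, and negating a list produces "invalid argument to unary operator". A sketch of a defensive adjustment, flattening before the negative indexing:

```r
## sapply() may return a list when groups differ in size; flatten and
## deduplicate the row indices before using them for negative indexing.
columns_to_remove <- unique(unlist(columns_to_remove))
if (length(columns_to_remove) > 0) {
    data <- data[-columns_to_remove, ]
}
```

The `length()` guard also covers the case where no duplicates are found, since `data[-integer(0), ]` would otherwise drop every row's complement incorrectly in some indexing patterns.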

This should lead to:

    X diff            original          time_point ID
1  32  130 2012-12-16 04:59:32 2012-12-21 14:57:04  5
2  41  106 2012-12-16 06:01:02 2012-12-20 15:57:14  6
3  42  107 2012-12-16 06:01:02 2012-12-20 16:56:59  6
4  43  108 2012-12-16 06:01:02 2012-12-20 17:56:49  6
6  45  130 2012-12-16 06:01:02 2012-12-21 15:56:54  6
7  49  104 2012-12-16 06:59:52 2012-12-20 14:59:29  7
11 53  108 2012-12-16 06:59:52 2012-12-20 18:57:24  7
12 54  109 2012-12-16 06:59:52 2012-12-20 19:56:59  7

This assumes you actually wanted to remove the row with index 44 (ID 6, shared time_point 2012-12-21 14:57:04) and that keeping it in your expected output above was an omission.

Thomas Guillerme
  • The very last line returns "Error in -columns_to_remove : invalid argument to unary operator", any idea why this may be? – P_Grace Feb 01 '18 at 03:15
  • Do `early_originals` or `columns_to_remove` actually have any data? This code will work (hopefully) if the following are `TRUE`: `class(data) == "data.frame"`; `length(data$original) > 0`; `length(data$time_point) > 0`. – Thomas Guillerme Feb 01 '18 at 03:27
  • "early_originals" does have data, however it has many entries of the same date. "columns to remove" also has data. The data is a data frame, and both of the other conditions are also met. – P_Grace Feb 01 '18 at 03:36