0

I am fairly new to R and I have a question on how to keep only certain values based on an ID and a date. I have a (quite big) dataset that looks like the following example:

ID Type Date
1  OUT 2016-06-18
1  OUT 2016-06-18
1  OUT 2016-06-18
1  IN  2016-06-25
1  OUT 2016-06-25
2  IN  2016-07-03
2  OUT 2016-07-03

My question now is how can I find dates that contain ONLY one of the types (IN or OUT) and remove those from the data. I would however like to keep the date if the type is a pair (IN and OUT) and if the ID value is the same.

Is there a way how to do this in R?

Uwe
  • 41,420
  • 11
  • 90
  • 134
xathrin
  • 21
  • 2
  • Possible duplicate of [How to efficiently filter a data frame?](https://stackoverflow.com/questions/7106330/how-to-efficiently-filter-a-data-frame) – Wimpel Oct 16 '18 at 19:06
  • It'll help if you can also provide the expected output of your test data set. – Shree Oct 16 '18 at 19:28

3 Answers3

1

If I understood your requirement correctly, here's a simple way using dplyr package -

df %>%
  group_by(ID, Date) %>%
  filter(n_distinct(Type) > 1)

# A tibble: 4 x 3
# Groups:   ID, Date [2]
     ID Type  Date      
  <int> <chr> <chr>     
1     1 IN    2016-06-25
2     1 OUT   2016-06-25
3     2 IN    2016-07-03
4     2 OUT   2016-07-03

Another way using ave() from base R -

df[with(df, ave(Type, ID, Date, FUN = function(x) length(unique(x)))) == 2, ]

  ID Type       Date
4  1   IN 2016-06-25
5  1  OUT 2016-06-25
6  2   IN 2016-07-03
7  2  OUT 2016-07-03
Shree
  • 10,835
  • 1
  • 14
  • 36
0

Here's a way to do this with dplyr. This looks for all ID + Date combos that have at least one of each In and Out.

has_both <- df1 %>%
  count(ID, Date, Type) %>%  # How many rows with each combo ID / Date / Type
  count(ID, Date) %>% # How many rows appear for each ID / Date
  filter(nn == 2) %>% # Only keep where 2 types (IN and OUT, presumably)
  left_join(df1)  %>% # Bring back matching original data

Output

has_both
# A tibble: 4 x 4
     ID Date          nn Type 
  <int> <chr>      <int> <chr>
1     1 2016-06-25     2 IN   
2     1 2016-06-25     2 OUT  
3     2 2016-07-03     2 IN   
4     2 2016-07-03     2 OUT 
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
0

For the sake of completeness, here are also some solutions:

library(data.table)

setDT(df)[, if (uniqueN(Type) > 1) .SD, by = .(ID, Date)]
   ID       Date Type
1:  1 2016-06-25   IN
2:  1 2016-06-25  OUT
3:  2 2016-07-03   IN
4:  2 2016-07-03  OUT

Within each ID, Date group only those subsets of df are returned for which there is more than one distinct Type.


This can also be written as:

setDT(df)[, .SD[uniqueN(Type) > 1], by = .(ID, Date)]

There is also a variant which finds ID and Date combinations which fulfill the requirement and subsets df by joining:

setDT(df)[df[, uniqueN(Type), by = .(ID, Date)][V1 > 1], on = .(ID, Date), .SD]
   ID Type       Date
1:  1   IN 2016-06-25
2:  1  OUT 2016-06-25
3:  2   IN 2016-07-03
4:  2  OUT 2016-07-03

Data

df <-readr::read_delim(
"ID Type Date
1  OUT 2016-06-18
1  OUT 2016-06-18
1  OUT 2016-06-18
1  IN  2016-06-25
1  OUT 2016-06-25
2  IN  2016-07-03
2  OUT 2016-07-03", 
delim = " ", trim_ws = TRUE)
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134