I've got a data frame df (roughly 331,000 observations of 4 variables) with Date (format "%Y-%m-%d"), ID (factor with 19 levels), Station (factor with 18 levels), and Presence (1/0).
The data frame is set up so that, over a period of almost three years, there is one row per Date for each ID at each Station, recording whether that person was present (1) or not (0) at that Station on that day.
If you subset/filter df by a single day and ID, you get the following dataset (which I'll refer to as a 'group' from now on):
filter(df, Date == "2016-01-03" & ID == "Fred")
Date       ID    Station   Presence
<date>     <fct> <fct>        <dbl>
2016-01-03 Fred  Station1         0
2016-01-03 Fred  Station2         0
2016-01-03 Fred  Station3         0
2016-01-03 Fred  Station4         1
2016-01-03 Fred  Station5         0
2016-01-03 Fred  Station6         0
2016-01-03 Fred  Station7         0
2016-01-03 Fred  Station8         0
2016-01-03 Fred  Station9         0
2016-01-03 Fred  Station10        0
2016-01-03 Fred  Station11        0
2016-01-03 Fred  Station12        0
2016-01-03 Fred  Station13        0
2016-01-03 Fred  Station14        0
2016-01-03 Fred  Station15        0
2016-01-03 Fred  Station16        0
2016-01-03 Fred  Station17        0
2016-01-03 Fred  Station18        0
I would like to remove rows from each group under the following conditions:
If the group contains any rows with Presence == 1, remove the rows with Presence == 0. A group can contain several rows with Presence == 1 (e.g. Fred was at Station4, Station9 and Station15 on 2016-01-06), and all of those should be kept.
But if the group contains no rows with Presence == 1, don't remove any of the rows (so I can't simply drop all the Presence == 0 rows).
The above group would thus become:
Date       ID    Station   Presence
<date>     <fct> <fct>        <dbl>
2016-01-03 Fred  Station4         1
However, the following group would stay as it is (as there are no rows with Presence == 1 within the group):
filter(df, Date == "2016-01-03" & ID == "Mark")
Date       ID    Station   Presence
<date>     <fct> <fct>        <dbl>
2016-01-03 Mark  Station1         0
2016-01-03 Mark  Station2         0
2016-01-03 Mark  Station3         0
2016-01-03 Mark  Station4         0
2016-01-03 Mark  Station5         0
2016-01-03 Mark  Station6         0
2016-01-03 Mark  Station7         0
2016-01-03 Mark  Station8         0
2016-01-03 Mark  Station9         0
2016-01-03 Mark  Station10        0
2016-01-03 Mark  Station11        0
2016-01-03 Mark  Station12        0
2016-01-03 Mark  Station13        0
2016-01-03 Mark  Station14        0
2016-01-03 Mark  Station15        0
2016-01-03 Mark  Station16        0
2016-01-03 Mark  Station17        0
2016-01-03 Mark  Station18        0
I've thought of starting with
df %>% group_by(Date, ID) %>%
However, I don't know how to proceed from there, or how to handle the two contrasting conditions.
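For reference, the rule described above can probably be written as a single grouped filter: keep a row if its Presence is 1, or if its group has no Presence == 1 rows at all. The following is only a minimal sketch, assuming dplyr is loaded; the small data frame toy and the name result are hypothetical stand-ins for the real df (three stations instead of 18), not part of the actual data.

```r
library(dplyr)

# Hypothetical toy data standing in for df: two groups on one day,
# Fred (present at Station4) and Mark (present nowhere).
toy <- data.frame(
  Date     = as.Date("2016-01-03"),
  ID       = factor(rep(c("Fred", "Mark"), each = 3)),
  Station  = factor(rep(c("Station3", "Station4", "Station5"), times = 2)),
  Presence = c(0, 1, 0, 0, 0, 0)
)

result <- toy %>%
  group_by(Date, ID) %>%
  # Keep rows with Presence == 1; if the group has no such rows,
  # !any(...) is TRUE for every row, so the whole group is kept.
  filter(Presence == 1 | !any(Presence == 1)) %>%
  ungroup()

print(result)
```

Inside a grouped filter(), any() is evaluated once per group and recycled across that group's rows, which is what lets the two contrasting conditions sit in one expression.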