0

Thank you, experts for previous answers (How to filter by range of dates in R?)

I am still having some problems dealing with the data.

Example:

id  q   date
a   1   01/01/2021
a   1   01/01/2021
a   1   21/01/2021
a   1   21/01/2021
a   1   12/02/2021
a   1   12/02/2021
a   1   12/02/2021
a   1   12/02/2021

My idea is to eliminate the observations that have more than 3 "units" in a period of 30 days. That is, if "a" has a unit "q" on "12/02/2021" [dd/mm]yyyy]: (a) if between 12/01/2021 and 12/02/2021 there are already 3 observations it must be deleted . (b) If there are less than 3 this one must remain.

My expected result is:

p   q   date
a   1   01/01/2021
a   1   01/01/2021
a   1   21/01/2021
a   1   12/02/2021
a   1   12/02/2021
a   1   12/02/2021

With this code:

df <- df %>%
  mutate(day = dmy(data))%>%
  group_by(p) %>%
  arrange(day, .by_group = TRUE) %>%
  mutate(diff = day - first(day)) %>%
  mutate(row = row_number()) %>%
  filter(row <= 3 | !diff < 30)

But the result is:

P   Q   DATE        DAY       DIFF  ROW
a   1   1/1/2021    1/1/2021    0   1
a   1   1/1/2021    1/1/2021    0   2
a   1   21/1/2021   21/1/2021   20  3
a   1   12/2/2021   12/2/2021   42  5
a   1   12/2/2021   12/2/2021   42  6
a   1   12/2/2021   12/2/2021   42  7
a   1   12/2/2021   12/2/2021   42  8

The main problem is that the diff variable must count days in periods of 30 days from the last day of the previous 30-days period - not since the first observation day.

Any help? Thanks

Amc
  • 131
  • 8
  • Can you explain why your expected results is as is? – Maël Feb 17 '22 at 12:18
  • 1
    Hi Mael, because my goal is to delete the observations above 3 within 30-days-period (12/2/2021 - 21/1/2021) not with respect to the first observation day (1/1/2021). In consequence, row 8 must be deleted. Thanks – Amc Feb 17 '22 at 12:33

1 Answers1

1

Using floor_date it is quite straighforward:

library(lubridate)
library(dplyr)
df %>% 
  group_by(floor = floor_date(date, '30 days')) %>% 
  slice_head(n = 3) %>% 
  ungroup() %>% 
  select(-floor)

# A tibble: 6 x 3
  id        q date      
  <chr> <int> <date>    
1 a         1 2021-01-01
2 a         1 2021-01-01
3 a         1 2021-01-21
4 a         1 2021-02-12
5 a         1 2021-02-12
6 a         1 2021-02-12

data

df <- read.table(header = T, text = "id  q   date
a   1   01/01/2021
a   1   01/01/2021
a   1   21/01/2021
a   1   21/01/2021
a   1   12/02/2021
a   1   12/02/2021
a   1   12/02/2021
a   1   12/02/2021")

df$date<-as.Date(df$date, format = "%d/%m/%Y")
Maël
  • 45,206
  • 3
  • 29
  • 67
  • Let me explain better myself. My idea is to eliminate the observations that have more than 3 "units" in a period of 30 days. That is, if "a" has a unit "q" on "12/02/2021": (a) if between 12/01/2021 and 12/02/2021 there are already 3 observations it must be deleted . (b) If there are less than 3 this one must remain. Thanks!! – Amc Feb 17 '22 at 13:08
  • I'm not sure I understand. Can you explain why the second observation of '21/01/2021" is deleted? – Maël Feb 17 '22 at 13:17
  • Because the second observation of "21/01/2021" represents the fourth unit of "q" in less that 30 days and maximum is 3. – Amc Feb 17 '22 at 13:21
  • Alright, and then after that do you set the counter to 0? I don't really understand how is that different from the answer. – Maël Feb 17 '22 at 13:28
  • Aps, now I understand! Thank you very much. An extra question: If I have more than one "id" (a,b,c,d,e) how I command the same process for each "id"? Many thanks, expert!! – Amc Feb 17 '22 at 13:39
  • You can add `id` in the `group_by` command. – Maël Feb 17 '22 at 13:42