I have a data frame in R called `info` with several dates in the column `Date`, formatted as "%Y-%m-%d". I want to keep only the rows whose dates are fewer than 6 days apart and remove the "outliers". Does anyone know how this can be done?

This is what the data frame looks like:

> info
           Date   ens seps
3    1951-01-08 mem01    2
4    1951-01-12 mem01    4
37   1959-12-08 mem01    4
42   1959-12-30 mem01    3
43   1960-01-01 mem01    2
47   1961-01-03 mem01    2
49   1961-01-18 mem01    2
50   1961-01-20 mem01    2
62   1964-11-29 mem01    4
93   1971-02-12 mem01    2
99   1972-02-15 mem01    2
100  1972-02-18 mem01    3
102  1972-02-21 mem01    2
119  1981-10-16 mem01    3
121  1981-10-19 mem01    2
131  1984-12-24 mem01    2
134  1987-01-02 mem01    2
Judith
  • 2
    Please, provide a reproducible example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Terru_theTerror Mar 22 '18 at 09:22
  • 2
    Besides the need for a reproducible example, one approach is very straightforward: compute the difference between each pair of consecutive records using the `lag()` function. Then, based on the value of this new variable, you can remove all records that are 6 or more days after the previous record. – Seymour Mar 22 '18 at 09:35

2 Answers

If I understood the question correctly, you can try the following (`df` is the sample data given below):

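library(dplyr)

df %>%
  arrange(Date) %>%
  # days since the previous row (NA for the first row)
  mutate(date_diff = as.numeric(Date - lag(Date))) %>%
  # keep a row if it is close to the previous row or to the next one
  filter(date_diff < 6 | lead(date_diff) < 6) %>%
  select(-date_diff)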

Output is:

         Date   ens seps
1  1951-01-08 mem01    2
2  1951-01-12 mem01    4
3  1959-12-30 mem01    3
4  1960-01-01 mem01    2
5  1961-01-18 mem01    2
6  1961-01-20 mem01    2
7  1972-02-15 mem01    2
8  1972-02-18 mem01    3
9  1972-02-21 mem01    2
10 1981-10-16 mem01    3
11 1981-10-19 mem01    2
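
To see why the filter checks both `date_diff` and `lead(date_diff)`, it may help to look at the intermediate columns. The following is only a sketch on a toy data frame built from the first three dates in the question; the names `toy`, `toy_gaps`, `gap_prev`, `gap_next` and `keep` are made up for illustration and are not part of the answer above.

```r
library(dplyr)

# Toy data (illustrative): two close dates followed by one isolated date
toy <- data.frame(Date = as.Date(c("1951-01-08", "1951-01-12", "1959-12-08")))

toy_gaps <- toy %>%
  arrange(Date) %>%
  mutate(
    gap_prev = as.numeric(Date - lag(Date)),  # days since the previous row (NA for the first row)
    gap_next = lead(gap_prev),                # days until the next row (NA for the last row)
    # treat a missing gap as "not close" so the flag is never NA
    keep     = coalesce(gap_prev < 6, FALSE) | coalesce(gap_next < 6, FALSE)
  )

print(toy_gaps)
```

The first row has no previous gap, so it can only be kept through `gap_next`; the last row can only be kept through `gap_prev`. Checking a single direction would drop one member of every close pair.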

Sample data:

df <- structure(list(Date = structure(c(-6933, -6929, -3677, -3655, 
-3653, -3285, -3270, -3268, -1859, 407, 775, 778, 781, 4306, 
4309, 5471, 6210), class = "Date"), ens = c("mem01", "mem01", 
"mem01", "mem01", "mem01", "mem01", "mem01", "mem01", "mem01", 
"mem01", "mem01", "mem01", "mem01", "mem01", "mem01", "mem01", 
"mem01"), seps = c(2L, 4L, 4L, 3L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 
3L, 2L, 3L, 2L, 2L, 2L)), .Names = c("Date", "ens", "seps"), row.names = c("3", 
"4", "37", "42", "43", "47", "49", "50", "62", "93", "99", "100", 
"102", "119", "121", "131", "134"), class = "data.frame")
Prem

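A possibility using base R would be the following. Note that it always keeps the first row and otherwise keeps a row only when it is fewer than 6 days after the previous row.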

inx <- c(TRUE, diff(info$Date) < 6)
new_info <- info[inx, ]
new_info 
#          Date   ens seps
#3   1951-01-08 mem01    2
#4   1951-01-12 mem01    4
#43  1960-01-01 mem01    2
#50  1961-01-20 mem01    2
#100 1972-02-18 mem01    3
#102 1972-02-21 mem01    2
#121 1981-10-19 mem01    2
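
Because `inx` only looks backwards, the first date of each close group (for example 1959-12-30) is dropped. If the intent is to keep both members of a close pair, one possible variant is sketched below. It assumes the rows are already sorted by `Date`; the data frame is rebuilt here from the first five rows of the question so the example is self-contained, and the names `gaps`, `close_prev` and `close_next` are made up for illustration.

```r
# First five rows of the question's data, rebuilt for a self-contained example
info <- data.frame(
  Date = as.Date(c("1951-01-08", "1951-01-12", "1959-12-08",
                   "1959-12-30", "1960-01-01")),
  seps = c(2L, 4L, 4L, 3L, 2L)
)

gaps       <- as.numeric(diff(info$Date))  # days between consecutive rows
close_prev <- c(FALSE, gaps < 6)           # fewer than 6 days after the previous row
close_next <- c(gaps < 6, FALSE)           # fewer than 6 days before the next row

# Keep a row if it is close to either neighbour
new_info <- info[close_prev | close_next, ]
new_info
```

On these five rows this keeps 1951-01-08, 1951-01-12, 1959-12-30 and 1960-01-01, and drops the isolated 1959-12-08.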
Rui Barradas