0

I want to eliminate rows that are repited in COD but only those who have a diference in time less than 5 minutes, for example. And I want that one row of the repited COD which satisfy the condition remains. I would like that remain the last one. IF I have this data:

  COD  |   Time               | score   | position |
-------|----------------------|---------|----------|
  xx4  | 2016-07-19 10:15:30  |5452     |  2454    |
  xf5  | 2016-07-19 09:23:30  |5321     |  342     |
  xr1  | 2016-07-19 12:15:30  |5232     |  2328    |
  xx4  | 2016-07-19 11:20:20  |1322     |  2432    |
  xx4  | 2016-07-19 10:18:30  |2344     |  2534    |
  xr1  | 2016-07-19 12:17:30  |8676     |  4566    |
  xx4  | 2016-07-19 10:15:50  |9445     |  7655    |

The result i looking for:

  COD  |   Time               | score   | position |
-------|----------------------|---------|----------|
  xx4  | 2016-07-19 10:15:30  |5452     |  2454    |
  xf5  | 2016-07-19 09:23:30  |5321     |  342     |
  xr1  | 2016-07-19 12:15:30  |5232     |  2328    |
  xx4  | 2016-07-19 11:20:20  |1322     |  2432    |

The time is in format POSIXct .How can I do this in R?

Matias.10
  • 53
  • 6

2 Answers2

1

You can do this in the dplyr suite. Group by COD and then use the lag() function to compare one time to the previous time.

new_data <- orig_data %>%
group_by(COD)%>% 
arrange(Time) %>%
mutate(timediff = difftime(Time,lag(Time), units = "mins"),
     too_soon = timediff<5, 
     too_soon = ifelse(is.na(too_soon), FALSE, too_soon)) %>%
filter(too_soon == FALSE) %>%
select(-too_soon)

(edit: caught NA's for the first row in each group, and used base::difftime() )

BarkleyBG
  • 664
  • 5
  • 16
  • FYI: `ifelse(Time - lag(Time) < 5, TRUE, FALSE)` is the same as `Time - lag(Time) < 5` – Jaap Jul 19 '16 at 20:38
  • Yes, thanks! Sometimes I just like seeing the if/then statement for clarity. (Although with the new `if_else()` function in dplyr, I could specify `if_else(Time-lag(Time)<5, TRUE, FALSE, FALSE)` and skip the NA issue altogether. ) – BarkleyBG Jul 19 '16 at 20:48
1

Using dplyr:

library(dplyr)

df %>% group_by(COD) %>% arrange(Time) %>% 
        mutate(Keep = ifelse(abs(difftime(Time, lag(Time), units = "mins")) > 5, T, F)) %>%
        filter(is.na(Keep) | Keep == T) %>% select(-Keep)


Source: local data frame [4 x 4]
Groups: COD [3]

     COD                Time score position
  <fctr>              <time> <int>    <int>
1    xf5 2016-07-19 09:23:30  5321      342
2    xx4 2016-07-19 10:15:30  5452     2454
3    xx4 2016-07-19 11:20:20  1322     2432
4    xr1 2016-07-19 12:15:30  5232     2328

Data

structure(list(COD = structure(c(3L, 1L, 2L, 3L, 3L, 2L, 3L), .Label = c("xf5", 
"xr1", "xx4"), class = "factor"), Time = structure(c(1468937730, 
1468934610, 1468944930, 1468941620, 1468937910, 1468945050, 1468937750
), class = c("POSIXct", "POSIXt"), tzone = ""), score = c(5452L, 
5321L, 5232L, 1322L, 2344L, 8676L, 9445L), position = c(2454L, 
342L, 2328L, 2432L, 2534L, 4566L, 7655L)), .Names = c("COD", 
"Time", "score", "position"), row.names = c(NA, -7L), class = "data.frame")
Sumedh
  • 4,835
  • 2
  • 17
  • 32
  • I dont now yet how to use dput properly, but your solution works. Can I add another column in the group_by? For example, Score? Thank you so much – Matias.10 Jul 19 '16 at 21:05
  • @Matias.10, please take a look [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Just do `dput(YourDataframe)` and the output would be similar to something I posted in my solution (as Data). Why would you want to add `Score` in `group_by`? Could you explain your intention? – Sumedh Jul 19 '16 at 21:38
  • So that do not eliminate the ones that have different scores, but I already figured out, it is just add that column in the group_by. Can you tell me how can I modify the code so I can keep the result that have the last hour? I posted that it does not matter which row remains but now I need to keep the last measure. – Matias.10 Jul 20 '16 at 20:58
  • Last hour or the latest (last) time? If its the latter, just use `arrange(desc(Time))` instead of `arrange(Time)`. – Sumedh Jul 20 '16 at 21:16