3

I'm trying to analyze game data but I need to remove of all rows after a specified row.

In the following case I want to remove all rows after the EVENT "Die" for each users. Data is sorted by UID, TIME.HOUR.

df:

 UID  TIME.HOUR  EVENT
  1      5       Run
  1      5       Run
  1      6       Run
  1      7       Die
  1      8       Run
  1      9       Run
  2      14      Jump
  2      15      Die
  2      16      Run
  2      17      Run

Expected result:

 UID  TIME.HOUR  EVENT
  1      5       Run
  1      5       Run
  1      6       Run
  1      7       Die
  2      14      Jump
  2      15      Die

I think i'm on the right track with the code below but don't struggling with the next step.

 args <- which(df$EVENT== "Die")
 df[,c(sapply(args, function(x) ???), by = UID] #seq? range? 

Thank you.

ant
  • 565
  • 1
  • 6
  • 12

3 Answers3

4

We can use data.table. Convert the 'data.frame' to 'data.table', grouped by 'UID', get a double cumsum of logical vector (EVENT == "Die"), check whether it is less than 2 to Subset the Data.table (.SD)

library(data.table)
setDT(df)[, .SD[cumsum(cumsum(EVENT=="Die"))<2] , UID]
#   UID TIME.HOUR EVENT
#1:   1         5   Run
#2:   1         5   Run
#3:   1         6   Run
#4:   1         7   Die
#5:   2        14  Jump
#6:   2        15   Die

Or a faster approach: to get the row index, extract that column ($V1) to subset the data

setDT(df)[df[, .I[cumsum(cumsum(EVENT=="Die"))<2] , UID]$V1]

Or a modification of @Psidom's approach

setDT(df)[df[, .I[seq(match("Die", EVENT, nomatch = .N))] , UID]$V1]

Or use dplyr

library(dplyr)
df %>%
   group_by(UID) %>% 
   slice(seq(match("Die", EVENT, nomatch = n())))
#    UID TIME.HOUR EVENT
#  <int>     <int> <chr>
#1     1         5   Run
#2     1         5   Run
#3     1         6   Run
#4     1         7   Die
#5     2        14  Jump
#6     2        15   Die

In case, we need a data.frame output, chain with %>% as.data.frame (from @R.S. comments)

Community
  • 1
  • 1
akrun
  • 874,273
  • 37
  • 540
  • 662
  • This about `dplyr` part-- That's nifty, and quite flexible. The` slice` part needs `seq` because it will otherwise return only the matching row. Did I get it right? And `nomatch=n()` means go till end in case of no match? BTW, maybe you can also suggest adding `%>% ungroup() %>% as.data.frame() ` at the end to get a cleaned up dataframe . – R.S. Dec 07 '16 at 03:04
  • @R.S. You are right, it will only get the matching row if you don't use the `seq`, and `nomatch = n()` is for special cases where there are no matches for a particular UID. Some people prefer to have their output as `tbl_df`. So, I will leave it as that and add your comments – akrun Dec 07 '16 at 03:08
  • 1
    Ah. Thanks. The insight helps. – R.S. Dec 07 '16 at 03:09
3

This probably isn't so efficient, but you could do a fancy join:

mdf = df[EVENT == "Die", head(.SD, 1L), by=UID, .SDcols = "TIME.HOUR"]
df[!mdf, on=.(UID, TIME.HOUR > TIME.HOUR)]

   UID TIME.HOUR EVENT
1:   1         5   Run
2:   1         5   Run
3:   1         6   Run
4:   1         7   Die
5:   2        14  Jump
6:   2        15   Die

You don't actually need to save the mdf table as a separate object, of course.


How it works

  • x[!i], where i is another data.table or list of vectors, is an anti-join, telling R to exclude rows of x based on i, similar to how it works with vectors (where i would have to be a logical vector).

  • The on=.(ID, v >= v) option tells R that we're doing a "non-equi join." The v >= v part means that the v from i (on the left-hand side) should be greater than the v from x (on the right-hand side).

Combining these two, we're excluding rows that meet the criteria specified in the on=.


Side notes. There are a couple things I'm not sure about: Do we have a better name than non-equi join? Why is the v on the left from i even though x[i] has x to the left of i?

I borrowed from both Psidom and akrun's answer, using head and an inequality. One (maybe?) advantage here is that head(.SD, 1L) is optimized while head(.SD, expr) is not yet.

Frank
  • 66,179
  • 8
  • 96
  • 180
2

Another option, you can use head() with match()(find the first Die index):

dt[, head(.SD, match("Die", EVENT, nomatch = .N)), UID]   # if no match is found within the 
                                                          # group, return the whole group

#   UID TIME.HOUR EVENT
#1:   1         5   Run
#2:   1         5   Run
#3:   1         6   Run
#4:   1         7   Die
#5:   2        14  Jump
#6:   2        15   Die
Psidom
  • 209,562
  • 33
  • 339
  • 356