Subsetting datasets in R using presence of keywords

Question

I have this dataset:

> dput(SampleEvents)
structure(list(Event = structure(c(10L, 5L, 6L, 11L, 10L, 7L, 
11L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 11L), .Label = c("e10", "e11", 
"e12", "e13", "e2", "e3", "e6", "e8", "e9", "Login", "Logout"
), class = "factor"), Transaction.ID = structure(c(NA, 1L, NA, 
2L, NA, NA, NA, NA, 3L, NA, NA, NA, NA, NA, NA), .Label = c("t1", 
"t4", "t5"), class = "factor"), User.ID = structure(c(1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("kenn1", 
"kenn2"), class = "factor"), Event.Date = structure(c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "20/5/2017", class = "factor"), 
    Event.Time = structure(c(12L, 13L, 14L, 15L, 1L, 2L, 3L, 
    4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), .Label = c("10:01", "10:02", 
    "10:03", "10:04", "10:05", "10:06", "10:07", "10:08", "10:09", 
    "10:10", "10:11", "9:00", "9:30", "9:45", "9:50"), class = "factor")), .Names = c("Event", 
"Transaction.ID", "User.ID", "Event.Date", "Event.Time"), class = "data.frame", row.names = c(NA, 
-15L))

I would like to remove all rows within two fixed values under the column "Event", i.e. rows from "Login" to "Logout", which have all missing Transaction ID values between "Login" and "Logout":

I would also like to retain the current order of the dataset.

How do I do this in R?

Please do not provide your data as an image. For anyone to use it they would need to type it all in again. Instead, please use a text format. Ideally, you would have the data in R and use `dput` to provide the data structure. Your image looks like it is from Excel. _At least_ you could save as a csv and paste that into your question. — G5W, May 25 '17 at 19:36
I am a bit new to this site. Let me try attaching the dataset. — Kenneth Singh, May 25 '17 at 19:37
Yes, I am trying to have a sample here and understand the logic so that I can apply it in my parent dataset. — Kenneth Singh, May 25 '17 at 19:43

Andrew Gustar · Accepted Answer · 2017-05-26T07:27:43.037

0

You could do something like the following, which works with the data you supplied...

library(dplyr)

#add variables to mark login-logout blocks and number them
df <- df %>% mutate(session=cumsum(Event=="Login")-cumsum(Event=="Logout"),
                    block=c(0,cumsum(diff(session)!=0)),
                    block=ifelse(Event=="Logout",block-1,block))

#identify blocks to remove
df2 <- df %>% group_by(block) %>% 
              summarise(Login=first(session)>0,
                        noTrans=all(is.na(Transaction.ID))) %>% 
              filter(Login & noTrans)

#remove unwanted blocks and delete temporary variables
df <- df %>% filter(!(block %in% df2$block)) %>% 
             select(-c(session,block))

df
   Event Transaction.ID User.ID Event.Date Event.Time
1  Login           <NA>   kenn1  20/5/2017       9:00
2     e2             t1   kenn1  20/5/2017       9:30
3     e3           <NA>   kenn1  20/5/2017       9:45
4 Logout             t4   kenn1  20/5/2017       9:50
5     e8           <NA>   kenn2  20/5/2017      10:04
6     e9             t5   kenn2  20/5/2017      10:05

edited May 26 '17 at 07:27

answered May 25 '17 at 19:47

Andrew Gustar

17,295
1
22
32

I am getting a warning message after running the second code: "Warning message: In Ops.factor(10L, 0) : ‘>’ not meaningful for factors" If I ignore this and run the third code, I get a dataset with zero observations. – Kenneth Singh May 25 '17 at 19:59
Check you are using the version above - I spotted some mistakes and made a few edits to it! – Andrew Gustar May 25 '17 at 20:06
It's not giving any warning message now, but there seems to be no change in the dataset. – Kenneth Singh May 25 '17 at 20:12
It is impossible to tell without seeing your data in its actual format, but it could be something like rogue spaces - e.g. `"Login "` in the Event column, or the blanks in TransactionID being `" "` rather than `""`. – Andrew Gustar May 25 '17 at 20:16
How do I share this dataset? Can I just mail the csv? – Kenneth Singh May 25 '17 at 20:18
On a side now, I am noticing that the set we want to remove, df2 is returning zero observations. – Kenneth Singh May 25 '17 at 20:19
See here for asking good questions... https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. If df2 is blank it could well be rogue spaces. – Andrew Gustar May 25 '17 at 20:26
The "block" column that we created is showing all NA values. – Kenneth Singh May 25 '17 at 20:26
Just added the dput output. – Kenneth Singh May 25 '17 at 20:31
Just amended the solution to one that works! – Andrew Gustar May 26 '17 at 07:28
1

This works. Thanks a lot! – Kenneth Singh May 26 '17 at 15:20
One follow up question. After getting this updated dataset, if I wish to have only those chunks of rows between Login and Logout, how would I do that? In our case, I would like to remove rows with Events e8 and e9 since they do not come under any Login to Logout chunk. – Kenneth Singh May 26 '17 at 18:23
You could change the last line under `df2` to `filter(!Login | noTrans)` – Andrew Gustar May 26 '17 at 18:33

Subsetting datasets in R using presence of keywords

1 Answers1