How can I group rows together based upon certain conditions? (R or Python)

Question

Objective:

Group sections of data together if the Subject, Re, and Length columns have identical consecutive values, and if Folder is == "out" | "drafts", Message == "", and Edit is == "T" and take their duration.

Subject Re                    Length         Folder      Message   Date                   Edit     
        a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:01 AM     T                               
        a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:05 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:10 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:15 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:30 AM     T 



hey     a@mail.com,b@mail.com 80            draft                  1/2/2020 1:02:00 AM     T                        
hey     a@mail.com,b@mail.com 80            draft                  1/2/2020 1:02:05 AM     T                        




hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:03:10 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:03:20 AM     T

Desired Output

 Start                  End                        Duration          Group
 1/2/2020 1:00:10 AM    1/2/2020 1:00:30 AM        20                A
 1/2/2020 1:02:00 AM    1/2/2020 1:02:05 AM        5                 A
 1/2/2020 1:03:10 AM    1/2/2020 1:03:20 AM        10                A

dput:

 structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 
 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("", "hey"), class = "factor"), 
 Recipient = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 
 1L, 1L, 2L, 2L), .Label = c("", "sarah@mail.com,gee@mail.com"
 ), class = "factor"), Length = c(80L, 80L, 80L, 80L, 80L, 
 NA, NA, 80L, 80L, NA, NA, 100L, 100L), Folder = structure(c(3L, 
 3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("", 
 "draft", "out"), class = "factor"), Message = c(NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Date = structure(c(2L, 
 3L, 4L, 5L, 6L, 1L, 1L, 7L, 8L, 1L, 1L, 9L, 10L), .Label = c("", 
 "1/2/2020 1:00:01 AM", "1/2/2020 1:00:05 AM", "1/2/2020 1:00:10 AM", 
 "1/2/2020 1:00:15 AM", "1/2/2020 1:00:30 AM", "1/2/2020 1:02:00 AM", 
 "1/2/2020 1:02:05 AM", "1/2/2020 1:03:00 AM", "1/2/2020 1:03:20 AM"
 ), class = "factor"), Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE, 
 NA, NA, TRUE, TRUE, NA, NA, TRUE, TRUE)), class = "data.frame", row.names = c(NA, 
-13L))

I know that I can filter like this:

   df1<-df2 %>%
   mutate(Date = lubridate::mdy_hms(Date), 
    cond = Edit == "T" & ItemFolderName == "out" | Folder == "drafts" &     Message == "" & Subject     ==  ?   & Re ==   ?     & Length == ?

But unsure of how to incorporate "if there are consecutive values". I will continue to research, any help or suggestion is greatly appreciated.

score 1 · Accepted Answer · answered Feb 27 '20 at 22:45

Your structure looks a little different than the data frame you posted:

> df
   Subject                   Recipient Length Folder Message                Date Edit
1                                          80    out      NA 1/2/2020 1:00:01 AM TRUE
2                                          80    out      NA 1/2/2020 1:00:05 AM TRUE
3      hey sarah@mail.com,gee@mail.com     80    out      NA 1/2/2020 1:00:10 AM TRUE
4      hey sarah@mail.com,gee@mail.com     80    out      NA 1/2/2020 1:00:15 AM TRUE
5      hey sarah@mail.com,gee@mail.com     80    out      NA 1/2/2020 1:00:30 AM TRUE
6                                          NA             NA                       NA
7                                          NA             NA                       NA
8      hey sarah@mail.com,gee@mail.com     80  draft      NA 1/2/2020 1:02:00 AM TRUE
9      hey sarah@mail.com,gee@mail.com     80  draft      NA 1/2/2020 1:02:05 AM TRUE
10                                         NA             NA                       NA
11                                         NA             NA                       NA
12     hey sarah@mail.com,gee@mail.com    100  draft      NA 1/2/2020 1:03:00 AM TRUE
13     hey sarah@mail.com,gee@mail.com    100  draft      NA 1/2/2020 1:03:20 AM TRUE

Also, your desired output suggests that you want to split groups by Folder as well as by the other categories, but that isn't what your description says, so I didn't group by Folder. That's easy to change if you want, though.

You can disambiguate groups of identical consecutive values in sorted data by using run length encoding, but rle in R is a little tricky to turn into a data frame column. I used this answer to accomplish that.

library(lubridate)
library(dplyr)

df %>%
  mutate(Date = mdy_hms(Date),
         Key = paste(Subject, Recipient, Length, sep = "_")) %>%
  arrange(Date) %>%
  filter(Folder == "out" | Folder == "draft" & Edit == TRUE) %>%
  mutate(RLE = {RLE = rle(Key) ; rep(seq_along(RLE$lengths), RLE$lengths)}) %>%
  group_by(RLE) %>%
  summarize(Start = first(Date),
            End = last(Date),
            Duration = as.numeric(End) - as.numeric(Start))

This creates groups from rows 1:2, 3:5+8:9, and 12:13. Those groups give these durations:

# A tibble: 3 x 4
    RLE Start               End                 Duration
  <int> <dttm>              <dttm>                 <dbl>
1     1 2020-01-02 01:00:01 2020-01-02 01:00:05        4
2     2 2020-01-02 01:00:10 2020-01-02 01:02:05      115
3     3 2020-01-02 01:03:00 2020-01-02 01:03:20       20

If you want Folder included in the grouping, add it to what is included in the creation of Key. That makes the groups 1:2, 3:5, 8:9, and 12:13. Doing so gives this result:

# A tibble: 4 x 4
    RLE Start               End                 Duration
  <int> <dttm>              <dttm>                 <dbl>
1     1 2020-01-02 01:00:01 2020-01-02 01:00:05        4
2     2 2020-01-02 01:00:10 2020-01-02 01:00:30       20
3     3 2020-01-02 01:02:00 2020-01-02 01:02:05        5
4     4 2020-01-02 01:03:00 2020-01-02 01:03:20       20

Thank you @Ajar is there a way to also show a Group with this, or to display the Subject or recipients with this? Like Group A etc For instance, these 4 rows would all belong to Group A for instance — Lynn, Feb 27 '20 at 23:01
Please help, how do I make these belong to the sane group when the last rows subject , recipient and length columns match with the first row of the next blocks subject, recipient and length column — Lynn, Feb 27 '20 at 23:25
The `RLE` column already uniquely identifies each group. If you don't run the `group_by` and everything that comes after it, you'll still have the full dataset but with the groups uniquely identified. — Ajar, Feb 28 '20 at 15:16
Ok thank you ,@Ajar The reason I mention this is because the first part of the block that has blank subject fields is actually apart of that first block of data. I am trying to figure out how to label it as the SAME group. The whole group starts at 2020-01-02 01:00:01 and ends at 2020-01-02 01:00:30 — Lynn, Feb 28 '20 at 15:21
How do you know it's part of the same group? Whatever information you rely on to know that is what you need to add to the creation of the `Key` that is used to produce the run length encodings. — Ajar, Feb 28 '20 at 16:52

How can I group rows together based upon certain conditions? (R or Python)

1 Answers1