1

I have a very large dataset - around 20 million observations and this is its basic structure -

           date       time      string
  1     01/01/2020   20:00:00     A  
  2     01/01/2020   20:13:12     B
  3     01/01/2020   20:37:45     C
  4     01/01/2020   20:39:07     D 
  5     01/01/2020   20:41:29     A
  6     01/01/2020   20:46:48     E
  7     01/01/2020   21:00:00     J

I would like to have a new column, "mode" perhaps, that would calculate the most frequently occurring text string in the "string" column, but only on an hourly interval. So the table would end up with something like this -

           date       time      string      mode
  1     01/01/2020   20:00:00     A          
  2     01/01/2020   20:13:12     B
  3     01/01/2020   20:37:45     C
  4     01/01/2020   20:39:07     D 
  5     01/01/2020   20:41:29     A
  6     01/01/2020   20:46:48     E
  7     01/01/2020   21:00:00     J          A
  8     01/01/2020   21:20:12     I
  9     01/01/2020   21:38:32     I  
  10    01/01/2020   21:43:12     A
  11    01/01/2020   21:49:50     I
  12    01/01/2020   21:54:50     B
  13    01/01/2020   22:00:00     A          I
  14    01/01/2020   22:03:45     B

so that every time the hour digit increments, a mode is taken of the string column, and then the mode measurement resets for the next hour-long interval.

I'm very new to R, so unfortunately I don't have any attempts or error messages to show. I've looked at many other similar topics / threads and haven't come across anything that has helped me find a viable solution. Certainly not asking for anyone to write code for me - just any suggestions at all would be very much appreciated.

2 Answers2

0

An option is to use the Mode function from here

 Mode <- function(x) {
       ux <- unique(x)
     ux[which.max(tabulate(match(x, ux)))]
}

Create a grouping variable by pasteing the 'date', 'time' columns, convert it to Datetime class with dmy_hms (from lubridate), then use ceiling_date specifying '1 hour' as unit, create the 'mode' column by applying the Mode on the 'string' column, and use case_when to return that value only at the last row for each group

library(dplyr)
library(lubridate
library(stringr)
df1 %>% 
     group_by(grp = ceiling_date(dmy_hms(str_c(date, time, sep=" ")),
          '1 hour')) %>% 
     mutate(mode = case_when(row_number() == n() ~ Mode(string), 
                TRUE  ~ "")) %>%
     ungroup %>% 
     select(-grp)

-output

# A tibble: 14 x 4
#   date       time     string mode 
#   <chr>      <chr>    <chr>  <chr>
# 1 01/01/2020 20:04:01 A      ""   
# 2 01/01/2020 20:13:12 B      ""   
# 3 01/01/2020 20:37:45 C      ""   
# 4 01/01/2020 20:39:07 D      ""   
# 5 01/01/2020 20:41:29 A      ""   
# 6 01/01/2020 20:46:48 E      ""   
# 7 01/01/2020 21:00:00 J      "A"  
# 8 01/01/2020 21:20:12 I      ""   
# 9 01/01/2020 21:38:32 I      ""   
#10 01/01/2020 21:43:12 A      ""   
#11 01/01/2020 21:49:50 I      ""   
#12 01/01/2020 21:54:50 B      ""   
#13 01/01/2020 22:00:00 A      "I"  
#14 01/01/2020 22:03:45 B      "B"

data

df1 <- structure(list(date = c("01/01/2020", "01/01/2020", "01/01/2020", 
"01/01/2020", "01/01/2020", "01/01/2020", "01/01/2020", "01/01/2020", 
"01/01/2020", "01/01/2020", "01/01/2020", "01/01/2020", "01/01/2020", 
"01/01/2020"), time = c("20:04:01", "20:13:12", "20:37:45", "20:39:07", 
"20:41:29", "20:46:48", "21:00:00", "21:20:12", "21:38:32", "21:43:12", 
"21:49:50", "21:54:50", "22:00:00", "22:03:45"), string = c("A", 
"B", "C", "D", "A", "E", "J", "I", "I", "A", "I", "B", "A", "B"
)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", 
"6", "7", "8", "9", "10", "11", "12", "13", "14"))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thank you so much for the wizardry on this one! I really liked your solution and it executes given the data, but I was unable to get it to work on my df. The other solution did so (due to time constraints) I'm going to implement that one. I think perhaps my df is somehow structured a little differently than the input you're using - although I can't find the difference. Your time is very, very much appreciated! – cosmocoder1 May 31 '21 at 16:58
0

Using library(data.table) we can do

setDT(df1)[, hour := paste(date, sub(':.+','', time))]
df1[, n := seq(.N), by = .(hour, string)]
df1[, mode := string[which.max(n)], by=hour]
dww
  • 30,425
  • 5
  • 68
  • 111
  • Thanks so much for this - it works well and gives me what I'm looking for! Very much appreciated. I realized when I saw both of these answers just how out my league this maneuver was so much gratitude to both who answered! – cosmocoder1 May 31 '21 at 17:00