I have a very large dataset - around 20 million observations and this is its basic structure -
date time string
1 01/01/2020 20:00:00 A
2 01/01/2020 20:13:12 B
3 01/01/2020 20:37:45 C
4 01/01/2020 20:39:07 D
5 01/01/2020 20:41:29 A
6 01/01/2020 20:46:48 E
7 01/01/2020 21:00:00 J
I would like to have a new column, "mode" perhaps, that would calculate the most frequently occurring text string in the "string" column, but only on an hourly interval. So the table would end up with something like this -
date time string mode
1 01/01/2020 20:00:00 A
2 01/01/2020 20:13:12 B
3 01/01/2020 20:37:45 C
4 01/01/2020 20:39:07 D
5 01/01/2020 20:41:29 A
6 01/01/2020 20:46:48 E
7 01/01/2020 21:00:00 J A
8 01/01/2020 21:20:12 I
9 01/01/2020 21:38:32 I
10 01/01/2020 21:43:12 A
11 01/01/2020 21:49:50 I
12 01/01/2020 21:54:50 B
13 01/01/2020 22:00:00 A I
14 01/01/2020 22:03:45 B
so that every time the hour digit increments, a mode is taken of the string column, and then the mode measurement resets for the next hour-long interval.
I'm very new to R, so unfortunately I don't have any attempts or error messages to show. I've looked at many other similar topics / threads and haven't come across anything that has helped me find a viable solution. Certainly not asking for anyone to write code for me - just any suggestions at all would be very much appreciated.