2

I have dataset of hierarchical events where there is one row for one event.

TIME               level1   level2  Occurrence
29/11/2019 00:05    A       a       1
29/11/2019 00:05    B       a       1
29/11/2019 00:07    B       b       1
29/11/2019 00:20    B       b       1
29/11/2019 00:05    B       c       1
29/11/2019 01:20    A       a       1
29/11/2019 01:25    A       a       1
29/11/2019 02:00    A       a       2
29/11/2019 02:00    B       a       1
29/11/2019 02:00    B       b       1
29/11/2019 02:35    B       b       1
29/11/2019 02:49    B       c       1

I am aggregating it with Pandas groupby and grouper to get an output as below

df_agg = df.groupby([pd.Grouper(freq='H'), 'level1', pd.Grouper('level2')])
df_agg.count()
TIME               level1   level2  Count
29/11/2019 00:00    A       a       1
                    B       a       1
                    B       b       2
                    B       c       1
29/11/2019 01:00    A       a       2
29/11/2019 02:00    A       a       2
                    B       a       1
                    B       b       2
                    B       c       1

Can I achieve something similar in R?

I am attaching a commands to create the dataset similar to what i am working

dict = {"TIME" : ['29/11/2019  00:05:00', '29/11/2019  00:05:00', '29/11/2019  00:07:00', '29/11/2019  00:20:00',
                 '29/11/2019  00:05:00', '29/11/2019  01:20:00', '29/11/2019  01:25:00', '29/11/2019  02:00:00',
                 '29/11/2019  02:00:00', '29/11/2019  02:00:00', '29/11/2019  02:35:00', '29/11/2019  02:49:00'],
        "level1" : ["A", "B", "B", "B", "B", "A", "A", "A", "B","B", "B", "B"],
        "level2" : ["a", "a", "b", "b", "c", "a", "a", "a", "a", "b", "b","c"]}

tmp_df = pd.DataFrame(dict)
tmp_df = tmp_df.set_index('TIME')
tmp_df.index = pd.to_datetime(tmp_df.index)
Praveen
  • 2,137
  • 1
  • 18
  • 21
  • [How to provide a reproducible example **in R**](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) ... not python. – M-- Jan 19 '20 at 17:20

2 Answers2

2

We can use dplyr package:

library(dplyr)

dat %>% 
  group_by(TIME = format(dat$TIME,format='%d/%m/%Y %H:00:00'), level1, level2) %>% 
  count(name = "Count")

#> # A tibble: 9 x 4
#> # Groups:   TIME, level1, level2 [9]
#>   TIME                level1 level2 Count
#>   <chr>               <chr>  <chr>  <int>
#> 1 29/11/2019 00:00:00 A      a          1
#> 2 29/11/2019 00:00:00 B      a          1
#> 3 29/11/2019 00:00:00 B      b          2
#> 4 29/11/2019 00:00:00 B      c          1
#> 5 29/11/2019 01:00:00 A      a          2
#> 6 29/11/2019 02:00:00 A      a          1
#> 7 29/11/2019 02:00:00 B      a          1
#> 8 29/11/2019 02:00:00 B      b          2
#> 9 29/11/2019 02:00:00 B      c          1

Data: This is the data I used. Please provide your data by using dput(dat) instead of copy/pasting.

structure(list(TIME = structure(c(1574985900, 1574985900, 1574986020, 
1574986800, 1574985900, 1574990400, 1574990700, 1574992800, 1574992800, 
1574992800, 1574994900, 1574995740), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), level1 = c("A", "B", "B", "B", "B", "A", "A", 
"A", "B", "B", "B", "B"), level2 = c("a", "a", "b", "b", "c", 
"a", "a", "a", "a", "b", "b", "c"), Occurrence = c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -12L), spec = structure(list(
    cols = list(TIME = structure(list(format = "%d/%m/%Y %H:%M"), class = c("collector_datetime", 
    "collector")), level1 = structure(list(), class = c("collector_character", 
    "collector")), level2 = structure(list(), class = c("collector_character", 
    "collector")), Occurrence = structure(list(), class = c("collector_integer", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))
M--
  • 25,431
  • 8
  • 61
  • 93
  • Thanks M--, with the command you provided the aggregation of the time is not happening. I am checking how to attached the sample dataset – Praveen Jan 19 '20 at 08:23
  • @Praveen that is because your `Time` column is not class of `POSIXct`. Convert it with `as.POSIXct` or using `lubridate::dmy_hm` and this works. – M-- Jan 19 '20 at 17:05
1

Using lubridate and dplyr, you can do

library(dplyr)
library(lubridate)
df %>%
  mutate(TIME = floor_date(dmy_hm(TIME), "hour")) %>%
  count(TIME, level1, level2)

# A tibble: 9 x 4
#  TIME                level1 level2     n
#  <dttm>              <fct>  <fct>  <int>
#1 2019-11-29 00:00:00 A      a          1
#2 2019-11-29 00:00:00 B      a          1
#3 2019-11-29 00:00:00 B      b          2
#4 2019-11-29 00:00:00 B      c          1
#5 2019-11-29 01:00:00 A      a          2
#6 2019-11-29 02:00:00 A      a          1
#7 2019-11-29 02:00:00 B      a          1
#8 2019-11-29 02:00:00 B      b          2
#9 2019-11-29 02:00:00 B      c          1
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks Ronak. With the command you provided, I get the warning message : All formats failed to parse. No formats found. And the Time is being evaluated as 'NA' I guess the date parsing is going wrong. Can you suggest what should be the time format ? – Praveen Jan 19 '20 at 08:27
  • @Praveen In the data you shared earlier you had hours and minutes but in the update you have hours minutes and seconds. So use `dmy_hms` instead. Try `df %>% mutate(TIME = floor_date(dmy_hms(TIME), "hour")) %>% count(TIME, level1, level2)` – Ronak Shah Jan 19 '20 at 08:29
  • @Rohan, my mistake. Sorry about that. The new command is working and producing the output that i expect. Thanks – Praveen Jan 19 '20 at 08:32