0

I'm trying to group/aggregate my taxi trip data by time (at the granularity of an hour), month, day, year and pick up location ID.

So that my output data should have a row like 2014 04 01 1 123 375 ; this is representing 375 taxi trips happened on the 1st April 2014 at 1am at the pick up location 123.

My input dataframe are:

PULocationID day month year hour
    153       1    1   2014    1  
    122       3    12  2012   13
    153       1    1   2014    1
    122       3    12  2012   13

I would like these to then be grouped and look like the below with a new taxi_trips column:

PULocationID day month year hour Taxi_Trips
    153       1    1   2014    1   2
    122       3    12  2012   13   2
smci
  • 32,567
  • 20
  • 113
  • 146
KarC
  • 101
  • 5
  • Welcome to SO. In general you're expected to attempt with your own code and then post it here when you get stuck; please see any good tutorial/reference on `groupby` or `dplyr` e.g. https://rdrr.io/cran/dplyr/man/group_by.html – smci Jul 11 '20 at 23:46

2 Answers2

2

We can use count from dplyr

library(dplyr)
df1 %>% 
     count(PULocationID, day, month, year, hour)
#  PULocationID day month year hour n
#1          122   3    12 2012   13 2
#2          153   1     1 2014    1 2

Or to make this compact specify the column names, convert to symbol and evaluate (!!!)

df1 %>%
     count(!!! rlang::syms(names(.)))

Or using aggregate from base R

aggregate(Taxi_Trips ~ ., transform(df1, Taxi_Trips = 1), FUN = sum)
#    PULocationID day month year hour Taxi_Trips
#1          153   1     1 2014    1          2
#2          122   3    12 2012   13          2

data

df1 <- structure(list(PULocationID = c(153L, 122L, 153L, 122L), day = c(1L, 
3L, 1L, 3L), month = c(1L, 12L, 1L, 12L), year = c(2014L, 2012L, 
2014L, 2012L), hour = c(1L, 13L, 1L, 13L)), class = "data.frame", 
row.names = c(NA, 
-4L))
akrun
  • 874,273
  • 37
  • 540
  • 662
0

You can use an aggregate function.

    data = data.frame(PULocationID = c(153, 122, 153, 122), 
                        day = c(1, 3, 1, 3), 
                        month = c(1, 12, 1, 12), 
                        year = c(2014, 2012, 2014, 2012), 
                        hour = c(1, 13, 1, 13))

data$Taxi_trips = 1
aggregate(cbind(Taxi_trips = Taxi_trips) ~ PULocationID + day + month + year + hour,
                            data = data, function(x){NROW(x)})
data

  PULocationID day month year hour Taxi_trips
1          153   1     1 2014    1          1
2          122   3    12 2012   13          1
3          153   1     1 2014    1          1
4          122   3    12 2012   13          1

N in the new dataframe is the number of trips.

Haci Duru
  • 456
  • 3
  • 9