I have a DF with 4 columns. The first column holds the stations, and the other 3 columns hold the time, the weekday and the number of people. My goal is to fit a regression (glm) for every single station. I think it would be easier with a list, wouldn't it? My question is: how do I make a list, and how do I run the regression (glm) for each station using that list?

My DF looks like this:

[screenshot of the DF]

TrainStation is chr, Weekday and TimeOfday are factors, and NumberOfPassenger is num.

Example code:

TrainStation<-c("East","North","East","North","North","Central","North","Central","East","North","East","North","Central","North","Central","North","Central","North","Central","North","Central","North","Central","East","North","East","North","Central","North","Central","East","North","East","North","Central","East")
TimeOfday<-c(12,12,8,16,10,6,0,7,1,3,23,15,12,8,16,10,1,3,5,7,9,10,12,11,17,2,4,5,13,14,18,19,20,21,22,23)
Date<-sample(seq(as.Date('2019/01/01'), as.Date('2019/02/28'), by="day"), 36)
Date<-as.character(Date)
DF<-cbind(TrainStation,TimeOfday,Date)
DF<-as.data.frame(DF)

#Weekdays
DF$Date<-as.Date(DF$Date)
DF$Date<-weekdays(DF$Date)
#TimeOfday
DF$TimeOfday<-strptime(DF$TimeOfday,format = "%H")
# hour() comes from lubridate, which is not attached at this point
DF$TimeOfday<-lubridate::hour(DF$TimeOfday)

DF$TrainStation<-as.character(DF$TrainStation)
DF$TimeOfday<-as.factor(DF$TimeOfday)
DF$Date<-as.factor(DF$Date)

#Data for regression
library(tidyverse)
DF2<-DF%>%
  group_by(TrainStation,Date,TimeOfday)%>%
  summarize(NumberOfPassenger = n_distinct(TrainStation))

Thank you very much for your help!

Edin Mar

1 Answer

Using your data, this is what you could do.

Your data:

TrainStation<-c("East","North","East","North","North","Central","North","Central","East","North","East","North","Central","North","Central","North","Central","North","Central","North","Central","North","Central","East","North","East","North","Central","North","Central","East","North","East","North","Central","East")
TimeOfday<-c(12,12,8,16,10,6,0,7,1,3,23,15,12,8,16,10,1,3,5,7,9,10,12,11,17,2,4,5,13,14,18,19,20,21,22,23)
Date<-sample(seq(as.Date('2019/01/01'), as.Date('2019/02/28'), by="day"), 36)
Date<-as.character(Date)
DF<-cbind(TrainStation,TimeOfday,Date)
DF<-as.data.frame(DF)

#Weekdays
DF$Date<-as.Date(DF$Date)
DF$Date<-weekdays(DF$Date)
#TimeOfday
DF$TimeOfday<-strptime(DF$TimeOfday,format = "%H")
# hour() comes from lubridate, which is not attached at this point
DF$TimeOfday<-lubridate::hour(DF$TimeOfday)

DF$TrainStation<-as.character(DF$TrainStation)
DF$TimeOfday<-as.factor(DF$TimeOfday)
DF$Date<-as.factor(DF$Date)

#Data for regression
library(tidyverse)
DF2<-DF%>%
  group_by(TrainStation,Date,TimeOfday)%>%
  summarize(NumberOfPassenger = n_distinct(TrainStation))

Moving on to the modeling step, you can nest the data by station into a list-column and then fit your model to each nested data frame:

DF2 %>%
  ungroup() %>% 
  group_by(TrainStation) %>% 
  nest() %>% 
  mutate(model = map(data, ~glm(NumberOfPassenger~TimeOfday+Date, family = poisson(), data = .)))

That will give you something that looks like:

# A tibble: 3 x 3
  TrainStation data              model    
  <chr>        <list>            <list>   
1 Central      <tibble [11 x 3]> <S3: glm>
2 East         <tibble [9 x 3]>  <S3: glm>
3 North        <tibble [16 x 3]> <S3: glm>
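
Since the question asks specifically about a list, the same per-station fit can also be sketched in base R with `split()` and `lapply()`. This is a minimal sketch rather than part of the approach above: the data frame `toy` and its passenger counts are invented for illustration so that the block runs on its own, and the hour is kept numeric to keep the example small.

```r
# Hypothetical toy data (invented for illustration), using the same
# column names as DF2 but a numeric hour instead of a factor.
set.seed(1)
toy <- data.frame(
  TrainStation = rep(c("Central", "East", "North"), each = 12),
  TimeOfday = rep(seq(0, 22, by = 2), times = 3),
  stringsAsFactors = FALSE
)
toy$NumberOfPassenger <- rpois(nrow(toy), lambda = 5 + toy$TimeOfday / 2)

# split() returns a named list with one data frame per station
by_station <- split(toy, toy$TrainStation)

# lapply() fits one Poisson glm per list element
models <- lapply(by_station, function(d) {
  glm(NumberOfPassenger ~ TimeOfday, family = poisson(), data = d)
})

# Each fitted model can be accessed by station name
coef(models[["East"]])
```

The list names come from the station values, so `models$North` or `summary(models[["Central"]])` work in the same way.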

The model column now holds one fitted glm per station, next to the nested data it was fitted on. If you want to extract the model parameters for each station, you could do something like:

DF2 %>%
  ungroup() %>% 
  group_by(TrainStation) %>% 
  nest() %>% 
  mutate(model = map(data, ~glm(NumberOfPassenger~TimeOfday+Date, family = poisson(), data = .))) %>% 
  mutate(tidy_model = map(model, broom::tidy)) %>% 
  select(TrainStation, tidy_model) %>% 
  unnest(tidy_model)

This gives you all of the model parameters for each station:

# A tibble: 35 x 6
   TrainStation term           estimate std.error statistic p.value
   <chr>        <chr>             <dbl>     <dbl>     <dbl>   <dbl>
 1 Central      (Intercept)    4.68e-11     1.000  4.68e-11   1.000
 2 Central      TimeOfday12   -3.19e-35     1.41  -2.26e-35   1    
 3 Central      TimeOfday14    5.24e-34     1.41   3.70e-34   1    
 4 Central      TimeOfday16    1.03e-34     1.41   7.28e-35   1    
 5 Central      TimeOfday22   -5.21e-18     2.00  -2.61e-18   1    
 6 Central      TimeOfday5    -5.21e-18     1.41  -3.68e-18   1    
 7 Central      TimeOfday6     2.17e-34     1.41   1.53e-34   1  
MDEWITT
  • Thanks for the detailed answer. It works in this example, but with the original data I get: Error: evaluation error contrasts can be applied only to factors with 2 or more levels. I have only two factors (TimeOfday and Weekday), and they have 24 and 7 levels. – Edin Mar May 15 '19 at 14:57
  • This could occur if you have missing data, unused levels, see and – MDEWITT May 15 '19 at 15:52
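
One way to check for the situation described in the comment above is to count, per station, how many levels of each factor are actually observed. The following is a sketch under assumptions: the data frame `toy` is invented for illustration, and `observed_levels()` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical toy data: station "East" was only observed on Mondays,
# so within that station the factor Date has a single observed level.
toy <- data.frame(
  TrainStation = c("East", "East", "East", "North", "North", "North", "North"),
  Date = factor(c("Monday", "Monday", "Monday", "Monday", "Tuesday", "Monday", "Tuesday")),
  TimeOfday = factor(c(8, 9, 10, 8, 8, 9, 9)),
  NumberOfPassenger = c(3, 5, 4, 6, 2, 7, 3),
  stringsAsFactors = FALSE
)

# observed_levels() is a hypothetical helper: for one station's rows it
# drops rows with missing values and unused levels, then counts levels
observed_levels <- function(d) {
  d <- droplevels(na.omit(d))
  vapply(d[c("Date", "TimeOfday")], nlevels, integer(1))
}

# One row per station, one column per factor
level_counts <- t(sapply(split(toy, toy$TrainStation), observed_levels))
level_counts

# Stations where any factor has fewer than 2 observed levels
problem_stations <- rownames(level_counts)[apply(level_counts < 2, 1, any)]
problem_stations
```

Stations flagged this way could then be skipped, or fitted with a formula that leaves out the single-level factor, before mapping `glm()` over the rest.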