I am currently working on a project and need some help. I want to predict the length of flight delays using a statistical model. The data set does not contain the delay itself, but it can be calculated from the actual and scheduled departure times: actual departure time minus scheduled departure time gives the flight delay, which is the dependent variable. I am struggling to get the explanatory (independent) variables into a useful form for regression analysis; the main problem is the time format of the first two columns when the table is read in from the CSV file. I have linked the data file below because I wasn't sure how to attach it to the question; I'm new to this coding thing, hehe. Any help will be appreciated. xx

https://drive.google.com/file/d/11BXmJCB5UGEIRmVkM-yxPb_dHeD2CgXa/view?usp=sharing

EDIT:

Firstly, thank you for all the help.

Okay, I'm going to try to ask more precise questions on this topic:

So after importing the file using:

1)

    Delays <- read.table("FlightDelaysSM.csv",header =T,sep=",") 

2) The main issue I am having is getting the schedtime and deptime columns into a format where I can do arithmetic on them.

3) I tried the following:

    Delays[,1] - Delays[,2] 

where the obvious issue arises: for example, 800 (8 am) - 756 (7:56 am) = 44, not 4 minutes.
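One way around the 44-versus-4 problem is to turn each HHMM number into minutes since midnight before subtracting. The sketch below is only an illustration: `hhmm_to_minutes` is a hypothetical helper name and the sample values are made up, not taken from the real data set.

```r
# Hypothetical helper: convert HHMM numbers (e.g. 756, 1430) to minutes since midnight
hhmm_to_minutes <- function(x) (x %/% 100) * 60 + (x %% 100)

# Made-up sample values standing in for the schedtime and deptime columns
schedtime <- c(800, 1455, 930)
deptime   <- c(756, 1510, 930)

# Delay in minutes; a negative value means the flight left early
delay_minutes <- hhmm_to_minutes(deptime) - hhmm_to_minutes(schedtime)
delay_minutes
```

This assumes both times fall on the same calendar day; a departure that slips past midnight would need extra handling.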

4) Using the help from @Kerry Jackson (thank you, you're amazing x), I tried

    DepartureTime <- strptime(formatC(Delays$deptime, width = 4, format = "d", flag = "0"), "%H%M")

    ScheduleTime <- strptime(formatC(Delays$schedtime, width = 4, format = "d", flag = "0"), "%H%M")

    DelayTime = DepartureTime - ScheduleTime

The resulting values are given in seconds, but I want the difference in minutes. How would I go about doing this?
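A possible approach, sketched here with made-up values, is to call `difftime()` with `units = "mins"` instead of using plain subtraction, which lets R pick the units. `parse_hhmm` is a hypothetical helper name, and the `tz = "UTC"` argument is an assumption added so that daylight-saving changes cannot affect the result.

```r
# Made-up sample values standing in for Delays$schedtime and Delays$deptime
schedtime <- c(800, 1455, 930)
deptime   <- c(756, 1510, 930)

# Hypothetical helper: pad to four digits, then parse as hours and minutes
parse_hhmm <- function(x) {
  as.POSIXct(strptime(formatC(x, width = 4, format = "d", flag = "0"),
                      "%H%M", tz = "UTC"))
}

# difftime() lets you fix the units instead of letting R choose them
DelayTime <- difftime(parse_hhmm(deptime), parse_hhmm(schedtime), units = "mins")

# as.numeric() drops the difftime class, leaving plain numbers of minutes
DelayMinutes <- as.numeric(DelayTime)
DelayMinutes
```

A plain numeric column is usually easier to feed into a regression than a difftime object.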

5) I then did the following:

    DelayData <- data.frame(ScheduleTime, DepartureTime, DelayTime, Delays[, 4:7])

This is what I get after creating DelayData:

As you can see in the image, the DelayTime column carries the seconds units, which I don't want, as stated in 4), and the ScheduleTime and DepartureTime columns include the date. Could I get some suggestions on how to correct this?
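The date shows up because `strptime()` fills in today's date when only hours and minutes are supplied. A minimal sketch of one fix, assuming the times are only needed for display, is to store them as `"HH:MM"` text with `format()` and keep the delay as a plain number of minutes. The sample values and the `parse_hhmm` helper are hypothetical.

```r
# Made-up sample values; in the question these come from Delays$schedtime and Delays$deptime
schedtime <- c(800, 1455, 930)
deptime   <- c(756, 1510, 930)

# Hypothetical helper: pad to four digits, then parse as hours and minutes
parse_hhmm <- function(x) {
  strptime(formatC(x, width = 4, format = "d", flag = "0"), "%H%M", tz = "UTC")
}
ScheduleTime  <- parse_hhmm(schedtime)
DepartureTime <- parse_hhmm(deptime)

DelayData <- data.frame(
  # format() keeps only the clock time, dropping the date that strptime() added
  ScheduleTime  = format(ScheduleTime, "%H:%M"),
  DepartureTime = format(DepartureTime, "%H:%M"),
  # numeric minutes, with no "secs" or "mins" label attached
  DelayMinutes  = as.numeric(difftime(DepartureTime, ScheduleTime, units = "mins"))
)
DelayData
```

The remaining columns from the question (`Delays[, 4:7]`) could be passed to `data.frame()` in the same call.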

  • So what exactly did you try? Where are you getting stuck? Stack Overflow is for specific programming questions. It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input (in a reproducible format, not on an external site) and desired output that can be used to test and verify possible solutions. – MrFlick Mar 26 '19 at 17:46
  • If you are trying to calculate the flight delay, perhaps you want to do something like `strptime(formatC(df$deptime, width = 4, format = "d", flag = "0"), "%H%M") - strptime(formatC(df$schedtime, width = 4, format = "d", flag = "0"), "%H%M")`. – Kerry Jackson Mar 26 '19 at 18:20
  • Thank you guys so much! I have edited the question so it is hopefully clearer what I'm stuck on. – CuriousKathy Mar 26 '19 at 20:10

1 Answer

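Create a new column called flight_delay:

    install.packages('tidyverse')  # only needed once
    library(tidyverse)

    # Convert the HHMM values to minutes since midnight before subtracting;
    # a raw deptime - schedtime gives 44, not 4, for 800 vs 756.
    your_data <- your_data %>%
      mutate(flight_delay = ((deptime %/% 100) * 60 + deptime %% 100) -
                            ((schedtime %/% 100) * 60 + schedtime %% 100))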

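Now, create a linear regression model predicting flight_delay from the other variables. Leave deptime out of the predictors, since flight_delay is computed directly from it:

    mod <- lm(flight_delay ~ . - deptime, data = your_data)

To simplify the model with stepwise AIC selection, use the step function:

    mod <- step(mod)

Analyze results:

    summary(mod)

jfeuerman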