
I have an airline dataset from stat computing which I am trying to analyse.

There are two variables, DepTime and ArrDelay (departure time and arrival delay). I am trying to analyse how arrival delay varies across chunks of departure time. My objective is to find which departure-time chunks a person should avoid when booking tickets in order to avoid arrival delays.

My understanding: if a one-tailed t-test between arrival delays for departure times > 1800 and arrival delays for departure times > 1900 shows high significance, it means that one should avoid flights between 1800 and 1900. (Please correct me if I am wrong.) I want to run such tests for all departure hours.

I am totally new to programming and data science. Any help would be much appreciated.

Data looks like this. The highlighted columns are the ones I am analysing

[Image: snapshot of the dataset]

iehrlich
Anu
  • So do you want to test all departure hours against each other? It may be better to test each hour vs. all hours; that way you know which times are better or worse than "an average day." Why don't you post some data and what you want the output to look like so we can better help you? – emilliman5 Nov 30 '16 at 14:05
  • See this [SO Post](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on how to make a reproducible R example – emilliman5 Nov 30 '16 at 14:30
  • Sorry for the previous comment. Considering just the two columns DepTime and ArrDelay, the data looks like this: [1829 (time): 23 (delay in minutes)], [1700: 10], [1000: 5], [1750: 137]. Your idea sounds fine too. I basically want to see which hours of the day are unfavorable for travel with respect to delays. – Anu Nov 30 '16 at 14:32
  • Please put all code and data necessary to reproduce this in the question itself – Hack-R Nov 30 '16 at 14:45
  • Added a snapshot of the dataset to the question. – Anu Nov 30 '16 at 18:19
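A minimal sketch of what the comments above are asking for: a small, copy-pasteable data frame instead of a screenshot. It uses only the four (DepTime, ArrDelay) pairs quoted in the comments; the name `flights_small` is made up for illustration.

```r
# Toy data: the four (DepTime, ArrDelay) pairs quoted in the comments.
# `flights_small` is a hypothetical name, not part of the original dataset.
flights_small <- data.frame(
  DepTime  = c(1829, 1700, 1000, 1750),
  ArrDelay = c(23, 10, 5, 137)
)

# dput() prints R code that recreates the object exactly;
# its output can be pasted straight into a question.
dput(flights_small)
```

For a large data frame, sharing a slice such as `dput(head(flights, 20))` is usually enough for others to reproduce the problem.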

1 Answer


Sharing an image of the data is not the same as providing the data for us to work with.

That said, I went and grabbed one year of data and worked this up.

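# Read one year of the airline data.
flights <- read.csv("~/Downloads/1995.csv", header = TRUE)

# Keep only the two columns of interest.
flights <- flights[, c("DepTime", "ArrDelay")]
# DepTime is in hhmm format; subtracting 30 and rounding to the nearest
# hundred bins each departure into its hour (e.g. 1829 -> 1800).
flights$Dep <- round(flights$DepTime - 30, digits = -2)
head(flights, n = 25)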

# This tests each hour of departures against the entire day. 
# Alternative is set to "less" because we want to know if a given hour
# has less delay than the day as a whole.

pVsDay <- tapply(flights$ArrDelay, flights$Dep, 
                 function(x) t.test(x, flights$ArrDelay, alternative = "less"))

# This tests each hour of departures against every other hour of the day. 
# Alternative is set to "less" because we want to know if a given hour
# has less delay than the other hours.
pAllvsAll <- tapply(flights$ArrDelay, flights$Dep,
                    function(x) tapply(flights$ArrDelay, flights$Dep,
                                       function(z) t.test(x, z, alternative = "less")))

I'll let you figure out multiple hypothesis testing and the like.
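As a starting point for the multiple-testing step, here is a minimal sketch using base R's `p.adjust()`. It runs on simulated data (the `sim` data frame is invented for illustration and is not the airline data), so the numbers it prints say nothing about real flights.

```r
# Simulated stand-in for the airline data (NOT real results):
# 24 departure hours, 50 flights each, delays drawn from one normal distribution.
set.seed(1)
sim <- data.frame(
  Dep      = rep(seq(0, 2300, by = 100), each = 50),
  ArrDelay = rnorm(24 * 50, mean = 5, sd = 20)
)

# One-sided Welch t-test of each hour against the whole day,
# keeping only the p-value from each test.
p_raw <- tapply(sim$ArrDelay, sim$Dep,
                function(x) t.test(x, sim$ArrDelay, alternative = "less")$p.value)

# Adjust for having run 24 tests at once.
p_bh   <- p.adjust(p_raw, method = "BH")          # controls the false discovery rate
p_bonf <- p.adjust(p_raw, method = "bonferroni")  # controls the family-wise error rate

# Side-by-side view of raw and adjusted p-values, one row per hour.
round(cbind(raw = p_raw, BH = p_bh, bonferroni = p_bonf), 3)
```

Bonferroni is the more conservative of the two; which correction is appropriate depends on how costly a false positive is for the question being asked.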

[Image: plot]

All vs All

[Image: plot]

emilliman5
  • Thanks a lot! I am new to Stack Overflow. Apologies for not posting the dataset! I understood your code; however, when I run it I get the following output. Am I missing something? `Length Class Mode 0 9 htest list 100 9 htest list 200 9 htest list 500 9 htest list 600 9 htest list 700 9 htest list 800 9 htest list 900 9 htest list 1000 9 htest list 1100 9 htest list 1200 9 htest list 1300 9 htest list 1400 9 htest list 1500 9 htest list` – Anu Nov 30 '16 at 22:37
  • To access the comparison of hour 900 to the entire day, use `pVsDay[[10]]`; to access the comparison between 2200 and 1300, use `pAllvsAll[[23]][[14]]` – emilliman5 Nov 30 '16 at 22:51
  • Thanks a lot! Owe you one big time. With advice like this, one doesn't get intimidated by programming. – Anu Nov 30 '16 at 23:22
  • So I should be able to access just the p-values with `pVsDay[[10]]$p.value`, right? Last question: I am struggling to plot the graph. How did you plot it? Can a function be used inside qplot or ggplot? – Anu Nov 30 '16 at 23:27
  • Correct, `...$p.value` will return just the p-value. It would probably be easiest to extract the p-values to a new object and then plot that. – emilliman5 Nov 30 '16 at 23:29
  • To store the p-values in an object, I am writing the code below. However, I am getting the error "Error in pVsDay[[i]] : attempt to select less than one element in get1index". Can you help me understand where I am going wrong? `dayplist <- NULL; for (i in seq(0,24,1)) { dayplist <- c(dayplist, pVsDay[[i]]$p.value) }` – Anu Dec 01 '16 at 00:11
  • Indexing starts at 1, not 0. Use `seq(1, length(pVsDay), 1)` – emilliman5 Dec 01 '16 at 00:33
  • Can you help me learn how you plotted the above graphs? (Sorry if I sound too ignorant.) `for (i in seq(1,length(pAllvsAll),1)) { allplist <- c(allplist, pAllvsAll[[i]]$p.value) }` followed by `str(allplist)`: this works well with pVsDay, but it doesn't work for pAllvsAll. It gives me NULL. :( – Anu Dec 01 '16 at 01:05
  • `pAllvsAll` is a list of lists; you need to iterate through the second list to get the p.value. Try `str(pAllvsAll)` to see what I mean – emilliman5 Dec 01 '16 at 13:58
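The extraction steps discussed in these comments can be sketched with `sapply()` instead of an index loop, which sidesteps the zero-index error entirely. The example runs on simulated data (the `sim` data frame and the names `dayP` and `allP` are invented for illustration), and the base-graphics calls at the end are just one way to visualise the result, not necessarily how the plots in the answer were produced.

```r
# Simulated stand-in for the airline data (NOT real results).
set.seed(1)
sim <- data.frame(
  Dep      = rep(seq(0, 2300, by = 100), each = 50),
  ArrDelay = rnorm(24 * 50, mean = 5, sd = 20)
)

# Same structure as in the answer: a list of htest objects,
# and a list of lists of htest objects.
pVsDay <- tapply(sim$ArrDelay, sim$Dep,
                 function(x) t.test(x, sim$ArrDelay, alternative = "less"))
pAllvsAll <- tapply(sim$ArrDelay, sim$Dep,
                    function(x) tapply(sim$ArrDelay, sim$Dep,
                                       function(z) t.test(x, z, alternative = "less")))

# One level of nesting: one p-value per hour.
dayP <- sapply(pVsDay, function(test) test$p.value)

# Two levels of nesting: the inner sapply() walks the second list.
# The result is a 24 x 24 matrix; column j holds the tests of hour j
# against every other hour, so allP[14, 23] matches pAllvsAll[[23]][[14]].
allP <- sapply(pAllvsAll,
               function(tests) sapply(tests, function(test) test$p.value))

# Base-graphics plots of the extracted p-values.
barplot(dayP, las = 2, xlab = "Departure hour", ylab = "p-value vs. whole day")
image(x = 1:24, y = 1:24, z = allP,
      xlab = "Compared-against hour (index)", ylab = "Tested hour (index)")
```

Once the p-values are in a plain vector or matrix like this, they can also be put in a data frame and passed to ggplot2 if that is preferred.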