I have a large CSV file which I imported into R for some data analysis. Basically, it is a file with flight delays for a few years, and I am trying to create a graph showing the average delay per day of the week. I tried a histogram, but the resulting plot is not usable. Any ideas? Would another type of graph work better? Also, is there an easy way to compare on-time flights to delayed flights per day of the week?

The data frame is named airline:

str(airline)

'data.frame':   7009728 obs. of  29 variables:
 $ Year             : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
 $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
 $ DayofMonth       : int  3 3 3 3 3 3 3 3 3 3 ...
 $ DayOfWeek        : int  4 4 4 4 4 4 4 4 4 4 ...
 $ DepTime          : int  2003 754 628 926 1829 1940 1937 1039 617 1620 ...
 $ CRSDepTime       : int  1955 735 620 930 1755 1915 1830 1040 615 1620 ...
 $ ArrTime          : int  2211 1002 804 1054 1959 2121 2037 1132 652 1639 ...
 $ CRSArrTime       : int  2225 1000 750 1100 1925 2110 1940 1150 650 1655 ...
 $ UniqueCarrier    : Factor w/ 20 levels "9E","AA","AQ",..: 18 18 18 18 18 18 18 18 18 18 ...
 $ FlightNum        : int  335 3231 448 1746 3920 378 509 535 11 810 ...
 $ TailNum          : Factor w/ 5374 levels "","80009E","80019E",..: 3769 4129 1961 3059 2142 3852 4062 1961 3616 3324 ...
 $ ActualElapsedTime: int  128 128 96 88 90 101 240 233 95 79 ...
 $ CRSElapsedTime   : int  150 145 90 90 90 115 250 250 95 95 ...
 $ AirTime          : int  116 113 76 78 77 87 230 219 70 70 ...
 $ ArrDelay         : int  -14 2 14 -6 34 11 57 -18 2 -16 ...
 $ DepDelay         : int  8 19 8 -4 34 25 67 -1 2 0 ...
 $ Origin           : Factor w/ 303 levels "ABE","ABI","ABQ",..: 136 136 141 141 141 141 141 141 141 141 ...
 $ Dest             : Factor w/ 304 levels "ABE","ABI","ABQ",..: 287 287 49 49 49 151 157 157 177 177 ...
 $ Distance         : int  810 810 515 515 515 688 1591 1591 451 451 ...
 $ TaxiIn           : int  4 5 3 3 3 4 3 7 6 3 ...
 $ TaxiOut          : int  8 10 17 7 10 10 7 7 19 6 ...
 $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CancellationCode : Factor w/ 5 levels "","A","B","C",..: 1 1 1 1 1 1 1 1 1 1
 $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CarrierDelay     : int  NA NA NA NA 2 NA 10 NA NA NA ...
 $ WeatherDelay     : int  NA NA NA NA 0 NA 0 NA NA NA ...
 $ NASDelay         : int  NA NA NA NA 0 NA 0 NA NA NA ...
 $ SecurityDelay    : int  NA NA NA NA 0 NA 0 NA NA NA ...
 $ LateAircraftDelay: int  NA NA NA NA 32 NA 47 NA NA NA ...

My attempt at the graph:

library(ggplot2)
ggplot(airline,aes(x = DayOfWeek, fill = factor(DepDelay))) +
  geom_histogram(binwidth = 1) +
  xlab ("Day of week") +
  ylab ("Dep Delay") +
  labs (fill = "Airline")

1 Answer

To a great extent, it depends on what you want to show. I made a small example using the flights data available in the nycflights13 package. You can use the code below to experiment with charts that meet your analytical requirements.

Code

# Libs and data -----------------------------------------------------------

# Load the packages used below
library(nycflights13)  # flights data
library(dplyr)         # data manipulation and %>%
library(ggplot2)       # plotting
library(ggthemes)      # theme_wsj()

# Work -------------------------------------------------------------------

flights %>%
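    # Create week day summary; paste() coerces year/month/day to character,
    # so no explicit conversion (the deprecated mutate_each/funs) is needed
    mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>% 
    mutate(weekday = weekdays(date, abbreviate = FALSE)) %>% 
    # Drop rows with a missing value in any column before averaging
    na.omit() %>% 
    group_by(weekday, carrier) %>% 
    summarise(mean_dl = round(mean(dep_delay), 2)) %>% 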
    ggplot(aes(x = as.factor(weekday), y = mean_dl)) +
    geom_bar(stat = "identity") +
    facet_wrap(~carrier) +
    xlab("Day") +
    ylab("Mean Dep Delay") +
    theme_wsj() +
    theme(axis.text.x = element_text(angle = 90))

Results

For example, this could be a modest start:

[Figure: Sample flights data]

If you want a better answer, I suggest having a look at this discussion on producing a good R example. I would also take the liberty of suggesting that you:

  • Post a neat data extract that is easy for others to work with.
  • Elaborate on the problem you are facing with the particular chart you want to develop.
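
For the data frame in the question, the same idea can be applied directly to the DayOfWeek and DepDelay columns, so no date needs to be rebuilt. The following is only a sketch under a few assumptions: airline_toy is a made-up stand-in for the real airline data, and DayOfWeek is taken to be coded 1 = Monday to 7 = Sunday (3 January 2008 was a Thursday and shows up as 4 in the str() output above, which is consistent with that coding).

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the `airline` data frame (values are made up)
airline_toy <- data.frame(
    DayOfWeek = c(1, 1, 2, 2, 3, 3, 7),
    DepDelay  = c(10, 20, -5, 5, NA, 30, 0)
)

# Assumed coding: 1 = Monday, ..., 7 = Sunday
day_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

mean_delay <- airline_toy %>%
    # Keep only flights with a recorded departure delay
    filter(!is.na(DepDelay)) %>%
    # Turn the integer code into an ordered, labelled factor
    mutate(DayOfWeek = factor(DayOfWeek, levels = 1:7, labels = day_labels)) %>%
    group_by(DayOfWeek) %>%
    summarise(mean_dl = mean(DepDelay), .groups = "drop")

# One bar per day; bar height is the mean departure delay
p <- ggplot(mean_delay, aes(x = DayOfWeek, y = mean_dl)) +
    geom_col() +
    xlab("Day of week") +
    ylab("Mean departure delay")
```

Summarising before plotting means ggplot2 only has to draw one bar per day rather than bin millions of rows, which is likely why the histogram attempt was hard to read. Replacing airline_toy with airline should give the corresponding chart for the full data.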

Comparing flight delays

You can make further use of the dplyr grammar to compare on-time and delayed flights.

Code

For example, you could use the code below to count the on-time and delayed flights for each day of the week:

flights %>%
    # Create week day summary; paste() coerces year/month/day to character
    mutate(date = as.Date(paste(year, month, day, sep = "-"))) %>% 
    mutate(weekday = weekdays(date, abbreviate = FALSE)) %>% 
    # Create flag for on time / dly; early departures (negative delay)
    # count as on time
    mutate(ontime = ifelse(dep_delay <= 0, "on-time", "delayed")) %>% 
    # Drop rows with a missing value in any column before counting
    na.omit() %>% 
    group_by(weekday, ontime) %>% 
    summarise(count_flights = n()) 
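
Raw counts are hard to compare across days with different numbers of flights, so one option is to convert them to shares and draw a stacked bar per day. The sketch below is illustrative only: toy is a made-up data frame standing in for the per-flight data, and it assumes that a flight leaving on schedule or early (dep_delay <= 0) counts as on time.

```r
library(dplyr)
library(ggplot2)

# Hypothetical per-flight data (values are made up)
toy <- data.frame(
    weekday   = c("Monday", "Monday", "Monday", "Tuesday", "Tuesday"),
    dep_delay = c(-3, 0, 12, 25, NA)
)

status_counts <- toy %>%
    # Flights without a recorded delay cannot be classified
    filter(!is.na(dep_delay)) %>%
    # Assumption: leaving early or exactly on schedule counts as on time
    mutate(ontime = ifelse(dep_delay <= 0, "on-time", "delayed")) %>%
    count(weekday, ontime, name = "count_flights") %>%
    # Share of each status within its weekday
    group_by(weekday) %>%
    mutate(share = count_flights / sum(count_flights)) %>%
    ungroup()

# Stacked bars: each day sums to 1, split by on-time vs delayed
p_share <- ggplot(status_counts, aes(x = weekday, y = share, fill = ontime)) +
    geom_col() +
    xlab("Day") +
    ylab("Share of flights") +
    labs(fill = "Status")
```

Because each bar sums to 1, the delayed share can be read off directly for every day regardless of how many flights operated.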