Apologies in advance for my English; it is not my first language.
I have a dataset of delay counts and delay durations for domestic US flights. My goal is to investigate whether certain airports cause more delays than others.
Specifically, I want to look at the top 20 airports by passenger count and compare them to the overall average. I started off with:
> data <- read.csv("air.csv", header=T)
> str(data)
'data.frame': 263214 obs. of 22 variables:
$ year : int 2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
$ X.month : int 1 1 1 1 1 1 1 1 1 1 ...
$ carrier : chr "DL" "DL" "DL" "DL" ...
$ airport : chr "PBI" "PDX" "PHL" "PHX" ...
$ arr_flights : num 650 314 513 334 217 181 10 31 216 122 ...
$ arr_del15 : num 126 61 97 78 47 42 3 2 42 21 ...
$ carrier_ct : num 21.06 14.09 27.6 20.14 8.08 ...
$ X.weather_ct : num 6.44 2.61 0.42 2.02 0.44 1.06 1 0 0.43 0 ...
$ nas_ct : num 51.6 34.2 51.9 39.4 21.9 ...
$ security_ct : num 1 0 0 0 0 0 0 0 0 0 ...
$ late_aircraft_ct : num 45.9 10.1 17.1 16.4 16.6 ...
$ arr_cancelled : num 4 30 15 3 4 2 1 0 3 2 ...
$ arr_diverted : num 0 3 0 1 1 0 0 0 0 0 ...
$ X.arr_delay : num 5425 2801 4261 3400 1737 ...
$ X.carrier_delay : num 881 478 1150 1159 350 ...
$ weather_delay : num 397 239 16 166 28 195 189 0 12 0 ...
$ nas_delay : num 2016 1365 2286 1295 522 ...
$ security_delay : num 15 0 0 0 0 0 0 0 0 0 ...
$ late_aircraft_delay: num 2116 719 809 780 837 ...
> nrow(data)
[1] 263214
I then tried to keep only the rows for those 20 airports:
> airport <- data[which(data$airport==
+ c("ATL","BOS","CLT","DEN","DFW",
+ "DTW","EWR","FLL","IAH","JFK",
+ "LAS","LAX","MCO","MIA","MSP",
+ "ORD","PHL","PHX","SEA","SFO")),]
Warning message:
In data$airport == c("ATL", "BOS", "CLT", "DEN", "DFW", "DTW", "EWR", :
longer object length is not a multiple of shorter object length
> nrow(airport)
[1] 2406
I had filtered the same data in Excel beforehand and got 46521 rows for these airports, so I am not sure why R returns only 2406. Could someone please explain what the warning means and how to select all the matching rows?
Thanks in advance!
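
In case it helps, here is a small reproduction on toy data. The vectors codes and top are made up for illustration and are not taken from air.csv. As far as I can tell, == compares the two vectors position by position and recycles the shorter one, whereas %in% tests membership, but I may be misreading this:

```r
# Toy data (hypothetical): five airport codes and a two-element "top" list
codes <- c("BOS", "ATL", "ATL", "BOS", "SEA")
top   <- c("ATL", "BOS")

# `==` recycles `top` to ATL, BOS, ATL, BOS, ATL and compares pairwise,
# so a code only matches if it happens to sit at the "right" position.
# This also raises the "longer object length is not a multiple" warning,
# because 5 is not a multiple of 2.
by_equals <- codes == top
sum(by_equals)   # 2

# `%in%` asks, for each code, whether it appears anywhere in `top`
by_in <- codes %in% top
sum(by_in)       # 4
```

If that reading is right, the same effect would explain my result: with 263214 rows and 20 codes, == only keeps rows whose airport matches the code at that row's position in the recycled list, so most matching rows are dropped. Using data$airport %in% c(...) as the row filter might be what I need, but I would appreciate confirmation.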