I have a dataset with route_ID
and 100 other variables and I need to add the flight length to each observation. So I have created another dataset with two variables, the route_ID
, and the flight distance. The original dataset has 125k observations, whereas the distances one 13k as I have eliminated duplicate values. Moreover, the original dataset has 4k different routes, whereas the distance one has 11k different routes. This is to be sure that out of those 11k, there will be all the 4k ones that need to be matched.
This is what dput(head(airplanes))
produces
structure(list(ap_id = c("15304 12478", "12478 15304", "15304 12953",
"13303 12478", "13303 12953", "14986 12478"), ORIGIN_AIRPORT_ID = c(15304L,
12478L, 15304L, 13303L, 13303L, 14986L), DEST_AIRPORT_ID = c(12478L,
15304L, 12953L, 12478L, 12953L, 12478L), distance = c(1005L,
1005L, 1010L, 1089L, 1096L, 1041L)), row.names = c(NA, 6L), class = "data.frame")
This is what dput(head(distance))
produces
structure(list(apc_id = structure(c("10135 10397 DL", "10135 10397 DL",
"10135 10397 FL", "10135 10397 FL", "10135 11057 US", "10135 11057 US"
), label = "Route-carrier unique identifier", format.stata = "%16s"),
rcid = structure(c(1, 1, 2, 2, 6, 6), label = "Route-carrier unique identifier", format.stata = "%14.0g", labels = c(`10135 10397 DL` = 1,
`10135 10397 FL` = 2, `10135 10721 FL` = 3, `10135 10821 TW` = 4,
`10135 11042 RU` = 5, `10135 11057 US` = 6, `10135 11193 DL` = 7,
This continues for A LOT of rows, giving each route_code and the number of times it appears. Not sure why the two codes are giving such a different output
I was sure that left_join
would have been the way to go, as it is supposed to keep the rows of the OG one and add the values from the second one only for the ones that match. This was my code.
left_join(airplanes, distance, by = "route_ID")
However, this generates a new dataset with more than 600k observations. The weird thing also is that many variables, such as ticket_price
, have zero NAs in the new dataset and the percentile is almost exactly the same as the one in the OG dataset.
If helpful, this is the dropbox link to the OG dataset https://www.dropbox.com/s/t9ptw6a9tuuh4tg/DB1B-T100-MSA-F41%20Final%20V2.dta?dl=0
And this is the one to the distances dataset https://www.dropbox.com/s/m0k2i1vfqfnb7o7/distance_2.csv?dl=0
Anyone knows what is happening here and how could I match them as I wish?