0

I have a dataset with route_ID and 100 other variables and I need to add the flight length to each observation. So I have created another dataset with two variables, the route_ID, and the flight distance. The original dataset has 125k observations, whereas the distances one 13k as I have eliminated duplicate values. Moreover, the original dataset has 4k different routes, whereas the distance one has 11k different routes. This is to be sure that out of those 11k, there will be all the 4k ones that need to be matched.

This is what dput(head(airplanes)) produces

structure(list(ap_id = c("15304 12478", "12478 15304", "15304 12953", 
"13303 12478", "13303 12953", "14986 12478"), ORIGIN_AIRPORT_ID = c(15304L, 
12478L, 15304L, 13303L, 13303L, 14986L), DEST_AIRPORT_ID = c(12478L, 
15304L, 12953L, 12478L, 12953L, 12478L), distance = c(1005L, 
1005L, 1010L, 1089L, 1096L, 1041L)), row.names = c(NA, 6L), class = "data.frame")

This is what dput(head(distance)) produces

structure(list(apc_id = structure(c("10135 10397 DL", "10135 10397 DL", 
"10135 10397 FL", "10135 10397 FL", "10135 11057 US", "10135 11057 US"
), label = "Route-carrier unique identifier", format.stata = "%16s"), 
    rcid = structure(c(1, 1, 2, 2, 6, 6), label = "Route-carrier unique identifier", format.stata = "%14.0g", labels = c(`10135 10397 DL` = 1, 
    `10135 10397 FL` = 2, `10135 10721 FL` = 3, `10135 10821 TW` = 4, 
    `10135 11042 RU` = 5, `10135 11057 US` = 6, `10135 11193 DL` = 7,

This continues for A LOT of rows, giving each route_code and the number of times it appears. Not sure why the two codes are giving such a different output

I was sure that left_join would have been the way to go, as it is supposed to keep the rows of the OG one and add the values from the second one only for the ones that match. This was my code.

left_join(airplanes, distance, by = "route_ID")

However, this generates a new dataset with more than 600k observations. The weird thing also is that many variables, such as ticket_price, have zero NAs in the new dataset and the percentile is almost exactly the same as the one in the OG dataset.

If helpful, this is the dropbox link to the OG dataset https://www.dropbox.com/s/t9ptw6a9tuuh4tg/DB1B-T100-MSA-F41%20Final%20V2.dta?dl=0

And this is the one to the distances dataset https://www.dropbox.com/s/m0k2i1vfqfnb7o7/distance_2.csv?dl=0

Anyone knows what is happening here and how could I match them as I wish?

  • Does this answer your question? [How to join (merge) data frames (inner, outer, left, right)](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) – Pax May 27 '22 at 22:51
  • 1
    I assume this is in R? I added the R tag for language but please correct me if I am wrong – jpsmith May 27 '22 at 22:55
  • 1
    We don't have any idea what your data looks like. Can you please edit your question to include the output produced when you run `dput(head(airplanes))` and `dput(head(distance))`? – Jon Spring May 27 '22 at 23:23
  • 1
    You need to copy/paste the output of `dput()`, not add an image. It lets people copy the data into their `R` env – BHudson May 27 '22 at 23:41
  • In the meantime, I suggest you check [this question on merging out](https://stackoverflow.com/questions/24150765/why-does-merge-result-in-more-rows-than-original-data) that discusses a common reason duplicates occur when merging – BHudson May 27 '22 at 23:43
  • [Why should I not upload images of code/data/errors when asking a question?](https://meta.stackoverflow.com/q/285551/3404097) [Why are images of text, code and mathematical expressions discouraged?](https://meta.stackexchange.com/q/320052/266284). Please in code questions give a [mre]. [ask] [Help] Please before considering posting read the manual/reference & google any error message & many clear, concise & precise phrasings of your question/problem/goal, with & without your particular names/strings/numbers, 'site:stackoverflow.com' & tags; read many answers. – philipxy May 27 '22 at 23:58

0 Answers0