0

I have 2 dataframes

x <- data.frame(ID=c("A","B","C","D"),y=c(1,2,3,4),animal=c("squid","shark", NA, NA))

which looks like

  ID y animal
1  A 1  squid
2  B 2  shark
3  C 3   <NA>
4  D 4   <NA>

and

y <- data.frame(ID=c("A","B","C","D"),y=c(1,2,3,4),animal=c(NA,NA, "dog", "cat"))

which looks like

  ID y animal
1  A 1   <NA>
2  B 2   <NA>
3  C 3    dog
4  D 4    cat

I want to merge these datasets by their ID so that the animal columns complete each other. However, when I try to do this using merge or the joins offered by dplyr I get these results

#example code 
merge(x, y, by = "ID")

#output 
  ID y.x animal.x y.y animal.y
1  A   1    squid   1     <NA>
2  B   2    shark   2     <NA>
3  C   3     <NA>   3      dog
4  D   4     <NA>   4      cat

What I'd like is for the output to just be

  ID y animal
1  A 1  squid
2  B 2  shark
3  C 3   dog
4  D 4   cat

There are ways to go about combining that with something like a loop that creates a final column where the incomplete data is tossed and only complete data is kept, but I'm curious if there are handy functions out there that can fix this with a one-liner

It seems like people have asked similar questions to this, but most of the posts I found were more interested in just renaming the shared column names between the 2 datasets to avoid the .x and .y results merge produces

Jin
  • 527
  • 6
  • 21
  • 1
    Does this answer your question? [Can I replace NAs when joining two data frames with dplyr?](https://stackoverflow.com/questions/39110179/can-i-replace-nas-when-joining-two-data-frames-with-dplyr) – caldwellst Feb 20 '20 at 05:02
  • This seems pretty similar, but it seems the first dplyr answer doesn't work when I run it and returns the error message `Error in select(., -var2.x, -var2.y) : unused arguments (-var2.x, -var2.y)` – Jin Feb 20 '20 at 05:11
  • As a note, if you want to use `merge()`, you can do something like this. `foo <- merge(x, y, by = intersect(names(x), names(y)), all = TRUE); subset(foo, complete.cases(animal))` – jazzurro Feb 20 '20 at 05:16
  • @762 The first `dplyr` answer works if you use column names which is present in your data i.e change from `var2.x` and `var2.y` to `animal.x` and `animal.y`. – Ronak Shah Feb 20 '20 at 05:22
  • @jazzurro I think the semicolon doesn't work there because I still get the NA rows in animals (which is odd), but when it's on a separate line it somehow works. Do you notice the same issue? – Jin Feb 20 '20 at 05:24
  • @RonakShah I actually ran the exact same code on the same dataframe objects in those answers, but it returned that error message I mentioned above. I just copy pasted everything. The datatable example worked, but the dplyr one did not – Jin Feb 20 '20 at 05:24
  • If I copy-paste it as it is, it works fine for me without any error. You have package `MASS` loaded in your environment probably. Try with `dplyr::select` instead. – Ronak Shah Feb 20 '20 at 05:28
  • @762 Then, you can create two lines in your script. – jazzurro Feb 20 '20 at 05:40
  • @jazzurro yes, that what I did, but I was confused why the semicolon doesnt work like normally. Was curious to see if there's something wrong in my environment – Jin Feb 20 '20 at 05:43

1 Answers1

0
library(dplyr)
x %>%
   left_join(y,by=c("ID","y")) %>%
   mutate(animal = coalesce(animal.x,animal.y))
  ID y animal.x animal.y animal
1  A 1    squid     <NA>  squid
2  B 2    shark     <NA>  shark
3  C 3     <NA>      dog    dog
4  D 4     <NA>      cat    cat


y <- data.frame(ID=c("A","B","C","D"),y=c(1,2,3,4),animal=c(NA,NA, "dog", "cat"),stringsAsFactors = F)

y <- data.frame(ID=c("A","B","C","D"),y=c(1,2,3,4),animal=c(NA,NA, "dog", "cat"),stringsAsFactors = F)
Onyambu
  • 67,392
  • 3
  • 24
  • 53