How to identify unique data between two dataframes

Question

I am sorry if the title isn't the best. I am not sure how to put this in the correct terms.

I am doing some filtering using dplyr. To give a little background, df1 is a list of all of the human genes. df2 has a list of genes involved in some pathway. The software that gives me the list for df2 doesn't always use the same gene name that is in df1, so those genes get skipped when I use this filter:

filtered <- df1 %>%
    filter(gene.name %in% df2$V1)

So I am missing some of the data that I am interested in. Is there a way to compare the new data frame, called filtered, to df2 with some code that marks the entries that differ? Most of the gene names in filtered will also be in df2; the only extra entries in df2 will be the gene names that were incorrect. I want to do this so I can go back and correct those gene names. df1 and df2 are much larger than the examples, so the mismatches aren't easy to catch by eye.

Here is an example of what I am saying so maybe it will make more sense

df1

gene.name
ADCY1
ADCY2
ADCY3
ADCY4

df2

gene.name
AC1
ADCY2
AC3
ADCY4

filtered

gene.name
ADCY2
ADCY4

I'm still not sure what you're trying to get. Are you trying to find genes in `df2` that are not in `df1`? — divibisan, Mar 27 '19 at 17:26
Possible duplicate of [How to join (merge) data frames (inner, outer, left, right)?](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) — NelsonGon, Mar 27 '19 at 17:29
The `df1`, `df2`, `filtered` examples at the bottom are a good start. Please add in that same format your desired result. — Gregor Thomas, Mar 27 '19 at 17:35
@divibisan I am filtering `df1` by `df2`, giving `filtered`. I want to know what is different between `filtered` and `df2`. For example, `df2` has four genes and `filtered` has two genes. I want to know which two genes `df2` has that `filtered` doesn't, so I can go back to `df2` and find the correct gene names for AC1 and AC3 — neuron, Mar 27 '19 at 17:36
Seems like creating `filtered` isn't really necessary. Add a column to `df1` instead: `df1 %>% mutate(in_df2 = gene.name %in% df2$V1)`. — Gregor Thomas, Mar 27 '19 at 17:37
If you want genes in `df2` that are not in `df1`, is this what you are looking for: `df2$gene.name[!df2$gene.name %in% df1$gene.name]`? — Andrew, Mar 27 '19 at 17:40
Okay, voting to close as typo. You basically wanted to create a column and used a command to subset instead. (Do please be more specific about your desired output in the future. Showing an example *and* describing in words is best.) — Gregor Thomas, Mar 27 '19 at 17:47
@Gregor That is fine with me. I am sorry about that. I searched for what I was trying to describe but I couldn't seem to find what I was looking for. I will do that in the future. Thank you so much for the help — neuron, Mar 27 '19 at 17:50
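
The suggestions in the comments above can be tried on the question's sample data. This is a minimal base R sketch, assuming both data frames store the names in a column called `gene.name` (the question's own filter refers to the `df2` column as `V1`); `in_df1` and `unmatched` are illustrative names, not part of the original thread.

```r
# Sample data copied from the question. The column name gene.name in df2 is
# an assumption; the question's filter code calls that column V1.
df1 <- data.frame(gene.name = c("ADCY1", "ADCY2", "ADCY3", "ADCY4"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(gene.name = c("AC1", "ADCY2", "AC3", "ADCY4"),
                  stringsAsFactors = FALSE)

# Flag each row of df2 rather than dropping rows, so mismatches stay visible.
df2$in_df1 <- df2$gene.name %in% df1$gene.name

# Names in df2 with no match in df1: the ones to go back and correct.
unmatched <- df2$gene.name[!df2$in_df1]
print(unmatched)  # "AC1" "AC3"

# setdiff() returns the same names as a set (duplicates removed).
print(setdiff(df2$gene.name, df1$gene.name))  # "AC1" "AC3"
```

Flagging `df2` rather than `df1` is a deliberate variation here: the goal stated in the question is to find the `df2` names that need correcting, so the logical column is attached to `df2`.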

score 1 · Accepted Answer · answered Apr 08 '19 at 01:17

Another fun way to get the not-in functionality is to write a little function like

`%notin%` <- function(x, y) !(x %in% y)

Then you can get the answer in a nice tidy way:

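library(dplyr)
unique_differences <- df2 %>%   # the pipe must end the line, not start the next one
    filter(gene.name %notin% df1$gene.name)   # keep df2 rows whose name is absent from df1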
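
For anyone who wants to try this end to end, here is a self-contained sketch on the question's sample data. It assumes both data frames keep the names in a column called `gene.name`, and it adds `anti_join()` from dplyr as an alternative that was not part of the original answer; `same_rows` is an illustrative name.

```r
library(dplyr)

# Sample data copied from the question (column name gene.name is assumed).
df1 <- data.frame(gene.name = c("ADCY1", "ADCY2", "ADCY3", "ADCY4"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(gene.name = c("AC1", "ADCY2", "AC3", "ADCY4"),
                  stringsAsFactors = FALSE)

# Negated version of %in%.
`%notin%` <- function(x, y) !(x %in% y)

# Rows of df2 whose gene name does not appear in df1.
unique_differences <- df2 %>%
    filter(gene.name %notin% df1$gene.name)
print(unique_differences$gene.name)  # "AC1" "AC3"

# Alternative: anti_join() keeps rows of df2 with no match in df1 on the key.
same_rows <- anti_join(df2, df1, by = "gene.name")
print(same_rows$gene.name)  # "AC1" "AC3"
```

The `%notin%` version reads naturally inside a longer `filter()` condition, while `anti_join()` may be more convenient if the match ever needs to use more than one column.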

Joe Erinjeri


That works very nicely as well. I changed the name of df2 to `filter` and that gave me some issues, because the software thought I was listing `filter` twice. So if anyone is reading this, don't do that. Thanks for the answer!! I appreciate you taking the time to write that out. – neuron Apr 09 '19 at 15:10
