I am sorry if the title isn't the best. I am not sure how to put this in the correct terms.

I am doing some filtering using dplyr. To give a little background, df1 is a list of all of the human genes, and df2 is a list of genes involved in some pathway. The software that generates the list for df2 doesn't always use the gene name that appears in df1, so those genes get skipped when I use this filter:

filtered <- df1 %>%
    filter(gene.name %in% df2$V1)

So I am missing some of the data that I am interested in. Is there a way to compare the new data frame, filtered, to df2 with some code that marks the differences? Most of the gene names in filtered will also be in df2; the only extra entries in df2 will be the gene names that were incorrect. I want to find these so that I can go back and correct them. My real df1 and df2 are much larger than the examples, so the mismatches aren't easy to catch by eye.

Here is an example of what I mean, so maybe it will make more sense:

df1

gene.name
ADCY1
ADCY2
ADCY3
ADCY4

df2

gene.name
AC1
ADCY2
AC3
ADCY4

filtered

gene.name
ADCY2
ADCY4

neuron
  • Did you search first? This has been asked many times. – Rich Scriven Mar 27 '19 at 17:26
  • I'm still not sure what you're trying to get. Are you trying to find genes in `df2` that are not in `df1`? – divibisan Mar 27 '19 at 17:26
  • Possible duplicate of [How to join (merge) data frames (inner, outer, left, right)?](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) – NelsonGon Mar 27 '19 at 17:29
  • The `df1`, `df2`, `filtered` examples at the bottom are a good start. Please add in that same format your desired result. – Gregor Thomas Mar 27 '19 at 17:35
  • @divibisan I am filtering `df1` by `df2`, giving `filtered`. I want to know what is different between `filtered` and `df2`. For example, `df2` has four genes and `filtered` has two genes. I want to know which two genes `df2` has that `filtered` doesn't, so I can go back to `df2` and find the correct gene names for AC1 and AC3. – neuron Mar 27 '19 at 17:36
  • @NelsonGon I don't need to merge the data frames. – neuron Mar 27 '19 at 17:36
  • Seems like creating `filtered` isn't really necessary. Add a column to `df1` instead: `df1 %>% mutate(in_df2 = gene.name %in% df2$V1)`. – Gregor Thomas Mar 27 '19 at 17:37
  • If you want genes in `df2` that are not in `df1`, is this what you are looking for: `df2$gene.name[!df2$gene.name %in% df1$gene.name]`? – Andrew Mar 27 '19 at 17:40
  • @Gregor Oh... wow, that actually is perfect. – neuron Mar 27 '19 at 17:43
  • Okay, voting to close as typo. You basically wanted to create a column and used a command to subset instead. (Do please be more specific about your desired output in the future. Showing an example *and* describing in words is best.) – Gregor Thomas Mar 27 '19 at 17:47
  • @Gregor That is fine with me. I am sorry about that. I searched for what I was trying to describe but I couldn't seem to find what I was looking for. I will do that in the future. Thank you so much for the help. – neuron Mar 27 '19 at 17:50
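
For readers who want to try the two suggestions from the comments, here is a minimal sketch using the example data from the question. It assumes both data frames keep their gene names in a column called `gene.name`, as in the example (the original filter used `df2$V1`, so adjust the column name to match your data). The `setdiff()` line is an extra base R alternative that was not mentioned in the comments.

```r
library(dplyr)

# Example data copied from the question; the column name gene.name is assumed
df1 <- data.frame(gene.name = c("ADCY1", "ADCY2", "ADCY3", "ADCY4"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(gene.name = c("AC1", "ADCY2", "AC3", "ADCY4"),
                  stringsAsFactors = FALSE)

# Flag every row of df1 instead of dropping the rows that do not match
flagged <- df1 %>%
    mutate(in_df2 = gene.name %in% df2$gene.name)

# Names in df2 with no match in df1: the ones to go back and correct
missing_from_df1 <- df2$gene.name[!df2$gene.name %in% df1$gene.name]

# setdiff() returns the same names with any duplicates removed
missing_unique <- setdiff(df2$gene.name, df1$gene.name)
```

With the example data, `missing_from_df1` contains AC1 and AC3, which are the two names that would need correcting.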

1 Answer

Another fun way to get the not-in functionality is to write a little function like this:

`%notin%` <- function(x, y) !(x %in% y)

Then you can get the answer in a nice tidy way:

# The pipe must end the first line; a line that starts with %>% is a syntax error
unique_differences <- df2 %>%
    filter(gene.name %notin% unlist(df1))
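
As a self-contained check, the sketch below runs the same idea on the example data from the question. It assumes both data frames have a `gene.name` column, and it compares against `df1$gene.name` directly rather than `unlist(df1)`. The `anti_join()` call is an alternative from dplyr that is not part of the answer above; it keeps the rows of `df2` that have no match in `df1`.

```r
library(dplyr)

# Example data copied from the question; the column name gene.name is assumed
df1 <- data.frame(gene.name = c("ADCY1", "ADCY2", "ADCY3", "ADCY4"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(gene.name = c("AC1", "ADCY2", "AC3", "ADCY4"),
                  stringsAsFactors = FALSE)

# Negated version of %in%
`%notin%` <- function(x, y) !(x %in% y)

# Rows of df2 whose gene name does not appear in df1
unique_differences <- df2 %>%
    filter(gene.name %notin% df1$gene.name)

# Alternative: anti_join() keeps the rows of df2 without a match in df1
unique_differences_join <- anti_join(df2, df1, by = "gene.name")
```

Both results hold the rows for AC1 and AC3 with this data. The join version may be more convenient when the match depends on more than one column.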
Joe Erinjeri
  • That works very nicely as well. I changed the name of df2 to `filter` and that gave me some issues because the software thought I was listing `filter` twice. So if anyone is reading this, don't do that. Thanks for the answer!! I appreciate you taking the time to write that out. – neuron Apr 09 '19 at 15:10