I am using the binary operator %in% to subset a data frame (I got the idea from another Stack Overflow thread), but when I double-check the result by switching the arguments, I get a different number of rows. I've read the R documentation on the match() function, and it seems like neither match() nor %in% should depend on the order of its arguments. I really need to understand exactly what is happening to be confident in my results. Could anybody provide some insight?

> filtered_ordGeneNames_proteinIDs <- ordGeneNames_ProteinIDs[ordGeneNames_ProteinIDs$V4 %in% ordDEGs$X, ];
> filtered2_ordGeneNames_proteinIDs <- ordDEGs[ordDEGs$X %in% ordGeneNames_ProteinIDs$V4, ];
> nrow(filtered_ordGeneNames_proteinIDs)
[1] 5767
> nrow(filtered2_ordGeneNames_proteinIDs)
[1] 5746
mentler2

1 Answer

Of course you have different results:

ordGeneNames_ProteinIDs$V4 %in% ordDEGs$X

tells you, for each element of ordGeneNames_ProteinIDs$V4, whether it also appears in ordDEGs$X (one TRUE/FALSE per element of the left-hand vector)

whereas:

ordDEGs$X %in% ordGeneNames_ProteinIDs$V4

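tells you, for each element of ordDEGs$X, whether it also appears in ordGeneNames_ProteinIDs$V4. The result always has the length of the left-hand vector, so duplicates on the left are each counted.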

compare

c(1,2,3,4) %in% c(1,2,1, 2)
[1]  TRUE  TRUE FALSE FALSE

to

c(1,2,1, 2) %in% c(1,2,3,4)
[1] TRUE TRUE TRUE TRUE
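
The same effect can be reproduced with two small data frames. This is only a sketch: the column names V4 and X mirror the question, but `genes`, `degs`, and their values are made up for illustration.

```r
# Hypothetical toy data; only the column names come from the question.
genes <- data.frame(V4 = c("a", "b", "b", "c", "d"), stringsAsFactors = FALSE)
degs  <- data.frame(X  = c("a", "b", "c", "e"),      stringsAsFactors = FALSE)

# One logical per row of `genes`: the duplicated "b" matches twice.
in_genes <- genes$V4 %in% degs$X
# One logical per row of `degs`: each value is listed once here.
in_degs <- degs$X %in% genes$V4

nrow(genes[in_genes, , drop = FALSE])  # 4 rows
nrow(degs[in_degs, , drop = FALSE])    # 3 rows

# Counting distinct matched values instead of rows removes the asymmetry.
n_genes_unique <- length(unique(genes$V4[in_genes]))  # 3
n_degs_unique  <- length(unique(degs$X[in_degs]))     # 3
```

The row counts differ only because "b" occurs twice in `genes$V4`. The set of shared values is the same in both directions.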
HubertL
  • Many thanks. Implicit assumptions got me on this one. Gene names are supposed to be unique. There shouldn't have been any duplicates in the `ordGeneNames_ProteinIDs` dataframe, but a quick check with `duplicated()` told me that there definitely were. – mentler2 Mar 02 '16 at 15:03
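
A quick check along the lines of that comment might look like the sketch below. The vector `ids` is a made-up stand-in for a column such as ordGeneNames_ProteinIDs$V4. The functions `anyDuplicated()`, `duplicated()`, and `unique()` are all base R.

```r
# Hypothetical stand-in for a column of gene names that should be unique.
ids <- c("a", "b", "b", "c", "b")

# anyDuplicated() returns the index of the first duplicate, or 0 if there is none.
has_dups <- anyDuplicated(ids) > 0

# duplicated() flags the second and later occurrences of each value.
dup_values <- unique(ids[duplicated(ids)])

# Number of extra rows caused by the repeats.
n_extra <- sum(duplicated(ids))

has_dups    # TRUE
dup_values  # "b"
n_extra     # 2
```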