Please can you help me again?
I have a data frame that contains 4 columns, each holding either a gene symbol or a rank that I have assigned to the gene symbol, like this:
mb_rank mb_gene ts_rank ts_gene
[1] 1 BIRCA 1 MYCN
[2] 2 MYCN 2 MOB4
[3] 3 ATXN1 3 ABHD17C
[4] 4 ABHD17C 4 AEBP2
5 etc... for up to 6000 rows in some data sets.
The ts columns are usually a lot longer than the mb columns.
I want to arrange the data so that non-duplicates are removed, leaving only genes that appear in both gene columns of the data frame, e.g.:
mb_rank mb_gene ts_rank ts_gene
[1] 2 MYCN 1 MYCN
[2] 4 ABHD17C 3 ABHD17C
In this example of the desired outcome, the non-duplicated genes have been removed leaving only genes that appeared in both lists to begin with.
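One way this outcome might be produced, as a rough sketch rather than a tested solution for the full data: treat each rank/gene pair as its own table and join the two on the gene symbol with base R's `merge()`. The data below is just the four example rows shown above, and `mb`, `ts` and `both` are made-up names for illustration. It also assumes that unused cells in the shorter mb columns are `NA` or empty strings.

```r
# Example data copied from the four rows shown above
df <- data.frame(
  mb_rank = 1:4,
  mb_gene = c("BIRCA", "MYCN", "ATXN1", "ABHD17C"),
  ts_rank = 1:4,
  ts_gene = c("MYCN", "MOB4", "ABHD17C", "AEBP2"),
  stringsAsFactors = FALSE
)

# Split into one table per list, dropping rows with no gene
# (assumption: padding in the shorter list is NA or "")
mb <- df[!is.na(df$mb_gene) & df$mb_gene != "", c("mb_rank", "mb_gene")]
ts <- df[!is.na(df$ts_gene) & df$ts_gene != "", c("ts_rank", "ts_gene")]

# Inner join on the gene symbol: only genes present in both lists survive
both <- merge(mb, ts, by.x = "mb_gene", by.y = "ts_gene")

# merge() keeps a single key column, so add ts_gene back for the 4-column layout
both$ts_gene <- both$mb_gene

# Restore the original column order and sort by the mb rank
both <- both[order(both$mb_rank), c("mb_rank", "mb_gene", "ts_rank", "ts_gene")]
rownames(both) <- NULL
print(both)
```

On the four example rows this should leave only the MYCN and ABHD17C rows, each with its own mb and ts rank.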
I have tried many things, like:
1) `df[df$mb_gene %in% df$ts_gene,]`
but it doesn't work and seems to hit some genes and miss others.
2) I attempted to write an IF function, but my skills are too limited.
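To make the first attempt concrete, here is a small sketch on the four example rows, assuming the data frame is called `df` as in the snippet above (`attempt` is a made-up name). The subset keeps whole rows whose `mb_gene` occurs anywhere in `ts_gene`, so the `ts_gene` on each kept row is still whatever happened to be on that row, which may be why the result looks hit and miss.

```r
# Example data copied from the four rows shown above
df <- data.frame(
  mb_rank = 1:4,
  mb_gene = c("BIRCA", "MYCN", "ATXN1", "ABHD17C"),
  ts_rank = 1:4,
  ts_gene = c("MYCN", "MOB4", "ABHD17C", "AEBP2"),
  stringsAsFactors = FALSE
)

# Keep rows whose mb_gene appears somewhere in the ts_gene column
attempt <- df[df$mb_gene %in% df$ts_gene, ]
print(attempt)

# Rows are kept whole, so the two gene columns on a kept row need not match
print(attempt$mb_gene == attempt$ts_gene)
```

The mb side of the result is filtered as intended, but the ts columns are not realigned to it, so each list would need filtering or joining separately.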
I hope I have described this well enough, but if I can clarify anything please ask. I'm really stuck. Thanks in advance!