Comparing two columns in a data frame for matches and from this creating a new data frame that contains the matches

Question

please can you help me again?

I have a data frame that contains 4 columns, which are either a gene symbol or a rank that I have assigned the gene symbol like this:

     mb_rank  mb_gene  ts_rank  ts_gene
[1]  1        BIRCA    1        MYCN
[2]  2        MYCN     2        MOB4
[3]  3        ATXN1    3        ABHD17C
[4]  4        ABHD17C  4        AEBP2
5 etc... for up to 6000 rows in some data sets. 
the ts columns are usually a lot longer than the mb columns.

I want to arrange the data so that non-duplicates are removed thereby leaving only genes that appear in both columns of the data frame e.g.

     mb_rank  mb_gene  ts_rank  ts_gene
[1]  2        MYCN     1        MYCN
[2]  4        ABHD17C  3        ABHD17C

In this example of the desired outcome, the non-duplicated genes have been removed leaving only genes that appeared in both lists to begin with.

I have tried many things like:

`df[df$mb_gene %in% df$ts_gene,]`

but it doesn't work and seems to hit and miss some gene 2) I attempted to write an IF function but my skills are to limited.

I hope I have described this well enough but if I can clarify anything please ask, I'm really stuck. Thanks in advance!

Please add sample data with `dput(head(df,n))` for better reproducibility. — NelsonGon, Apr 16 '20 at 15:35
This is a longer [version](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of my comment. Essentially, it's currently not straightforward to reproduce your example. — NelsonGon, Apr 16 '20 at 15:39

score 1 · Accepted Answer · answered Apr 16 '20 at 15:47

In a data.frame, typically a row is a complete observation, meaning that all data in it correlates (somehow) with the rest. In a survey, one row is either one person (all questions) or one question for one person. In your data here, though, your first row BIRCA and MYCN are completely separate, meaning you want to remove one without removing the other. In a "data-science-y" view, this to me suggests your data is improperly shaped.

In order to do what you want, we need to split them into separate frames.

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
mb_rank  mb_gene  ts_rank  ts_gene
1        BIRCA    1        MYCN
2        MYCN     2        MOB4
3        ATXN1    3        ABHD17C
4        ABHD17C  4        AEBP2")

df1 <- df[,1:2]
df2 <- df[,3:4]
df1
#   mb_rank mb_gene
# 1       1   BIRCA
# 2       2    MYCN
# 3       3   ATXN1
# 4       4 ABHD17C
df2
#   ts_rank ts_gene
# 1       1    MYCN
# 2       2    MOB4
# 3       3 ABHD17C
# 4       4   AEBP2

From here, we can use intersect to find genes in common:

incommon <- intersect(df1$mb_gene, df2$ts_gene)
df1[df1$mb_gene %in% incommon,]
#   mb_rank mb_gene
# 2       2    MYCN
# 4       4 ABHD17C
df2[df2$ts_gene %in% incommon,]
#   ts_rank ts_gene
# 1       1    MYCN
# 3       3 ABHD17C

If you are 100% certain that you will always have the same number of rows in each, then you can merely cbind these together:

cbind(
  df1[df1$mb_gene %in% incommon,],
  df2[df2$ts_gene %in% incommon,]
)
#   mb_rank mb_gene ts_rank ts_gene
# 2       2    MYCN       1    MYCN
# 4       4 ABHD17C       3 ABHD17C

However, if there is a chance that there will be different numbers in each, then you will run into problems. If the number of one is a multiple of the other, you will get "recycling" of data and a warning, but you will still get data (which I think is a mistake):

cbind(
  df1[df1$mb_gene %in% incommon,],
  df2
)
# Warning in data.frame(..., check.names = FALSE) :
#   row names were found from a short variable and have been discarded
#   mb_rank mb_gene ts_rank ts_gene
# 1       2    MYCN       1    MYCN
# 2       4 ABHD17C       2    MOB4
# 3       2    MYCN       3 ABHD17C
# 4       4 ABHD17C       4   AEBP2

If not a multiple, though, you'll just get an error:

cbind(
  df1[df1$mb_gene %in% incommon,],
  df2[1:3,]
)
# Error in data.frame(..., check.names = FALSE) : 
#   arguments imply differing number of rows: 2, 3

I suggest that you think about this storage structure, as I believe it defeats assumptions that some tools make about rows of a frame.

score 0 · Answer 2 · answered Apr 16 '20 at 15:50

0

Use: df_new is your new dataframe.

df_new = df[df['mb_gene'] == df['ts_gene']]

answered Apr 16 '20 at 15:50

kanna

83
1
6

score 0 · Answer 3 · answered Apr 16 '20 at 15:59

0

Without more details, it's hard to know about edge cases. In any case, it sounds like a relational table join. Have you tried:

d1 = select(df, c(mb_rank, mb_gene))
d2 = select(df, c(ts_rank, ts_gene))
merge(d1, d2, by.x="mb_gene", by.y="ts_gene")

answered Apr 16 '20 at 15:59

F Trias

189
8

Comparing two columns in a data frame for matches and from this creating a new data frame that contains the matches

3 Answers3