Identify identical columns in two data frames and extract them in R

Question

I have two data frames: mRNA (here) and RPPA (here). The mRNA data frame has 1,212 columns, while the RPPA data frame has 937 columns. All column names in the RPPA data frame also appear in the mRNA data frame (but not in the same order). Within the columns, the values differ between the two data frames.
I want to create a new mRNA data frame that contains the same columns as the RPPA data frame, in the same order, and none of the other columns of the ("old") mRNA data frame.
An example:

mRNA <- data.frame(A=c(25,76,23,45), B=c(56,89,12,452), C=c(45,456,243,5), D=c(13,65,23,16), E=c(17:20), F=c(256,34,0,5))  
RPPA <- data.frame(B=c(46,47,45,49), A=c(51,87,34,87), D=c(76,34,98,23))

The expected result would be:

> new.mRNA
B     A     D
56    25    13
89    76    65
12    23    23
452   45    16

I've tried converting the RPPA column names into a vector and then using it with the command mRNA[col.names.vector], as described here, but it doesn't work. It gives the error "undefined columns selected".

Is there a quick way to do it (without functions, loops etc.)?

Please check if you have leading/lagging spaces in your column names — akrun, Jan 15 '17 at 18:45
@deborah It is easy to check. `colnames(mRNA); colnames(RPPA)` — akrun, Jan 16 '17 at 12:19
@akrun I don't think I have spaces, but I do have numerous dots. example of column name: **TCGA.3C.AALI.01A.21.A43F.20**. Is that a problem? — Debby, Jan 16 '17 at 12:31
It could be a problem. Check whether the column names of both datasets have the same dots — akrun, Jan 16 '17 at 12:32
Yes, the dots are the same. I've added a link to the files, if you could maybe view it it would be very helpful. — Debby, Jan 16 '17 at 12:44
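
One way to run the check suggested in these comments is to compare the two sets of names directly instead of reading them by eye. The following is only a sketch on the toy data from the question; the extra column `H` is hypothetical and is added to RPPA purely to reproduce the error. `setdiff()` lists the RPPA names that have no exact match in mRNA, which are the names that trigger "undefined columns selected".

```r
# Toy data from the question
mRNA <- data.frame(A = c(25, 76, 23, 45), B = c(56, 89, 12, 452),
                   C = c(45, 456, 243, 5), D = c(13, 65, 23, 16),
                   E = 17:20, F = c(256, 34, 0, 5))
RPPA <- data.frame(B = c(46, 47, 45, 49), A = c(51, 87, 34, 87),
                   D = c(76, 34, 98, 23))

# Hypothetical extra column, added only to reproduce the error
RPPA$H <- c(1, 2, 3, 4)

# Names in RPPA that have no exact match in mRNA
missing.cols <- setdiff(colnames(RPPA), colnames(mRNA))
missing.cols
#> [1] "H"

# Selecting by all RPPA names fails while any name is missing
res <- try(mRNA[colnames(RPPA)], silent = TRUE)
inherits(res, "try-error")
#> [1] TRUE
```

If `missing.cols` is not empty on the real data, the names differ somewhere (stray whitespace, different punctuation, or samples that exist in only one file), even when they look identical when printed.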

score 1 · Answer 1 · answered Jan 17 '17 at 11:27

Neither of the posted answers worked for my data. Thanks to both of them, and with a little more research, I figured out the answer. First, you need to generate a vector that includes ONLY the column names that appear in BOTH data frames. To do that, I used intersect together with Reduce:

target <- Reduce(intersect, list(colnames(mRNA), colnames(RPPA)))

Now you can use the answer that was given:

new.mRNA <- mRNA[target]

and this will generate a new data frame with the right values.
Thank you @akrun and @Titolondon for your help
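
The approach in this answer can be sketched on the toy data from the question. The extra column `H` is hypothetical and is added to RPPA only to mimic a name that is missing from mRNA. Two details are worth noting: for just two vectors a plain `intersect()` call does the same job as `Reduce(intersect, list(...))`, and `intersect()` returns names in the order of its first argument, so passing the RPPA names first keeps the RPPA column order shown in the expected result.

```r
# Toy data from the question
mRNA <- data.frame(A = c(25, 76, 23, 45), B = c(56, 89, 12, 452),
                   C = c(45, 456, 243, 5), D = c(13, 65, 23, 16),
                   E = 17:20, F = c(256, 34, 0, 5))
RPPA <- data.frame(B = c(46, 47, 45, 49), A = c(51, 87, 34, 87),
                   D = c(76, 34, 98, 23))
RPPA$H <- c(1, 2, 3, 4)  # hypothetical column that is absent from mRNA

# Shared names, in RPPA order because the RPPA names come first
target <- intersect(colnames(RPPA), colnames(mRNA))
target
#> [1] "B" "A" "D"

# Keep only the shared columns of mRNA
new.mRNA <- mRNA[target]
new.mRNA
#>     B  A  D
#> 1  56 25 13
#> 2  89 76 65
#> 3  12 23 23
#> 4 452 45 16
```

With the arguments the other way round, `intersect(colnames(mRNA), colnames(RPPA))`, the same columns are selected but in mRNA order (A, B, D).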

score 1 · Answer 2 · simranpal kohli · 2020-05-02T01:53:58.193

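You can find the mRNA columns that do not appear in RPPA and drop them with dplyr, as in the code below. The remaining columns keep their mRNA order.

library(dplyr)  # for %>%, select() and all_of()
# names of the mRNA columns that do not appear in RPPA
col_name <- colnames(mRNA)[!(colnames(mRNA) %in% colnames(RPPA))]
# drop those columns from mRNA
new_mRNA <- mRNA %>% select(-all_of(col_name))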

answered May 02 '20 at 01:44 · edited May 02 '20 at 01:53

score 0 · Answer 3 · answered Jan 15 '17 at 17:30

We can subset 'mRNA' by the column names of 'RPPA' and assign the result back into 'RPPA', which overwrites its values in place:

RPPA[] <- mRNA[names(RPPA)]

answered Jan 15 '17 at 17:30 by akrun

score 0 · Answer 4 · answered Jan 15 '17 at 17:34

Subsetting a data.frame with a vector of column names should work.

Create a vector of the column names you want to keep
Subset your data.frame using this vector

mRNA <- data.frame(A=c(25,76,23,45), B=c(56,89,12,452), C=c(45,456,243,5), D=c(13,65,23,16), E=c(17:20), F=c(256,34,0,5))  
RPPA <- data.frame(B=c(46,47,45,49), A=c(51,87,34,87), D=c(76,34,98,23))  

mRNA
#>    A   B   C  D  E   F
#> 1 25  56  45 13 17 256
#> 2 76  89 456 65 18  34
#> 3 23  12 243 23 19   0
#> 4 45 452   5 16 20   5
RPPA
#>    B  A  D
#> 1 46 51 76
#> 2 47 87 34
#> 3 45 34 98
#> 4 49 87 23
mRNA[, names(RPPA)]
#>     B  A  D
#> 1  56 25 13
#> 2  89 76 65
#> 3  12 23 23
#> 4 452 45 16

answered Jan 15 '17 at 17:34 by cderv

How is this answer different from mine? – akrun Jan 15 '17 at 17:42
Thanks for the quick reply. Actually, both answers are not yet what I'm looking for. When I'm doing akrun's answer, all values in the RPPA data frame are becoming NA's. When I do Titolondon's answer, I get again the error described above (undefined columns selected). – Debby Jan 15 '17 at 17:48
@deborah I couldn't reproduce your NA's based on the example you provided. I am getting the expected output as you showed – akrun Jan 15 '17 at 17:58
@akrun The row names are different between mRNA and RPPA. Could that be the reason? – Debby Jan 15 '17 at 18:03
@deborah Please try the code on just the example you posted – akrun Jan 15 '17 at 18:04
@akrun It is working with the example, but not with my actual data... I can't figure what's the problem – Debby Jan 15 '17 at 18:44
