1

I have a data frame, and two of the columns are indices for another data fame. I want to add a column to the first by indexing the second, but just calling the column names isn't working. For example, if the first data frame is :

...  Gene    CellLine ...
     KRAS    HELA     ...
     BRCA1   T24      ...

and my second dataframe looks like

        KRAS   BRCA1 ...
HELA    5      3
T24     2      1
...

I want the output to look like

...  Gene   CellLine   Dependency ...
     KRAS   HELA       5          ...
     BRCA1  T24        1          ...

without having to loop through the lines because the first data frame is massive. That is, is there any function or package that would do the equivalent to

for (i in rownames(table1)){
  table1[i, dependency] <- ifelse(table1[i,"Gene"] %in% rownames(table2) & table1[i,"CellLine"] %in% colnames(table2), table2[table1[i,"Gene"],table1[i,"CellLine"]], NA)
}

but faster?

Thanks!

  • [Reshape](https://stackoverflow.com/questions/2185252/reshaping-data-frame-from-wide-to-long-format) your second data frame from wide to long format and then [join](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) on Gene and CellLine. – Shubham Pujan Jul 21 '20 at 17:32

2 Answers2

1

The following code is vectorized, it creates an index matrix with the two columns from df1 and uses it to extract the required values from df2.

inx <- as.matrix(df1[c("CellLine", "Gene")])
df1$Dependency <- df2[inx]

df1
#   Gene CellLine Dependency
#1  KRAS     HELA          5
#2 BRCA1      T24          1

Data

df1 <- read.table(text = "
Gene    CellLine 
KRAS    HELA
BRCA1   T24 
", header = TRUE)

df2 <- read.table(text = "
        KRAS   BRCA1
HELA    5      3
T24     2      1
", header = TRUE)
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • Works great, thanks. For anyone who wants to add the ifelse condition, I just separated the table1 into rows that had values in the table2 index and those that didn't, and did roughly : df1_no_dependency$dependency <- NA and df1_w_dependency$dependency <- df2[inx] and then rbind(df1_no_dependency, df1_w_dependency) – Kamran Elahi Jul 21 '20 at 19:09
  • @KamranElahi If your first data.frame is massive, then this solution is about as fast as it gets. Unlike `ifelse`, a notoriously slow function. The drawback is the extra object needed, consuming more memory. – Rui Barradas Jul 21 '20 at 19:12
0

You can try this approach. The data used is next:

#Data
df1 <- structure(list(Gene = c("KRAS", "BRCA1"), CellLine = c("HELA", 
"T24")), class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(id = c("HELA", "T24"), KRAS = c(5L, 2L), BRCA1 = c(3L, 
1L)), class = "data.frame", row.names = c(NA, -2L))

Then the code, you can melt and merge data:

library(reshape)
#Melt df2 
Melted <- melt(df2,id.vars = 'id')
#Now merge
Merged <- merge(df1,Melted,by.x=c('Gene','CellLine'),by.y=c('variable','id'),all.x=T)

The result would be next:

   Gene CellLine value
1 BRCA1      T24     1
2  KRAS     HELA     5
Duck
  • 39,058
  • 13
  • 42
  • 84