I have a simple for loop that matches probe IDs and pulls the corresponding values from another matrix. It is a bit slow for a large number of rows, so I am trying to convert it into a function and use `apply` instead. But I am not getting the same result as with the for loop. Can someone tell me what I am doing wrong? Thanks

Here is the for loop:

exp_target_com = structure(list(X06...2239_normal = c(12.2528814946075,  8.25298920937508), X06...2239_tumor = c(12.476021286337, 6.08504757235585), Ensembl_Id = structure(c(NA_integer_, 
NA_integer_), .Label = "", class = "factor"), HGNC = structure(c(NA_integer_, 
NA_integer_), .Label = "", class = "factor")), .Names = c("X06...2239_normal", "X06...2239_tumor", "Ensembl_Id", "HGNC"), class = "data.frame", row.names = c("A_23_P117082", "A_33_P3246448"))

head(exp_target_com)
#>               X06...2239_normal X06...2239_tumor Ensembl_Id HGNC
#> A_23_P117082          12.252881        12.476021       <NA> <NA>
#> A_33_P3246448          8.252989         6.085048       <NA> <NA>


probe_anno = structure(c("A_23_P117082", "A_33_P3246448", "NM_015987", "NM_080671", "NM_015987", "NM_080671", "ENSG00000013583", "ENSG00000152049", 
"HEBP1", "KCNE4"), .Dim = c(2L, 5L), .Dimnames = list(c("44693", 
"31857"), c("Probe.ID", "SystematicName", "refseq_biomart", "Ensembl_Id", 
"HGNC")))

probe_anno
#>            Probe.ID SystematicName refseq_biomart      Ensembl_Id  HGNC
#> 44693  A_23_P117082      NM_015987      NM_015987 ENSG00000013583 HEBP1
#> 31857 A_33_P3246448      NM_080671      NM_080671 ENSG00000152049 KCNE4



for(i in 1:nrow(exp_target_com)) {
  pos <- which(as.character(probe_anno$Probe.ID) == rownames(exp_target_com)[i])
  if(length(pos) > 0) {
    exp_target_com[i,3] <- as.character(probe_anno$Ensembl_Id)[pos[1]]
    exp_target_com[i,4] <- as.character(probe_anno$HGNC)[pos[1]]
  }
}

Here is the function and the `apply` call:

get_anno <- function(data_row, probe_anno) {
  pos <- which(as.character(probe_anno$Probe.ID) == rownames(data_row))
  if (length(pos) > 0) {
    data_row$Ensembl_Id <- as.character(probe_anno$Ensembl_Id)[pos[1]]
    data_row$HGNC <- as.character(probe_anno$HGNC)[pos[1]]
  }
  return(data_row) 
}

apply(exp_target_com, c(1,2), FUN = function(x) get_anno(x, probe_anno))
Rekyt
Konika Chawla
  • If you could provide a [minimal reproducible example](https://stackoverflow.com/a/5963610/2359523), that would help to pin down the issue. – Anonymous coward Oct 11 '18 at 14:52
  • This will be hard to help you with unless you provide an example dataset that we can use to test your code and our solutions. – see24 Oct 11 '18 at 14:53
  • Aside from the need to provide an mre, it is a common misunderstanding that `apply` should be faster than a `for` loop. – dww Oct 11 '18 at 15:02
  • ok, I have taken a subset and given dput output, is that ok? – Konika Chawla Oct 11 '18 at 15:17
  • Are you trying to loop over the data.frame row by row? If so, you want to set the `MARGIN=` argument for `apply` to `1`. You have it set to `c(1,2)`, which applies it to each **cell** in the table – divibisan Oct 11 '18 at 15:20
  • We also need `probe_anno` to reproduce this data – divibisan Oct 11 '18 at 15:24
  • Hi, yes, I need the values of Ensembl_Id and HGNC for each probe id, in the table exp_target_com. I have tried MARGIN values 1 and 2 and c(1,2) none of them are giving me the values. While for-loop is fetching the right values – Konika Chawla Oct 11 '18 at 15:25
  • Added the required rows from probe_anno – Konika Chawla Oct 11 '18 at 15:35
  • I don't think `apply` will be any faster than `for`. That's not what will solve your problems. My impression is that what you want is `merge` ... I might be wrong as you haven't said exactly what you want to achieve, but you are adding something from one data frame to another, and this is what `merge` is for. – lebatsnok Oct 11 '18 at 15:45
  • ... alternatively you might use `match`. Btw, I think it might be useful to `dput` slightly more rows than 2, maybe 10 or 20. And `dput` the `probe_anno` without `as.matrix`. When I ran your `for` code, I got warnings on invalid factor levels, so you might want to make sure that you have character vectors rather than factors in your data frames. And a short verbal description of what you want to achieve might help others to help you. (You're concentrating on technicalities such as replacing `for` with `apply`, but `apply` uses `for` internally, so that's unlikely to help.) – lebatsnok Oct 11 '18 at 15:50
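
The two points raised in the comments can be sketched together. This is a minimal illustration, not the asker's exact data: it assumes `probe_anno` is a data frame with character columns (the `dput` in the question is a matrix, on which `$` does not work), rounds the expression values, and uses a hypothetical helper name `idx`. It first shows why `get_anno` never finds a match inside `apply`, then does the lookup in one vectorised step with `match`.

```r
# Toy data mirroring the question (values rounded, character columns instead of factors)
exp_target_com <- data.frame(
  X06...2239_normal = c(12.25, 8.25),
  X06...2239_tumor  = c(12.48, 6.09),
  Ensembl_Id = NA_character_,
  HGNC       = NA_character_,
  row.names  = c("A_23_P117082", "A_33_P3246448"),
  stringsAsFactors = FALSE
)

# Assumption: probe_anno as a data frame rather than the matrix given by dput
probe_anno <- data.frame(
  Probe.ID   = c("A_23_P117082", "A_33_P3246448"),
  Ensembl_Id = c("ENSG00000013583", "ENSG00000152049"),
  HGNC       = c("HEBP1", "KCNE4"),
  stringsAsFactors = FALSE
)

# apply() converts the data frame to a matrix and passes each row to FUN as a
# plain named vector. Such a vector has no row names, so rownames(data_row)
# inside get_anno() is NULL and the comparison with Probe.ID never matches.
rownames_missing <- apply(exp_target_com, 1, function(x) is.null(rownames(x)))

# Vectorised replacement for the loop: match() returns, for every row name,
# the position of its first occurrence in Probe.ID (NA if there is none).
idx   <- match(rownames(exp_target_com), probe_anno$Probe.ID)
found <- !is.na(idx)
exp_target_com$Ensembl_Id[found] <- probe_anno$Ensembl_Id[idx[found]]
exp_target_com$HGNC[found]       <- probe_anno$HGNC[idx[found]]
```

Like the loop, this takes the first matching annotation row for each probe and leaves rows without a match untouched, but it avoids the per-row `which` scan.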

1 Answer

Agreeing with the comments, it will be simpler and faster to use a built-in function like `merge` or the equivalent dplyr join functions. Here I convert the row names to a column and use that column to join with `probe_anno`.

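library(dplyr)
exp_target_com2 <- exp_target_com %>% 
  select(-3, -4) %>%                           # drop the empty Ensembl_Id and HGNC columns
  tibble::rownames_to_column("Probe.ID") %>%   # move the probe IDs from row names into a column
  # probe_anno is a character matrix here, so convert it to a data frame before joining
  left_join(as.data.frame(probe_anno, stringsAsFactors = FALSE), by = "Probe.ID")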



> exp_target_com2
       Probe.ID X06...2239_normal X06...2239_tumor SystematicName refseq_biomart      Ensembl_Id  HGNC
1  A_23_P117082         12.252881        12.476021      NM_015987      NM_015987 ENSG00000013583 HEBP1
2 A_33_P3246448          8.252989         6.085048      NM_080671      NM_080671 ENSG00000152049 KCNE4
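
For readers who prefer to stay in base R, the same join can be sketched with `merge`. This is an illustrative version under stated assumptions: the data are a rounded toy copy of the question's two rows, `probe_anno` is built directly as a character data frame, and `expr` and `merged` are hypothetical names.

```r
# Toy copy of the expression table from the question (values rounded)
exp_target_com <- data.frame(
  X06...2239_normal = c(12.25, 8.25),
  X06...2239_tumor  = c(12.48, 6.09),
  row.names = c("A_23_P117082", "A_33_P3246448")
)

# Assumption: annotation table as a data frame with character columns
probe_anno <- data.frame(
  Probe.ID   = c("A_23_P117082", "A_33_P3246448"),
  Ensembl_Id = c("ENSG00000013583", "ENSG00000152049"),
  HGNC       = c("HEBP1", "KCNE4"),
  stringsAsFactors = FALSE
)

# merge() joins on columns, so copy the row names into a key column first
expr <- exp_target_com
expr$Probe.ID <- rownames(expr)

# all.x = TRUE keeps probes without an annotation (a left join);
# their Ensembl_Id and HGNC are filled with NA
merged <- merge(expr, probe_anno, by = "Probe.ID", all.x = TRUE)
```

Note that `merge` does not promise to keep the original row order, so look rows up by `Probe.ID` rather than by position afterwards.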
Jon Spring