0

I am trying to go through each value in a data frame and based on that value extract information from another data frame. I have code that works for doing nested for loops but I am working with large datasets that run far too long for that to be feasible.

To simplify, I will provide sample data with initially only one row:

ind_1 <- data.frame("V01" = "pp", "V02" = "pq", "V03" = "pq")
ind_1
#  V01 V02 V03
#1 pp  pq  pq

I also have this data frame:

stratum <- rep(c("A", "A", "B", "B", "C", "C"), 3)
locus <- rep(c("V01", "V02", "V03"), each = 6)
allele <- rep(c("p", "q"), 9)
value <- rep(c(0.8, 0.2, 0.6, 0.4, 0.3, 0.7, 0.5, 0.5, 0.6), 2)
df <- as.data.frame(cbind(stratum, locus, allele, value))
head(df)
#   stratum locus allele value
#1        A   V01      p   0.8
#2        A   V01      q   0.2
#3        B   V01      p   0.6
#4        B   V01      q   0.4
#5        C   V01      p   0.3
#6        C   V01      q   0.7

There are two allele values for each locus and there are three values for stratum for every locus as well, thus there are six different values for each locus. The column name of ind_1 corresponds to the locus column in df. For each entry in ind_1, I want to return a list of values which are extracted from the value column in df based on the locus(column name in ind_1) and the data entry (pp or pq). For each entry in ind_1 there will be three returned values in the list, one for each of the stratum in df.

My attempted code is as follows:

library(dplyr)
library(magrittr)
pop.prob <- function(df, ind_1){
  p <-  df %>%
    filter( locus == colnames(ind_1), allele == "p")
  p <- as.numeric(as.character(p$value))
  if( ind_1 == "pp") {
    prob <- (2 * p * (1-p))
    return(prob)
  } else if ( ind_1 == "pq") {
    prob <- (p^2)
    return(prob)
  } 
}
test <- sapply(ind_1, function(x) {pop.prob(df, ind_1)} )

This code provides a matrix with incorrect values:

      V01  V02  V03
[1,] 0.32 0.32 0.32
[2,] 0.32 0.32 0.32
[3,] 0.42 0.42 0.42

As well as the warning messages:

# 1: In if (ind_1 == "pp") { :
# the condition has length > 1 and only the first element will be used

Ideally, I would have the following output:

> test
# $V01
# 0.32 0.48 0.42
#
# $V02
# 0.25 0.36 0.04
#
# $V03
# 0.16 0.49 0.25

I've been trying to figure out how to NOT use for loops in my code because I've been using nested for loops that take an exorbitant amount of time. Any help in figuring out how to do this for this simplified data set would be greatly appreciated. Once I do that I can work on applying this to a data frame such as ind_1 that has multiple rows

Thank you all, please let me know if the example data are not clear

EDIT

Here's my code that works with a for loop:

pop.prob.for <- function(df, ind_1){
  prob.list <- list()
  for( i in 1:length(ind_1)){
    p <-  df %>%
      filter( locus == colnames(ind_1[i]), allele == "p")
    p <- as.numeric(as.character(p$value))
    if( ind_1[i] == "pp") {
      prob <- (2 * p * (1-p))
    } else if ( ind_1[i] == "pq") {
      prob <- (p^2)
    } 
    prob.list[[i]] <- prob
  }
  return(prob.list)
}
pop.prob.for(df, ind_1)

For my actual data, I would be adding an additional loop to go through multiple rows within a data frame similar to ind_1 and save each of the iterations of lists produced as an .rdata file

dww
  • 30,425
  • 5
  • 68
  • 111
Mgdesaix
  • 113
  • 6
  • There's no reason to assume that `apply` loops will be any faster than `for` - they are also loops, just with a different syntax. E.g. see [here](https://stackoverflow.com/questions/7142767/why-are-loops-slow-in-r/7142982#7142982) and [here](https://stackoverflow.com/questions/5533246/why-is-apply-method-slower-than-a-for-loop-in-r). Could you add your working code with a `for` loop to the question, then we can see if it can be vectorised or otherwise improved, or failing that if you need an RCpp solution. – dww Mar 01 '18 at 19:59
  • @dww I added the `for` loop. I've been struggling with slow code from using nested `for` loops...for example, having code run 24 hours that takes someone else's code 2 hours. Maybe `apply` isn't necessarily the answer, but I'm trying to make my code more efficient. Thanks! – Mgdesaix Mar 01 '18 at 20:34
  • your `apply` code works perfectly for me with the example data. Are you sure everything is loaded correctly? – Pdubbs Mar 01 '18 at 20:35
  • @Pdubbs, I edited my output...I had originally posted it produces null values, but it is actually producing a matrix with incorrect values – Mgdesaix Mar 01 '18 at 20:43
  • Thanks for pointing that out. Answer below. For truly optimizing your speed, I think you'll want to write the function a little differently, but this should clarify how `apply` works and give you a better start – Pdubbs Mar 01 '18 at 20:59

2 Answers2

1

There are two issues with your code. One is that you're apply function is operating on the wrong object, and the other is that you can't access the name of an element through sapply

Right now sapply(ind_1, function(x) {pop.prob(df, ind_1)}) is saying "for each element of ind_1 do pop.prob using df and all of ind_1", hence the incorrect matrix output. To operate element-wise on ind_1 you would write sapply(ind_1, function(x) {pop.prob(df, ind_1)})

This change doesn't work because you extract a column name in your function, and "pp" (the first element) has no column name. To use your function as written, you would need to write:

test <- sapply(1:dim(ind_1)[2], function(x) {pop.prob(df, ind_1[x])})

This way you're iterating in the same manner as your for loop. Note also that you're getting a matrix because sapply attempts to coerce lapply output to a vector or a matrix. If you want a list, simply use lapply

Pdubbs
  • 1,967
  • 2
  • 11
  • 20
  • Thanks for succinct explanation of what `sapply` was attempting to do with my code. Though it appears the two examples in your second paragraph are the same thing. If `ind_1` has multiple rows, would it be effective to put your `sapply` function in a `for` to apply to each row, and save each output? Or would it be better to somehow use the `sapply` inside of another `apply`? Thanks again! – Mgdesaix Mar 01 '18 at 21:07
  • I don't know the specifics of your data, but the solution is probably to rewrite `prob.list` (possibly using apply) so that it will act on an entire column of `ind_1` at once, and then use that as a function you apply across the columns of `ind_1` – Pdubbs Mar 01 '18 at 21:11
0

Here's a vectorised data.table solution. Should be much faster than the apply or for versions. Not to mention far more succinct.

library(data.table)

setDT(df)[, value := as.numeric(as.character(value))]
df[allele=='p', 
     .(prob = {if (ind_1[.GRP]=='pp') 2*value*(1-value) else value^2}), 
     by = locus]

#    locus prob
# 1:   V01 0.32
# 2:   V01 0.48
# 3:   V01 0.42
# 4:   V02 0.25
# 5:   V02 0.36
# 6:   V02 0.04
# 7:   V03 0.16
# 8:   V03 0.49
# 9:   V03 0.25
dww
  • 30,425
  • 5
  • 68
  • 111