I am trying to go through each value in a data frame and based on that value extract information from another data frame. I have code that works for doing nested for loops but I am working with large datasets that run far too long for that to be feasible.
To simplify, I will provide sample data with initially only one row:
ind_1 <- data.frame("V01" = "pp", "V02" = "pq", "V03" = "pq")
ind_1
# V01 V02 V03
#1 pp pq pq
I also have this data frame:
stratum <- rep(c("A", "A", "B", "B", "C", "C"), 3)
locus <- rep(c("V01", "V02", "V03"), each = 6)
allele <- rep(c("p", "q"), 9)
value <- rep(c(0.8, 0.2, 0.6, 0.4, 0.3, 0.7, 0.5, 0.5, 0.6), 2)
df <- as.data.frame(cbind(stratum, locus, allele, value))
head(df)
# stratum locus allele value
#1 A V01 p 0.8
#2 A V01 q 0.2
#3 B V01 p 0.6
#4 B V01 q 0.4
#5 C V01 p 0.3
#6 C V01 q 0.7
There are two allele values for each locus and there are three values for stratum for every locus as well, thus there are six different values for each locus. The column name of ind_1
corresponds to the locus
column in df
. For each entry in ind_1
, I want to return a list of values which are extracted from the value column in df
based on the locus
(column name in ind_1
) and the data entry (pp
or pq
). For each entry in ind_1
there will be three returned values in the list, one for each of the stratum
in df
.
My attempted code is as follows:
library(dplyr)
library(magrittr)
pop.prob <- function(df, ind_1){
p <- df %>%
filter( locus == colnames(ind_1), allele == "p")
p <- as.numeric(as.character(p$value))
if( ind_1 == "pp") {
prob <- (2 * p * (1-p))
return(prob)
} else if ( ind_1 == "pq") {
prob <- (p^2)
return(prob)
}
}
test <- sapply(ind_1, function(x) {pop.prob(df, ind_1)} )
This code provides a matrix with incorrect values:
V01 V02 V03
[1,] 0.32 0.32 0.32
[2,] 0.32 0.32 0.32
[3,] 0.42 0.42 0.42
As well as the warning messages:
# 1: In if (ind_1 == "pp") { :
# the condition has length > 1 and only the first element will be used
Ideally, I would have the following output:
> test
# $V01
# 0.32 0.48 0.42
#
# $V02
# 0.25 0.36 0.04
#
# $V03
# 0.16 0.49 0.25
I've been trying to figure out how to NOT use for
loops in my code because I've been using nested for loops that take an exorbitant amount of time. Any help in figuring out how to do this for this simplified data set would be greatly appreciated. Once I do that I can work on applying this to a data frame such as ind_1
that has multiple rows
Thank you all, please let me know if the example data are not clear
EDIT
Here's my code that works with a for
loop:
pop.prob.for <- function(df, ind_1){
prob.list <- list()
for( i in 1:length(ind_1)){
p <- df %>%
filter( locus == colnames(ind_1[i]), allele == "p")
p <- as.numeric(as.character(p$value))
if( ind_1[i] == "pp") {
prob <- (2 * p * (1-p))
} else if ( ind_1[i] == "pq") {
prob <- (p^2)
}
prob.list[[i]] <- prob
}
return(prob.list)
}
pop.prob.for(df, ind_1)
For my actual data, I would be adding an additional loop to go through multiple rows within a data frame similar to ind_1
and save each of the iterations of lists produced as an .rdata file