I need some help optimising a function. I'm aware that an apply function could speed this up, but I've never learned to use that family of functions properly and can't find advice online that I can follow easily to convert this ...
I have two data frames. They look like this (but each has >1 million rows).
df.DP:
CHR POS
12 463
12 5412
12 76123
431 276
431 200187
8521 23
8521 2001
df.mask:
CHR START END
10 67 876
12 4321 8724
12 8742 8910
277 10293 10599
8521 1068 3233
What I want do do is add a column to df.DP which indicates whether each row matching a CHR value from df.mask also has a POS value that is greater than df.mask START and lower than df.mask END. For example...
result:
CHR POS mask
12 463 0
12 5412 1
12 76123 0
431 276 0
431 200187 0
8521 23 0
8521 2001 1
This is the function I've written:
index.masked <- function(df.DP, df.mask){
#Create all 0s masked index column
df.DP$masked = 0
#Iterate over df
for(i in 1:nrow(df.DP)){
#Report progress
print(i)
#Check if df.DP$CHR[i] is in df.mask$CHR
if(df.DP$CHR[i] %in% df.mask$CHR){
#Check if ith SNP is within masked range
if(nrow(df.mask[which(df.mask$CHR == df.DP$CHR[i] &
df.mask$START < df.DP$POS[i] &
df.mask$END > df.DP$POS[i]),]) > 0){
#Report progress
print("Ding!")
#Set index column to 1
df.DP$masked[i] <- 1
}
}
}
#Return df.DP
return(df.DP)
}
Basically this is horribly slow. As I've said, each data frame has > 1 million rows. On top of that, I have multiple data frames on which I need to perform this operation.
If anyone could please show me how to make this faster I would be very grateful.
Here is some code to generate dummy data...
test.DP <- data.frame(CHR = c("12", "12", "23", "23", "23"),
POS = c(245, 6542, 12, 564, 1874))
test.mask <- data.frame(CHR = c("12", "13", "23"),
START = c(150, 717, 550),
END = c(270, 871, 599))
When I run my function on this dummy data it works fine.
test1 <- index.masked(test.DP, test.mask)
> test1
CHR POS masked
1 12 245 1
2 12 6542 0
3 23 12 0
4 23 564 1
5 23 1874 0
With > 1M rows in each data frame, this is too slow though.
I have searched extensively and while I can find plenty of posts here asking for help with similar problems, I'm so unfamiliar with the apply functions and other approaches to solving things that I just can't follow what's been done and apply (no pun intended) it to my own situation.
Any help would be deeply appreciated.