I am trying to implement a function that looks up values in one table based on another. The actual data frames have more than 50,000 observations, so the nested for loop below is not practical. I've been searching SO for the past few days for something that works, but haven't found it. My data is in no particular order (individuals, segments, etc.), so the solution needs to work even when rows are out of order.
Here are toy examples of my data to work with:
region_map <- data.frame(Start = c(721290, 1688193), End = c(1688192, 2926555))
individual <- c("Ind1","Ind2","Ind3","Ind4")
segment <- data.frame(SampleID = c("Ind1","Ind1","Ind2","Ind2","Ind3","Ind3","Ind4","Ind4","Ind4"),
                      Start = c(721290, 1688194, 721290, 1688200, 721290, 2926600, 721290, 1688193, 690),
                      End = c(1688192, 2926555, 1688190, 2900000, 2926555, 3000000, 1500000, 2005000, 500000),
                      State = c(1,2,2,5,4,2,2,6,5))
And here's a simplified example of what I'm trying to do:
Generate.FullSegmentList <- function(segments, individuals, regionmap){
  FullSegments <- data.frame()
  for(region in 1:nrow(regionmap)){
    for(ind in individuals){
      # Segments of this individual that overlap this region
      hit <- segments[segments$Start <= regionmap$End[region] &
                      segments$End >= regionmap$Start[region] &
                      segments$SampleID == ind , ]
      # If there is no segment overlapping that region for that individual, use State 3
      if(nrow(hit) == 0){
        Temp <- data.frame(SampleID = ind,
                           Start = regionmap$Start[region],
                           End = regionmap$End[region],
                           State = 3)
      }
      # If there is a segment overlapping that region for that individual, use its State
      if(nrow(hit) == 1){
        Temp <- data.frame(SampleID = ind,
                           Start = regionmap$Start[region],
                           End = regionmap$End[region],
                           State = hit$State)
      }
      # Append this row to the result
      FullSegments <- rbind(FullSegments, Temp)
    }
  }
  FullSegments
}
In words, I need to look at each region (~53,000) and assign it a value for each individual (State, or 3 if no segment exists), and then create a new data.frame with every region for every individual. To do this, I loop through the regions and then the individuals, find the segment (there are ~25,000 of these) that overlaps the region, and append the result to the table.
Here is the output I expect from the toy data above:
SampleID Start End State
Ind1 721290 1688192 1
Ind1 1688193 2926555 2
Ind2 721290 1688192 2
Ind2 1688193 2926555 5
Ind3 721290 1688192 4
Ind3 1688193 2926555 4
Ind4 721290 1688192 2
Ind4 1688193 2926555 6
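For reference, the mapping described in words above can also be written without the inner loop over regions. This is only a minimal sketch, not a tested solution for the full data: the function name map_states is made up for illustration, and it assumes that the segments of a single individual do not overlap one another, so at most one segment needs to be picked per region. It splits the segments by individual once, then uses base R's findInterval to locate, for every region at once, the last segment that starts at or before the region's end.

```r
# Toy data from the question
region_map <- data.frame(Start = c(721290, 1688193), End = c(1688192, 2926555))
individual <- c("Ind1", "Ind2", "Ind3", "Ind4")
segment <- data.frame(
  SampleID = c("Ind1","Ind1","Ind2","Ind2","Ind3","Ind3","Ind4","Ind4","Ind4"),
  Start = c(721290, 1688194, 721290, 1688200, 721290, 2926600, 721290, 1688193, 690),
  End = c(1688192, 2926555, 1688190, 2900000, 2926555, 3000000, 1500000, 2005000, 500000),
  State = c(1, 2, 2, 5, 4, 2, 2, 6, 5))

# Hypothetical helper (name and 'default' argument are assumptions for this sketch).
# Assumes one individual's segments do not overlap each other.
map_states <- function(segments, individuals, regionmap, default = 3) {
  # Split once so each individual's segments are found without rescanning the table
  by_ind <- split(segments, segments$SampleID)
  pieces <- lapply(individuals, function(ind) {
    state <- rep(default, nrow(regionmap))
    s <- by_ind[[ind]]
    if (!is.null(s) && nrow(s) > 0) {
      s <- s[order(s$Start), ]  # findInterval needs sorted break points
      # Index of the last segment starting at or before each region's end (0 if none)
      idx <- findInterval(regionmap$End, s$Start)
      # That segment overlaps the region only if it also ends at or after the region's start
      hit <- idx > 0
      hit[hit] <- s$End[idx[hit]] >= regionmap$Start[hit]
      state[hit] <- s$State[idx[hit]]
    }
    data.frame(SampleID = ind, Start = regionmap$Start,
               End = regionmap$End, State = state)
  })
  # Bind the per-individual pieces once at the end
  do.call(rbind, pieces)
}

result <- map_states(segment, individual, region_map)
print(result)
```

On the toy data this reproduces the table above (State column 1, 2, 2, 5, 4, 4, 2, 6). The input order of the segments does not matter, because each individual's segments are sorted inside the function. If several segments of one individual can overlap the same region, the sketch silently picks the one that starts last, which may not be what is wanted.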
This function as-is works exactly how I need it to, except that it will take a VERY long time to run (extrapolating from system.time, I estimate it would take over 3 months). I know there must be a better way to do this. I've tried using the apply functions, and I saw in some other questions the suggestion to use lists instead of a data.frame. I also saw that there are data.table and plyr options to simplify this. I've tried these, but haven't managed to get them to work with the nested loop and its if statements.
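As I understand the "lists instead of a data.frame" suggestion, the idea is that growing a data.frame inside a loop copies the accumulated rows on every iteration, whereas a pre-allocated list can be filled in place and bound once at the end. A minimal sketch of that pattern, with a made-up helper name (collect_rows) and dummy row contents:

```r
# Hypothetical helper: builds n one-row data frames and binds them once at the end
collect_rows <- function(n) {
  rows <- vector("list", n)  # pre-allocate the list
  for (i in seq_len(n)) {
    # Dummy row; in the real function this would be the row built for one region/individual
    rows[[i]] <- data.frame(id = i, State = 3)
  }
  do.call(rbind, rows)       # single bind instead of n incremental ones
}

out <- collect_rows(5)
```

This only addresses the cost of appending; the repeated subsetting of the segments table inside the loops is a separate cost.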
I would appreciate an explanation of any answers given, as this is the first time I've written anything this complex.
Questions I think are relevant:
Many other questions on nested for loops involve calculations that fit an apply function well (e.g. apply(df, 1, function(x){ mean(x) })), but I haven't been able to adapt that to mapping values from one data.frame to another.