1

I am trying to implement a function to get values from one table based on another. The actual dataframes have > 50,000 observations, so implementing this nested for loop is not effective. I've been trying to look through SO the past few days to find something that works, but haven't been able to. My data is in no particular order (individuals, segments, etc), so it needs to be able to work even if things are out of order.

Here are toy examples of my data to work with:

region_map <- data.frame(Start = c(721290, 1688193), End= c(1688192, 2926555))
individual <- c("Ind1","Ind2","Ind3","Ind4")
segment <- data.frame(SampleID = c("Ind1","Ind1","Ind2","Ind2","Ind3","Ind3","Ind4","Ind4","Ind4"),
                      Start = c(721290, 1688194, 721290, 1688200, 721290, 2926600, 721290, 1688193, 690),
                      End = c(1688192, 2926555,1688190, 2900000, 2926555, 3000000, 1500000, 2005000, 500000),
                      State = c(1,2,2,5,4,2,2,6,5))

And here's a simplified example of what I'm trying to do:

Generate.FullSegmentList <- function(segments, individuals, regionmap){
     FullSegments <- data.frame()
     for(region in 1:nrow(regionmap)){

          for(ind in individuals){
               # If there is not a segment within that region for that individual
               if(nrow(
                    segments[segments$start >= regionmap$Start[region] & 
                                  segments$End <= regionmap$End[region] &
                                  segments$SampleID == ind , ]
               ) == 0){
                    Temp <- data.frame(SampleID = ind, 
                                       Start = regionmap$Start[region],
                                       End = regionmap$End[region],
                                       State = 3
                    )
               }
               # If there is a segment within that region for that individual
               if(nrow(
                    segments[segments$Start >= regionmap$Start[region] & 
                                  segments$End <= regionmap$End[region] &
                                  segments$SampleID == ind , ]
               ) == 1){
                    Temp <- data.frame(SampleID = segments$SampleID, 
                                       Start = regionmap$Start[region],
                                       End = regionmap$End[region],
                                       State = segments$State[segments$Start >= regionmap$Start[region] & 
                                                                  segments$SampleID == ind ]
                    )
               }
               FullSegments <- list(FullSegments, Temp)              
          }
     }
     FullSegments
}

In words, I need to look at each region (~53,000) and assign a value (State, if none exists, give value of 3) to the region for each individual, and then create a new data.frame with every region for every individual. To do this, I'm looping through the regions and then the individuals, finding a segment (there are ~25,000 of these) that overlaps with the region and then appending it to the table.

Here is what the output from the above toy data would give:

SampleID       Start       End        State
Ind1          721290      1688192      1
Ind1          1688193     2926555      2
Ind2          721290      1688192      2
Ind2          1688193     2926555      5
Ind3          721290      1688192      4
Ind3          1688193     2926555      4
Ind4          721290      1688192      2
Ind4          1688193     2926555      6

This function as-is works exactly how I need it to, except that it will take a VERY long time to run (using system.time, I got that it would take over 3 months to run). I know there must be a better way to do this. I've tried implementing apply functions, and I saw in some other questions to use lists instead of a data.frame. I also saw that there are data.table and plyr options to simplify this. I've tried these but haven't been successful at getting it to work with the nested loop with if statements.

I would appreciate an explanation of any answers given, as this is the first time I've written anything this complex.

Questions I think are relevant:

Many other questions on nested for loops involve doing calculations that work well for doing an apply function (e.g. apply(df, 1, function(x){ mean(x) } ), but I haven't been able to adopt that to mapping values from data.frame to data.frame.

Community
  • 1
  • 1
Gaius Augustus
  • 940
  • 2
  • 15
  • 37

3 Answers3

2

The Bioconductor package IRanges works on 'integer ranges' like the region and segment start and end coordinates. Install the package with

source("https://bioconductor.org/biocLite.R")
biocLite("IRanges")

Load it and create a representation of the ranges of interest

library(IRanges)
r <- with(region_map, IRanges(Start, End))
s <- with(segments, IRanges(Start, End))

The result so far is

> r
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]    721290   1688192    966903
  [2]   1688193   2926555   1238363
> s
IRanges object with 9 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]    721290   1688193    966904
  [2]   1688194   2926555   1238362
  [3]    721290   1688190    966901
  [4]   1688200   2900000   1211801
  [5]    721290   2926555   2205266
  [6]   2926600   3000000     73401
  [7]    721290   1500000    778711
  [8]   1688193   2005000    316808
  [9]       690    500000    499311

You're interested in finding the overlaps between the 'query' segments and the 'subject' region_map

olaps <- findOverlaps(s, r)

giving

> olaps
Hits object with 9 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         1           2
  [3]         2           2
  [4]         3           1
  [5]         4           2
  [6]         5           1
  [7]         5           2
  [8]         7           1
  [9]         8           2
  -------
  queryLength: 9 / subjectLength: 2

This will scale well to millions of overlaps.

You said you're interested in the state of all individuals in all regions, and from your code it looks like an individual not in a region has state 3. I created a matrix with all state 3

state <- matrix(3, nrow(region_map), length(individual),
                dimnames=list(NULL, individual))

then created a two-column index into the matrix based on the overlaps we found

idx <- matrix(c(subjectHits(olaps),
                match(segments$SampleID[queryHits(olaps)], individual)),
              ncol=2)

and used the index matrix to update the state

state[idx] <- segments$State[queryHits(olaps)]

This actually summarizes your desired result -- the state in each region x individual combination. One possible issue is when two segments of the same individual overlap a single region, and the segments have different state; only one state will be assigned.

> state
     Ind1 Ind2 Ind3 Ind4
[1,]    1    2    4    2
[2,]    2    5    4    6

Cast it as a data.frame with, e.g.,

data.frame(SampleID=colnames(state)[col(state)],
           Start=region_map[row(state), "Start"],
           End=region_map[row(state), "End"],
           State=as.vector(state))
Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
  • This works for me, and I'm able to understand and modify it for my real data. As an aside, I had to use the GenomicRanges package for my data because I also had chromosome information. It took me a while to understand everything, but thank you for the very thorough and helpful explanations! – Gaius Augustus Mar 24 '16 at 20:31
  • Oh, and I used system.time to time this: user: 0.46, system: 0.06, elapsed: 0.51. Pretty amazing. – Gaius Augustus Mar 24 '16 at 20:50
  • 1
    @GaiusAugustus Sounds like you've had a productive day; if your questions are Bioconductor related it's better to pose them on the [Bioconductor support site](https://support.bioconductor.org) – Martin Morgan Mar 24 '16 at 23:38
  • This worked for me for a while, but after updating data.table, suddenly I'm getting an error: Error in `[.data.table (regionmap, row(state), "Chr") : i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please let datatable-help know if you'd like this, or add your comments to FR #657.` Any ideas how to fix it? I really liked the way this code worked. – Gaius Augustus Apr 14 '16 at 02:40
  • @GaiusAugustus my answer doesn't use data.table; maybe you meant to comment on another answer? Or perhaps more clearly indicate where the problem arises with the code that generates it. – Martin Morgan Apr 14 '16 at 09:40
  • That's what I thought when I looked over the code. I didn't see any data.table reference. Thanks for the reply. I probably just broken something modifying the code. – Gaius Augustus Apr 14 '16 at 19:24
1

You have a lot of lines in your code that read nrow(some-subset-of-your-data). You would see a quick performance increase if you switched these to sum(the-conditions). For example:

Turn:

nrow(segments[segments$start >= regionmap$Start[region] & 
                                   segments$End <= regionmap$End[region] &
                                  segments$SampleID == ind , ]) == 0

Into

sum(segments$start >= regionmap$Start[region] & 
                                   segments$End <= regionmap$End[region] &
                                  segments$SampleID == ind) == 0

This way, R doesn't store your subsetted data frame in memory every time.

In addition, store this operation as a boolean so you only need to call it once in each loop.

isEmpty <- sum(segments$start >= regionmap$Start[region] & 
                                   segments$End <= regionmap$End[region] &
                                  segments$SampleID == ind) == 0

if(isEmpty){
### do something
} else if(!isEmpty) {
### do something else
}
Chris Conlan
  • 2,774
  • 1
  • 19
  • 23
0

I don't think you need anything 'this complex'. You can do everything you're after with a couple of joins. In this case I'll be using data.table.

You've asked for an explanation of any answer, however, for this I can't do better than point you in the direction of the data.table homepage. It will be important to understand what the set* and := commands do and how 'update by-reference' works.

Set your data to data.tables.

library(data.table)

dt_individual <- data.table(SampleID = individual)
dt_region <- data.table(region_map)
dt_segment <- data.table(segment)

Just join it all together

## Change some column names of `dt_segment` so we can identify them after the joins
setnames(dt_segment, c("Start", "End"), c("seg_Start", "seg_End"))

## create a 'key_col' to join all the individuals to the regions
dt_join <- dt_individual[, key_col := 1][ dt_region[, key_col := 1], on="key_col", allow.cartesian=T][, key_col := NULL]
#    SampleID   Start     End
# 1:     Ind1  721290 1688192
# 2:     Ind2  721290 1688192
# 3:     Ind3  721290 1688192
# 4:     Ind4  721290 1688192
# 5:     Ind1 1688193 2926555
# 6:     Ind2 1688193 2926555
# 7:     Ind3 1688193 2926555
# 8:     Ind4 1688193 2926555

Now use the foverlaps function to find the overlapping regions

setkey(dt_join, SampleID, Start, End)
setkey(dt_segment, SampleID, seg_Start, seg_End)

foverlaps(dt_join,
          dt_segment,
          type="any")

#    SampleID seg_Start seg_End State   Start     End
# 1:     Ind1    721290 1688192     1  721290 1688192
# 2:     Ind1   1688194 2926555     2 1688193 2926555
# 3:     Ind2    721290 1688190     2  721290 1688192
# 4:     Ind2   1688200 2900000     5 1688193 2926555
# 5:     Ind3    721290 2926555     4  721290 1688192
# 6:     Ind3    721290 2926555     4 1688193 2926555
# 7:     Ind4    721290 1500000     2  721290 1688192
# 8:     Ind4   1688193 2005000     6 1688193 2926555

To see all the data (i.e. both those that fall within the regions and those that don't), you can do a cartesian join, and then assign values to those within the region and those outside it as you wish

dt_join[dt_segment, on="SampleID", nomatch=0, allow.cartesian=T]
SymbolixAU
  • 25,502
  • 4
  • 67
  • 139
  • I'm a little confused by this. 1) You have 4 outputs for Ind3 when I only want the 2 from the region file (in my real data, every segment would fall within >= 1 region) 2) How do I modify this such that segments that are outside required intervals are given a value (value = 3 in my data)? I've used the data.table package, but never for something this complex. – Gaius Augustus Mar 24 '16 at 19:52
  • To clarify, note that my output has 1 row per region from the regions file for each individual, with the state within that region (identified by a segment that falls in that region). Whereas your output, for example, line 2, has non-overlapping regions and the listed state. – Gaius Augustus Mar 24 '16 at 20:03
  • 1
    @GaiusAugustus - I've changed my answer to use `foverlaps`. – SymbolixAU Mar 24 '16 at 20:46