2

I am trying to develop a function in R to output random positions in list of given intervals.

My intervals file (14,600 lines) is a tab delimited bed file (chromosome start end name) that looks like this:

1      4953    16204   1
1      16284   16612   1
1      16805   17086   1
1      18561   18757   1
1      18758   19040   1
1      19120   19445   1

Currently my function will generate N random positions within these intervals.

sim_dat <- bpSim(N=10)
head(sim_dat)

  seqnames    start      end width strand
1       1 22686939 22686939     1      *
2       1 14467770 14467770     1      *
3       2 10955472 10955472     1      *
4        X   823201   823201     1      *
5        6 10421738 10421738     1      *
6       17 21827745 21827745     1      *

library(GenomicRanges)
library(rtracklayer)

bpSim <- function(intervals="intervals.bed", N=100, write=F) {
  intFile <- import.bed(intervals)
  space <- sum(width(intFile))
  positions <- sample(c(1:space), N)
  cat("Simulating", N, "breakpoints", sep = " ", "\n")
  new_b <- GRanges(
    seqnames = as.character(rep(seqnames(intFile), width(intFile))),
    ranges = IRanges(start = unlist(mapply(seq, from = start(intFile), to = end(intFile))), width = 1)
  )
  bedOut <- new_b[positions]
  if (write) {
    export.bed(new_b[positions], "simulatedBPs.bed")
  }
  remove(new_b)
  return(data.frame(bedOut))
}

This works, however as I'm not particularly familiar with the GenomicRanges package it's something that I've rather hacked together. I would much prefer to be able to re-write this using base R or packages from tidyverse, so that I can adjust it to, for example, allow the user to specify the chromosome.

It also takes a long time - even for N=10:

system.time(sim_dat <- bpSim(N=10))
Simulating 10 breakpoints 
   user  system elapsed 
 10.689   3.267  13.970 

Ultimately, I'm trying to simulate random positions in a genome, and would therefore need to simulate data hundreds of times for each N.

I would greatly appreciate any advice on how I can:

  • Decrease run time
  • Remove the need for GenomicRanges

In addition - if anyone knows any packages that does this already I would much rather use an exiting package rather than re-invent the wheel.

fugu
  • 6,417
  • 5
  • 40
  • 75
  • 1
    `bedtools` `random` or `shuffle` are lifesavers. Just write a simple R wrapper for bedtools and that's it. – pogibas Mar 07 '18 at 10:43
  • 1
    @PoGibas - this was something that I considered. However, the reason I am generating positions from an interval file is that I am excluding the possibility of generating positions in `unmappable` regions of the genome. As far as I can see, bedtools shuffle only allows you to generate random positions along the length of each chromosome (rather than positions within) – fugu Mar 07 '18 at 10:54
  • 3
    bedtools shuffle has option `-excel` where you can specify gaps in assembly OR you can use `-incl` to specify your intervals – pogibas Mar 07 '18 at 10:58

1 Answers1

6

With ranges being different lengths, I'm assuming you want these randomly chosen positions to be proportional to the length of the segment. In other words, selection is uniform based on actual base pairs within the ranges. Otherwise you'll be over-representing small ranges (higher marker density) and under-representing large ranges (lower marker density).

Here's a data.table solution that can do one thousand sites pretty much instantly, and one million random sites in about 10 seconds on my machine. It randomly samples the number of sites you want, first by sampling rows (weighted by each row's range size), then sampling uniformly within that range.

library(data.table)

nSites <- 1e4

bed <- data.table(chromosome=1, start=c(100,1050,3600,4000,9050), end=c(1000,3000,3700,8000,20000))

# calculate size of range
bed[, size := 1 + end-start]

# Randomly sample bed file rows, proportional to the length of each range
simulated.sites <- bed[sample(.N, size=nSites, replace=TRUE, prob=bed$size)]

# Randomly sample uniformly within each chosen range
simulated.sites[, position := sample(start:end, size=1), by=1:dim(simulated.sites)[1]]

# Remove extra columns and format as needed
simulated.sites[, start  := position]
simulated.sites[, end := position]
simulated.sites[, c("size", "position") := NULL]

This starts with a table such as:

 chromosome start   end  size
          1   100  1000   901
          1  1050  3000  1951
          1  3600  3700   101
          1  4000  8000  4001
          1  9050 20000 10951

With an output such as:

       chromosome start   end
    1:          1 10309 10309
    2:          1  4578  4578
    3:          1  1984  1984
    4:          1 14703 14703
    5:          1 10090 10090
   ---
 9996:          1  1601  1601
 9997:          1  5317  5317
 9998:          1 18918 18918
 9999:          1  1154  1154
10000:          1  7343  7343
caw5cv
  • 701
  • 3
  • 9