I am trying to develop a function in R to output random positions in list of given intervals.
My intervals file (14,600 lines) is a tab delimited bed
file (chromosome start end name
) that looks like this:
1 4953 16204 1
1 16284 16612 1
1 16805 17086 1
1 18561 18757 1
1 18758 19040 1
1 19120 19445 1
Currently my function will generate N
random positions within these intervals.
sim_dat <- bpSim(N=10)
head(sim_dat)
seqnames start end width strand
1 1 22686939 22686939 1 *
2 1 14467770 14467770 1 *
3 2 10955472 10955472 1 *
4 X 823201 823201 1 *
5 6 10421738 10421738 1 *
6 17 21827745 21827745 1 *
library(GenomicRanges)
library(rtracklayer)
bpSim <- function(intervals="intervals.bed", N=100, write=F) {
intFile <- import.bed(intervals)
space <- sum(width(intFile))
positions <- sample(c(1:space), N)
cat("Simulating", N, "breakpoints", sep = " ", "\n")
new_b <- GRanges(
seqnames = as.character(rep(seqnames(intFile), width(intFile))),
ranges = IRanges(start = unlist(mapply(seq, from = start(intFile), to = end(intFile))), width = 1)
)
bedOut <- new_b[positions]
if (write) {
export.bed(new_b[positions], "simulatedBPs.bed")
}
remove(new_b)
return(data.frame(bedOut))
}
This works, however as I'm not particularly familiar with the GenomicRanges package it's something that I've rather hacked together. I would much prefer to be able to re-write this using base R
or packages from tidyverse
, so that I can adjust it to, for example, allow the user to specify the chromosome.
It also takes a long time - even for N=10
:
system.time(sim_dat <- bpSim(N=10))
Simulating 10 breakpoints
user system elapsed
10.689 3.267 13.970
Ultimately, I'm trying to simulate random positions in a genome, and would therefore need to simulate data hundreds of times for each N
.
I would greatly appreciate any advice on how I can:
- Decrease run time
- Remove the need for
GenomicRanges
In addition - if anyone knows any packages that does this already I would much rather use an exiting package rather than re-invent the wheel.