I have a data.frame where each row is a linear interval - specifically these intervals are start and end coordinates on chromosomes (chr below):

df <- data.frame(chr = c("chr1","chr2","chr2","chr3"),
                 strand = c("+","+","-","-"),
                 start = c(34,23,67,51),
                 end = c(52,49,99,120),
                 stringsAsFactors = FALSE)

A chromosome has two strands, hence the strand column.

I'd like to spread these intervals to a width of 1 thereby replacing the start and end columns with a position column. So far I'm using this:

# Expand each row into one row per position, then bind the pieces together
spread.df <- do.call(rbind, lapply(seq_len(nrow(df)), function(i)
  data.frame(chr = df$chr[i], strand = df$strand[i], position = df$start[i]:df$end[i], stringsAsFactors = FALSE)
))

But for the number of intervals I have and their sizes it's a bit slow. So my question is if there's a faster alternative.

dan

1 Answer

Using map2 from purrr would be fast:

library(dplyr)
library(purrr)
library(tidyr)
df %>%
  transmute(chr, strand, position = map2(start, end, `:`)) %>%
  unnest(position)

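Or use data.table:

library(data.table)
# Grouping by chr and strand assumes each (chr, strand) pair appears in only one row, as in the example data
setDT(df)[, .(position = start:end), by = .(chr, strand)]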
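
A further option, added here as a hedged sketch rather than as part of the answer above, is to stay in base R and avoid building one object per interval. It assumes start <= end in every row and that both ends are inclusive, as in the example. The names widths and idx are illustrative only.

```r
# Example data from the question
df <- data.frame(chr = c("chr1","chr2","chr2","chr3"),
                 strand = c("+","+","-","-"),
                 start = c(34,23,67,51),
                 end = c(52,49,99,120),
                 stringsAsFactors = FALSE)

# Number of positions covered by each interval (both ends inclusive)
widths <- df$end - df$start + 1

# Repeat each row index once per position in that row's interval
idx <- rep(seq_len(nrow(df)), times = widths)

# sequence(widths) counts 1..width within each interval,
# so start + sequence(widths) - 1 runs from start to end
spread.df <- data.frame(chr = df$chr[idx],
                        strand = df$strand[idx],
                        position = df$start[idx] + sequence(widths) - 1,
                        stringsAsFactors = FALSE)
```

Because each row is expanded by its own index, this does not depend on the (chr, strand) pairs being unique. Whether it is faster than the approaches above on real data would need to be benchmarked.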
akrun