I have a BED file containing restriction fragments of the mouse genome. Each fragment has a different width, like this:

   chr   start     end   width
1 chr1       0 3000534 3000534
2 chr1 3000535 3000799     264
3 chr1 3000800 3001209     409
4 chr1 3001210 3001496     286
5 chr1 3001497 3002121     624

Is it possible to combine shorter fragments (< 500 bp) with adjacent fragments using R (see example below), and if so, how?

   chr   start     end   width
1 chr1       0 3000534 3000534
2 chr1 3000535 3001209     673
3 chr1 3001210 3002121     910

Note that I don't want to filter out fragments under a certain length, so subsetting the data is not an option.

I hope my question is not too confusing…

Jeannine
  • @Jeanine Do you want to combine the fragments only if both the adjacent fragments are <500 width? In the expected result, 3rd row seems to be not the case. – akrun Oct 27 '14 at 06:34
  • @akrun. Ideally every fragment <500 should be added on to the next one until a minimum fragment size of 500 is achieved (even if that requires combining more than 2 fragments). – Jeannine Oct 27 '14 at 07:04
  • @Jeanine Thanks, I didn't catch the until minimum fragment size earlier – akrun Oct 27 '14 at 07:06
  • what about the 12bp wide segment ? – Cath Oct 27 '14 at 08:07
  • @Jeannine [This post](http://stackoverflow.com/questions/15466880/cumulative-sum-until-maximum-reached-then-repeat-from-zero-in-the-next-row) might help you solve your problem. I tried but couldn't directly apply the methods. However I'm sure they can be used in some way. A `for`-loop would be easy but painfully slow on large datasets (as I imagine you have). – Anders Ellern Bilgrau Oct 27 '14 at 08:30

1 Answer

Here is a first solution. It assumes that chr stays the same throughout, and it drops the last fragment if it is still < 500 after merging (the result is the data frame you gave in your example):

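# Example data: the fragments from the question plus a trailing 12 bp fragment
mydata <- data.frame(chr = rep("chr1", 6),
                     start = c(0, 3000535, 3000800, 3001210, 3001497, 3002122),
                     end = c(3000534, 3000799, 3001209, 3001496, 3002121, 3002134),
                     width = c(3000534, 264, 409, 286, 624, 12),
                     stringsAsFactors = FALSE)

i <- 1
while (i < nrow(mydata)) {
    if (mydata$width[i] >= 500) {
        i <- i + 1  # fragment is long enough, move on to the next one
    } else {
        # absorb the next fragment into the current one, then drop it
        mydata$end[i] <- mydata$end[i + 1]
        mydata$width[i] <- sum(mydata$width[i:(i + 1)])
        mydata <- mydata[-(i + 1), ]
    }
}
# drop the last fragment if it is still shorter than 500 bp
if (mydata$width[i] < 500) mydata <- mydata[-i, ]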
Cath
  • Your code works, but only on small data frames. However, my data frame has 6x10^6 rows, and even if I split it into smaller chunks it still takes ages… Does anyone have any other suggestions? – Jeannine Oct 28 '14 at 05:23
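
One possible way around the slowdown, sketched under some assumptions: the while loop in the answer removes a row from the data frame on every merge, and each removal builds a new data frame, which is probably what makes it so slow on millions of rows. The sketch below instead walks over the width column once, gives every row a merged-fragment id, and builds the result in a single step at the end. The function name `merge_short_fragments` and its arguments `min_width` and `drop_short` are made up for this illustration. It assumes the rows are sorted by chr and start and that neighbouring fragments are contiguous. Unlike the answer, it also starts a new fragment whenever chr changes.

```r
# Hypothetical helper (not part of the answer above): merge consecutive
# fragments until every merged fragment is at least `min_width` wide.
# Assumes `frags` has columns chr, start, end, width and is sorted by chr, start.
merge_short_fragments <- function(frags, min_width = 500, drop_short = TRUE) {
  n <- nrow(frags)
  if (n == 0L) return(frags)
  width <- frags$width
  chr <- as.character(frags$chr)
  group <- integer(n)  # merged-fragment id for every input row
  g <- 1L
  acc <- 0             # width accumulated so far in the current group
  for (i in seq_len(n)) {
    # never merge across a chromosome boundary
    if (i > 1L && chr[i] != chr[i - 1L] && acc > 0) {
      g <- g + 1L
      acc <- 0
    }
    group[i] <- g
    acc <- acc + width[i]
    if (acc >= min_width) {  # group is long enough: close it
      g <- g + 1L
      acc <- 0
    }
  }
  first <- !duplicated(group)                   # first row of each group
  last <- !duplicated(group, fromLast = TRUE)   # last row of each group
  out <- data.frame(chr = chr[first],
                    start = frags$start[first],
                    end = frags$end[last],
                    width = as.vector(rowsum(width, group, reorder = FALSE)),
                    stringsAsFactors = FALSE)
  # groups that stayed short (end of a chromosome or of the table) are
  # dropped by default, mirroring the last line of the answer's code
  if (drop_short) out <- out[out$width >= min_width, , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Same example data as in the answer
frags <- data.frame(chr = rep("chr1", 6),
                    start = c(0, 3000535, 3000800, 3001210, 3001497, 3002122),
                    end = c(3000534, 3000799, 3001209, 3001496, 3002121, 3002134),
                    width = c(3000534, 264, 409, 286, 624, 12),
                    stringsAsFactors = FALSE)
merged <- merge_short_fragments(frags)
print(merged)  # three rows, with widths 3000534, 673 and 910
```

Because the data frame is only touched once, after the loop, the cost should grow roughly linearly with the number of rows. Setting `drop_short = FALSE` keeps a short leftover fragment such as the 12 bp one instead of discarding it.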