5

I got 2 files which I'd like to combine using R.

head(bed)
chr8 41513235 41513282 ANK1.Exon1
chr8 41518973 41519092 ANK1.Exon2

The first one is giving intervals and their names. (Chromosome, from, to, name)

head(coverage)
chr1 41513235 20
chr1 41513236 19
chr1 41513237 19

The second one is giving coverages for single Bases. (Chromosome, position, coverage)

I now want to get the name of each Exon written next to each Position. This will result in some positions with no "Exon" which I want to delete afterwards.

I figured out a ways how to do what I want. However it needs 3 for loops and about 15 hours computing time. Since for loops are not best practice in R I'd like to know if anyone knows a better way than:

coverage <- cbind(coverage, "Exon")
coverage[,4] <- NA

for(i in 1:nrow(bed)){
 for(n in bed[i,2]:bed[i,3]{
  for(m in 1:nrow(coverage)){
   if(coverage[m,2]==n){
    file[m,4] <- bed[i,4]
   }
  }
 }
}

na.omit(coverage)

Since all of the three positions lie in the intervall "ANK1.Exon1", the output should look like this:

head(coverage) 
chr1 41513235 20 ANK1.Exon1 
chr1 41513236 19 ANK1.Exon1 
chr1 41513237 19 ANK1.Exon1 
Stern
  • 101
  • 6
  • 1
    An example output would be useful, but could this be done with a simple merge? – Zfunk May 19 '15 at 08:54
  • I suggest to look at http://stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right – Agaz Wani May 19 '15 at 08:57
  • The output should look like this: head(coverage) chr1 41513235 20 ANK1.Exon1 chr1 41513236 19 ANK1.Exon1 chr1 41513237 19 ANK1.Exon1 I did not know that it is possible to merge intervals in R. I'm trying to find out more about that now. Thank you. – Stern May 19 '15 at 09:06
  • Hi @Stern, can you edit your original question with this information rather than post it in a comment; it makes it much easier to find and understand for others later. Look in to the `join` commands in `dplyr` package if your data is huge. – Phil May 19 '15 at 09:12
  • hi, is `coverage` and `bed` sorted ? .... the algorithm may to be lineal O(max(N,M)) ... – Jose Ricardo Bustos M. May 19 '15 at 17:38

3 Answers3

5

The fastest way to perform what I was looking for was:

library("sqldf")
res <- sqldf("select * from coverage f1 inner join bed f2
on(f1.position >=f2.'from' and f1.position <=f2.'to')")

The calculation time went down to seconds. To get the exact result as indicated above the dataframe was further reduced.

res <- cbind(res[1:4],res[8])

Thank you all for your help.

Edit: For large datasets were the same positions may appear in more than one Chromosome it is helpfull to further add:

res <- sqldf("select * from coverage f1 inner join bed f2
on(f1.position >=f2.'from' and f1.position <=f2.'to' and f1.Chromosome = f2.Chromosome)")
Stern
  • 101
  • 6
2

this algorithm is lineal, if bed and coverage inputs are sorted, and bed input isn't intevals overlapping

> coverage <- read.table("coverage")
> bed <- read.table("bed")
> 
> coverage <- cbind(coverage, "Exon")
> coverage[,4] <- NA
> 
> i_coverage <- 1
> i_bed <- 1
> 
> while(i_coverage <= length(coverage[,1]) && i_bed <= length(bed[,1])) {
+   if(coverage[i_coverage, 2] < bed[i_bed, 2]){
+     i_coverage <- i_coverage + 1
+   }else{
+     #then coverage[i_coverage, 2] >= bed[i_bed, 2]
+     if(coverage[i_coverage, 2] <= bed[i_bed, 3]){
+       coverage[i_coverage,4] <- as.character(bed[i_bed, 4])
+       i_coverage <- i_coverage + 1
+     }else{
+       i_bed <- i_bed + 1
+     }
+   }
+ }

you get:

> print(coverage)
V1       V2 V3     "Exon"
1 chr1 41513235 20 ANK1.Exon1
2 chr1 41513236 19 ANK1.Exon1
3 chr1 41513237 19 ANK1.Exon1
Jose Ricardo Bustos M.
  • 8,016
  • 6
  • 40
  • 62
2

Using GenomicRanges:

library("GenomicRanges")

#data
x1 <- read.table(text="chr1 41513235 41513282 ANK1.Exon1
chr1 41518973 41519092 ANK1.Exon2")

x2 <- read.table(text="chr1 41513235 20
chr1 41513236 19
chr1 41513237 19")

#Convert to Granges object:
g1 <-  GRanges(seqnames=x1$V1,
               IRanges(start=x1$V2,
                       end=x1$V3),
               Exon=x1$V4)


g2 <-  GRanges(seqnames=x2$V1,
               IRanges(start=x2$V2,
                       end=x2$V2),
               covN=x2$V3)
#merge
mergeByOverlaps(g1,g2)

#output
# DataFrame with 3 rows and 4 columns
#                            g1       Exon                          g2      covN
#                     <GRanges>   <factor>                   <GRanges> <integer>
# 1 chr1:*:[41513235, 41513282] ANK1.Exon1 chr1:*:[41513235, 41513235]        20
# 2 chr1:*:[41513235, 41513282] ANK1.Exon1 chr1:*:[41513236, 41513236]        19
# 3 chr1:*:[41513235, 41513282] ANK1.Exon1 chr1:*:[41513237, 41513237]        19
zx8754
  • 52,746
  • 12
  • 114
  • 209