
I have a problem where I get information on the ranges of occupied cells. Each "test" can have multiple start and end entries, and these ranges can overlap within the same test. Not every test has entries. I have a data frame in R and want to merge the ranges for each test so that only the unique, non-overlapping ranges remain.

x<-data.frame(test=c(2,3,3,2,3,4),start=c(1,1,1,2,3,4),end=c(1,2,3,3,4,4))
> x
  test start end
1    2     1   1
2    3     1   2
3    3     1   3
4    2     2   3
5    3     3   4
6    4     4   4

I would like to transform this data frame into:

  test start end
1    2     1   1
2    2     2   3
3    3     1   4
4    4     4   4
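
One possible base R approach, given as a sketch rather than a definitive solution: sort each test's ranges by start, track the furthest end seen so far with `cummax`, and start a new merged block whenever a range begins past that point. The helper name `merge_ranges` is hypothetical. The strict `>` comparison is an assumption taken from the desired output, where touching-but-not-overlapping ranges such as (1,1) and (2,3) stay separate.

```r
# Hypothetical helper: merge overlapping [start, end] ranges of one test.
merge_ranges <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  # Furthest end among all earlier ranges (-Inf for the first range)
  prev_end <- c(-Inf, cummax(end)[-length(end)])
  # A new block begins where a range starts beyond everything before it
  block <- cumsum(start > prev_end)
  data.frame(start = as.vector(tapply(start, block, min)),
             end   = as.vector(tapply(end, block, max)))
}

x <- data.frame(test  = c(2, 3, 3, 2, 3, 4),
                start = c(1, 1, 1, 2, 3, 4),
                end   = c(1, 2, 3, 3, 4, 4))

# Apply the helper to each test separately and stack the results
pieces <- lapply(split(x, x$test), function(d) {
  cbind(test = d$test[1], merge_ranges(d$start, d$end))
})
y <- do.call(rbind, pieces)
rownames(y) <- NULL
y
```

For the sample `x`, this should reproduce the four-row table above. Changing `start > prev_end` to `start > prev_end + 1` would also merge adjacent ranges, if that is preferred.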

In the end I just want to know how many cells are occupied by the ranges for each test (each "row" of cells). Test 2 has (1,1) and (2,3), which means 3 cells. Test 3 has (1,4), so 4 cells. Test 4 has (4,4), so 1 cell. Tests 1 and 5 to n have no entries, so they occupy 0 cells:

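# y is the merged data frame shown above, entered by hand here
y <- data.frame(test = c(2, 2, 3, 4), start = c(1, 2, 1, 4), end = c(1, 3, 4, 4))
u <- unique(y$test)
a <- rep(0, length(u))
for (i in seq_along(u)) {
  rows <- y$test == u[i]
  # each range covers end - start + 1 cells
  a[i] <- sum(y$end[rows] - y$start[rows] + 1)
}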
> a
[1] 3 4 1
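
If only the counts are needed, the merge step can be skipped. The sketch below expands every range into its individual cells and counts the distinct ones per test, then fills in 0 for tests with no entries. It assumes integer cell indices and ranges small enough to expand in memory. `n_tests` and `cells_for` are hypothetical names, and the value 6 is made up for illustration because the question does not give n.

```r
x <- data.frame(test  = c(2, 3, 3, 2, 3, 4),
                start = c(1, 1, 1, 2, 3, 4),
                end   = c(1, 2, 3, 3, 4, 4))

n_tests <- 6  # hypothetical total number of tests; not stated in the question

# Distinct cells covered by all ranges of one test
cells_for <- function(d) length(unique(unlist(Map(seq, d$start, d$end))))
occupied <- vapply(split(x, x$test), cells_for, integer(1))

# Start from all zeros so tests without entries stay at 0
counts <- integer(n_tests)
counts[as.integer(names(occupied))] <- occupied
counts
```

Here position i of `counts` holds the number of occupied cells for test i, so overlapping ranges are never counted twice.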
Tony Hellmuth
  • This looks like a standard task in bioinformatics. Thus, you should look for available tools. I'm pretty sure that you can use package [IRanges](https://bioconductor.org/packages/release/bioc/html/IRanges.html). – Roland Apr 11 '18 at 07:31
  • I think so, but sadly I would require an answer in base R due to inability to download packages on servers. – Tony Hellmuth Apr 11 '18 at 07:37
  • I'm sorry, but I consider installing packages on a server a necessity and easier than doing everything in base R. – Roland Apr 11 '18 at 07:38
  • Why have you tagged your question with non-`base` packages then? These days, `dplyr` and `data.table` are used in parallel with `base` by very many people, so you need to be _very_ explicit if you absolutely cannot use anything other than `base`. – Henrik Apr 11 '18 at 07:38
  • These are the only packages installed on the server: foreach, base64enc, bayesm, Formula, class, g.data, cluster, numDeriv, scales, codetools, permute, date, spatial, psy, digest, pwt, statmod, stringr, iterators, lattice, latticeExtra, timeDate, evaluate, tseries, fastcluster, fBasics, XML, Matrix, rjson, zoo, car, plyr, sqldf, dplyr, lubridate, randomForest, survival, data.table, parallel, xts, neuralnet, e1071, caret, deepnet, tm, bit64, glmnet, forecast, reshape2, xgboost, readr – Tony Hellmuth Apr 11 '18 at 07:42
  • OK! It seems like you have access to both `data.table` and `dplyr`, so the answers above (and links therein) should get you going. Good luck! Cheers – Henrik Apr 11 '18 at 09:13
  • Awesome! I got the method. I still have some problems with my answers, but I understand what to do! Check out my solution: – Tony Hellmuth Apr 12 '18 at 03:12

0 Answers