I have a data frame with fixed-size bins, a series of elements that fall within each bin, and the amount of overlap between each element and its bin. Each element is labelled by the factor "state". Below is a small example with three different levels (16_Rpt, 18_LowL, 5_EnhM).

chr bin_start bin_stop state_start state_stop state overlap
chr1 3000000 3500000 3325000 3325800 16_Rpt 800
chr1 3000000 3500000 3325800 3390000 18_LowL 64200
chr1 3000000 3500000 3390000 3390200 5_EnhM 200
chr1 3500000 4000000 3390200 3504800 18_LowL 4800
chr1 3500000 4000000 3504800 3505400 5_EnhM 600
chr1 3500000 4000000 3505400 3541000 18_LowL 35600
chr1 4000000 4500000 3794200 4155600 18_LowL 155600
chr1 4000000 4500000 4155600 4156600 16_Rpt 1000
chr1 4000000 4500000 4156600 4166200 18_LowL 9600

I would like to sum the overlaps for each level within each bin. Once the overlaps have been summed, I will drop the duplicate rows so that each level appears only once per bin.
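To make the goal concrete, here is a minimal sketch of the result described above. It rebuilds the example rows as a toy data frame and uses base R's `aggregate()` as one possible way to get per-bin, per-state totals. The names `toy` and `totals` are made up for illustration, and this is not the `by()` approach attempted below.

```r
# Toy copy of the example rows (only the columns needed for the sums).
toy <- data.frame(
  chr       = "chr1",
  bin_start = rep(c(3000000, 3500000, 4000000), each = 3),
  bin_stop  = rep(c(3500000, 4000000, 4500000), each = 3),
  state     = c("16_Rpt", "18_LowL", "5_EnhM",
                "18_LowL", "5_EnhM", "18_LowL",
                "18_LowL", "16_Rpt", "18_LowL"),
  overlap   = c(800, 64200, 200, 4800, 600, 35600, 155600, 1000, 9600)
)

# Sum overlap within every (chr, bin, state) group.
# Each group collapses to a single row, so the duplicates disappear as well.
totals <- aggregate(overlap ~ chr + bin_start + bin_stop + state,
                    data = toy, FUN = sum)

# Order by bin and state for readability.
totals <- totals[order(totals$bin_start, totals$state), ]
print(totals)
```

With these rows, the two 18_LowL entries in the bin starting at 3500000 should collapse into a single row with overlap 40400, and likewise for the bin starting at 4000000.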

I tried using by to subset the data frame for all repeating bins, and then applying sum over this subset for every level of "state":

df <- by(df[duplicated(df$bin_start) | duplicated(df$bin_start,fromLast = TRUE),],
         df$overlap,
         sum)

However, I get the following error:

Error in tapply(seq_len(36386L), list(df$overlap = c(500000L, : arguments must have same length Calls: by ... by.data.frame -> structure -> eval -> eval -> tapply

Can someone point out what is wrong with this approach? (I hope the error is not too far out of context: it came from a much larger data frame with many more levels.)

Abdou
  • What is the expected output for the above example? – Gopala Dec 07 '16 at 03:02
  • Judging from the question title, I suspect the answer is in the answers to: http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega?s=1|6.2674 – IRTFM Dec 07 '16 at 04:25