I have a data frame with bins, a series of elements that fall within each bin, and the amount of overlap for each element, labelled by the factor "state". Below is a small example with three different levels (16_Rpt, 18_LowL, 5_EnhM).
chr bin_start bin_stop state_start state_stop state overlap
chr1 3000000 3500000 3325000 3325800 16_Rpt 800
chr1 3000000 3500000 3325800 3390000 18_LowL 64200
chr1 3000000 3500000 3390000 3390200 5_EnhM 200
chr1 3500000 4000000 3390200 3504800 18_LowL 4800
chr1 3500000 4000000 3504800 3505400 5_EnhM 600
chr1 3500000 4000000 3505400 3541000 18_LowL 35600
chr1 4000000 4500000 3794200 4155600 18_LowL 155600
chr1 4000000 4500000 4155600 4156600 16_Rpt 1000
chr1 4000000 4500000 4156600 4166200 18_LowL 9600
I would like to sum the overlaps for each level of "state" within each bin, so that a level that occurs several times in a bin ends up with a single total. Once the overlaps have been summed, I will delete the duplicate rows for each level in a given bin.
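To make the goal concrete, here is a minimal sketch of the output I am after, built on the example rows above. It uses aggregate only to illustrate the target result (that choice is an assumption for illustration, not the approach I attempted), and the name totals is hypothetical.

```r
# Rebuild the example data frame shown above.
df <- read.table(text = "chr bin_start bin_stop state_start state_stop state overlap
chr1 3000000 3500000 3325000 3325800 16_Rpt 800
chr1 3000000 3500000 3325800 3390000 18_LowL 64200
chr1 3000000 3500000 3390000 3390200 5_EnhM 200
chr1 3500000 4000000 3390200 3504800 18_LowL 4800
chr1 3500000 4000000 3504800 3505400 5_EnhM 600
chr1 3500000 4000000 3505400 3541000 18_LowL 35600
chr1 4000000 4500000 3794200 4155600 18_LowL 155600
chr1 4000000 4500000 4155600 4156600 16_Rpt 1000
chr1 4000000 4500000 4156600 4166200 18_LowL 9600",
                 header = TRUE, stringsAsFactors = FALSE)

# Sum overlap for every (chr, bin, state) combination.
# Illustrative only: 'totals' is a hypothetical name.
totals <- aggregate(overlap ~ chr + bin_start + bin_stop + state,
                    data = df, FUN = sum)

# Order by bin, then by state, for readability.
totals <- totals[order(totals$bin_start, totals$state), ]
print(totals)
```

In this sketch each bin keeps one row per level, so the two 18_LowL rows in the 3500000 bin collapse into a single row holding their combined overlap.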
I tried using by to subset the data frame for all repeating bins, and then applying sum over this subset for every level of "state":
df <- by(df[duplicated(df$bin_start) | duplicated(df$bin_start,fromLast = TRUE),],
df$overlap,
sum)
However, I get the following error:
Error in tapply(seq_len(36386L), list(`df$overlap` = c(500000L, : arguments must have same length
Calls: by ... by.data.frame -> structure -> eval -> eval -> tapply
Can someone point out what is wrong with this approach? (I hope the error is not too out of context since it was given on a much larger frame with many more levels.)