1

A simpler version of the original question which I asked but nobody answered it yet.

I have a huge input file (a representative sample of which is shown below as input):

> input
           CT1           CT2           CT3
1 chr1:200-400  chr1:250-450  chr1:400-800
2 chr1:800-970  chr2:200-500  chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400

I want to process it by following a rule (described below) so that I get an output like:

 > output
              CT1 CT2 CT3
chr1:200-400    1   1   0
chr1:800-970    1   0   1
chr2:300-700    1   1   0
chr1:250-450    1   1   1
chr2:200-500    1   1   0
chr2:600-1000   1   1   1
chr1:400-800    0   1   1
chr1:700-870    1   0   1
chr2:700-1400   0   1   1

Rule: Take every index (the first in this case is chr1:200-400) of the dataframe, see if it overlaps with any other value in the dataframe. If yes, write 1 below that column in which it exists, if not write 0.

For example, if we take 1st index of the input input[1,1] which is chr1:200-400. As it exists in column 1 we will write 1 below it. Now we will check if this range overlap with any other range which exists in any of the other columns in the input. This value overlaps only with the first value (chr1:250-450) of the second column (CT2), therefore, we write 1 below that as well. As there is no overlap with any of the values in CT3, we write 0 below CT3 in the output dataframe.

Here are the dput of input and output:

> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400", 
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450", 
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800", 
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1", 
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L, 
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L, 
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400", 
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500", 
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))
Jaap
  • 81,064
  • 34
  • 182
  • 193
Newbie
  • 411
  • 5
  • 18
  • So this is the exact copy of your other question? – zx8754 Jan 11 '18 at 12:58
  • Not the exact copy, there the rule was bit complicated, where I want `1` for only those overlaps which are more than 50% of the total range. Here I am asking only for the overlaps. If there is any overlap, assign 1. – Newbie Jan 11 '18 at 13:03
  • 1. convert wide-to-long, 2. parse the strings into 3 columns "chr, start, end". 3. split on CTs, 4. merge on overlap, [example1](https://stackoverflow.com/questions/23084322), [example2](https://stackoverflow.com/questions/24480031/roll-join-with-start-end-window) etc – zx8754 Jan 11 '18 at 13:08
  • you would need to define 'overlap' because I cannot understand the allocation of zeros and ones in your output - I would expect something else – RolandASc Jan 11 '18 at 13:24
  • @RolandASc Overlap between `input[1,1]` (200-400) and `input[1,2]` (250-400) is 150. Additinoally `chr` should be same, which is `chr1` in this case. And there is no overlap of `input[1,1]` region with any of the `input[,3]` regions as for chr1 they start from 400. Does it make sense now? – Newbie Jan 11 '18 at 13:29
  • shouldn't row 2 CT3 be 1 then? or maybe this is the output table from your referenced first question with the 50% hurdle? – RolandASc Jan 11 '18 at 13:54
  • @RolandASc Yes, you are right. I have made the output compatible to this question. – Newbie Jan 11 '18 at 14:02

1 Answers1

3

A possible solution using the data.table-package:

# load the 'data.table'-package and convert 'input' to a data.table with 'setDT'
library(data.table)
setDT(input)

# reshape 'input' to long format and split the strings in 3 columns
DT <- melt(input, measure.vars = 1:3)[, c('chr','low','high') := tstrsplit(value, split = ':|-', type.convert = TRUE)
                                      , by = variable][]

# create aggregation function; needed in the ast reshape step
f <- function(x) as.integer(length(x) > 0)

# cartesian self join & reshape result back to wide format with aggregation function
DT[DT, on = .(chr, low < high, high > low), allow.cartesian = TRUE
   ][, dcast(.SD, value ~ i.variable, fun = f)]

which gives:

           value CT1 CT2 CT3
1:  chr1:200-400   1   1   0
2:  chr1:250-450   1   1   1
3:  chr1:400-800   0   1   1
4:  chr1:700-870   1   0   1
5:  chr1:800-970   1   0   1
6:  chr2:200-500   1   1   0
7:  chr2:300-700   1   1   0
8: chr2:600-1000   1   1   1
9: chr2:700-1400   0   1   1
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • Thanks a lot for suggesting a detailed solution. I think there is a problem here, for region `chr2:200-500`, it overlaps with `chr2:300-700` of `CT1` but the value in output is 0 for this. – Newbie Jan 11 '18 at 14:31
  • @Newbie I thought the check for overlap should be done row. If I understand you correctly that isn't necessary. See the update. HTH – Jaap Jan 11 '18 at 14:44
  • Now it is working fine. I think you removed the `rn` part in the update. Can you please suggest me how can I put one more condition for selecting the overlap. i.e. Assign 1 only when any one of the two ranges being compared is overlapping with more than 50% of its entire range. For example overlap between 200-400 & 300-450 will be considered one (actual overlap is 100 which is more than 50% of the entire 2nd range which is 150) but overlap between 200-400 & 350-550 will be considered 0. – Newbie Jan 11 '18 at 14:56
  • @Newbie will try to look at it tonight – Jaap Jan 11 '18 at 15:46
  • There is a problem with this code when I apply it on actual data of dimension (66186*23) which also contains NA. I get the results for first three columns only. I have checked the class of my actual input data and it is the same (`"data.table" "data.frame"`) as the input sample provided here. Can you please guide me what is going wrong here. Thank you. – Newbie Jan 12 '18 at 10:06
  • I fixed it by changing `measure.vars = 1:23` and then deleting the NA values by `DT <- na.omit(DT)` – Newbie Jan 12 '18 at 10:21
  • @Newbie It is probably better to use the `na.rm = TRUE` parameter inside `melt` than `na.omit` afterwards. [See also this answer for an example](https://stackoverflow.com/a/25856135/2204410). – Jaap Jan 12 '18 at 10:47
  • @Newbie I haven't found the time to look at your other problem. Will do that tomorrow (or maybe this evening). I will definitely have time then. – Jaap Jan 12 '18 at 10:49
  • I've got one answer for my that question (it is mentioned as an original question in the first line of this post). It works on given input sample but due to hard-coded variable names I couldn't apply it on actual data. Maybe its better to just look at that solution and suggest an edit to make it generalized, so that it can be directly applied to the input dataframe. Thanks. – Newbie Jan 12 '18 at 11:24