1

I have two data.tables, A and B. Data.table A has two columns, "chrom" and "pos", while B represents a series of intervals read from a BED file. I want to add a new column called "select_status" to data.table A. If a row's "pos" falls within any interval in B, the corresponding value in "select_status" should be set to TRUE; otherwise, it should be set to FALSE.

Here is an example to illustrate the data structures:

library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

# I need add a col select_status to A, and set it to Ture if pos in B
# I want someting like this but this is wrong

A[, select_status := any(pos >= B$start & pos <= B$end & chrom == B$chrom)]

or

A[, select_status := sapply(.SD, function(x) any(x >= B$start & x <= B$end)), .SDcols = c("pos"), by = .(chrom)]

A[is.na(select_status), select_status := FALSE]

My solution is not work because Its not compare pos and region match by row in B, pos chr3 399 will also be set to TURE

I know that I can use apply to walk through A line by line and then apply the result of the walk to B as a filter to achieve similar results, but this is slower in cases where the data has many rows, and I wonder if there is another, more concise method

I expected results

A
   chrom pos select_status
1:  chr1 100         FALSE
2:  chr2 200          TRUE
3:  chr3 300          TRUE
4:  chr3 391          TRUE
5:  chr3 399          FALSE
zhang
  • 185
  • 7
  • 1
    Can you show the intended result? I'm guessing `A[B, on=.(chrom, pos>=start, pos<=end), flag := TRUE]` should work? – thelatemail May 22 '23 at 03:26
  • Thanks, I updated it, yeah its seems correctly set `TRUE` and set `FALSE` as `NA`, But it's also enough for me, I am not familiar with `data.table`, could you please share with me some details of the expr? – zhang May 22 '23 at 03:31
  • 1
    Essentially this is joining A & B on the `chrom` variable plus the criteria of `pos` to be between `start` and `end` - anything that matches gets set to `TRUE` - if there are multiple matches it doesn't make any difference as only one row will be output. – thelatemail May 22 '23 at 03:39

1 Answers1

1

Here is an approach that can be considered :

library(data.table)

A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
                pos = c(100, 200, 300, 391, 399))

B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
                start = c(150, 180, 250, 280, 390, 600),
                end = c(200, 220, 300, 320, 393, 900))

X_Val <- eval(parse(text = paste0("c(",  paste0(paste0(B$start, ":", B$end), collapse = ","), ")")))
A[["select_status"]] <- ifelse(A$pos %in% X_Val, TRUE, FALSE)

 A
   chrom pos select_status
1:  chr1 100         FALSE
2:  chr2 200          TRUE
3:  chr3 300          TRUE
4:  chr3 391          TRUE
5:  chr3 399         FALSE
Emmanuel Hamel
  • 1,769
  • 7
  • 19