I have two data.tables, A and B. Data.table A has two columns, "chrom" and "pos", while B represents a series of intervals read from a BED file. I want to add a new column called "select_status" to data.table A. If a row's "pos" falls within any interval in B, the corresponding value in "select_status" should be set to TRUE; otherwise, it should be set to FALSE.
Here is an example to illustrate the data structures:
library(data.table)
A <- data.table(chrom = c("chr1", "chr2", "chr3", "chr3", "chr3"),
pos = c(100, 200, 300, 391, 399))
B <- data.table(chrom = c("chr1", "chr2", "chr2", "chr3", "chr3", "chr3"),
start = c(150, 180, 250, 280, 390, 600),
end = c(200, 220, 300, 320, 393, 900))
# I need add a col select_status to A, and set it to Ture if pos in B
# I want someting like this but this is wrong
A[, select_status := any(pos >= B$start & pos <= B$end & chrom == B$chrom)]
or
A[, select_status := sapply(.SD, function(x) any(x >= B$start & x <= B$end)), .SDcols = c("pos"), by = .(chrom)]
A[is.na(select_status), select_status := FALSE]
My solution is not work because Its not compare pos and region match by row in B, pos chr3 399
will also be set to TURE
I know that I can use apply
to walk through A line by line and then apply the result of the walk to B as a filter to achieve similar results, but this is slower in cases where the data has many rows, and I wonder if there is another, more concise method
I expected results
A
chrom pos select_status
1: chr1 100 FALSE
2: chr2 200 TRUE
3: chr3 300 TRUE
4: chr3 391 TRUE
5: chr3 399 FALSE