I have a file called 'genes' with 4300 lines that looks like
Gene_id chr start stop
GeneA chr1 10 1000
GeneB chr1 2300 7000
GeneC chr1 10000 13577
and another file called 'bases' (~100,000 lines) and looks like
Chr Bases
chr1 160
chr1 157
chr1 8500
chr1 2200
I want to produce a file that keeps the bases that fall into the range between start and stop for each gene
so output would look like
Chr Bases
chr1 160
chr1 157
I have tried this function but it only gives me back the first entry four times:
methC <- apply(bases,1,function(a){
my_bases <- bases[bases[1]==genes$chr & bases[2]>=genes$start & bases[2]<=genes$stop,]
result <- my_bases[,2]
return(result)
})
>methC
# 160 160 160
so I'm missing base 157 and 160 is duplicated 4 times.
If I use
b <- bases[which(bases[1]==genes$chr & bases[2]>=genes$start & bases[2]<=genes$stop),]
> b
# Chr Bases
#chr1 160
I'm still missing 157, but maybe that is due to the order.
However if I try using my real and much larger files with this 'which' function I get back an empty data.frame
> b
Chr Base
<0 rows> (or 0-length row.names)
, so that's why I thought a function would be better to handle large datasets.