1

I have a file called 'genes' with 4300 lines that looks like

Gene_id  chr  start   stop              
GeneA chr1  10  1000                 
GeneB chr1  2300  7000                     
GeneC chr1 10000 13577                

and another file called 'bases' (~100,000 lines) and looks like

Chr Bases          
chr1 160           
chr1 157             
chr1 8500           
chr1 2200               

I want to produce a file that keeps the bases that fall into the range between start and stop for each gene

so output would look like

Chr Bases             
chr1 160            
chr1 157                

I have tried this function but it only gives me back the first entry four times:

methC <- apply(bases,1,function(a){
my_bases <- bases[bases[1]==genes$chr & bases[2]>=genes$start & bases[2]<=genes$stop,]
result <- my_bases[,2]
return(result)
})

>methC
# 160 160 160

so I'm missing base 157 and 160 is duplicated 4 times.

If I use

b <- bases[which(bases[1]==genes$chr & bases[2]>=genes$start & bases[2]<=genes$stop),]
> b
 #  Chr Bases
#chr1   160

I'm still missing 157, but maybe that is due to the order.

However if I try using my real and much larger files with this 'which' function I get back an empty data.frame

> b
Chr        Base       
<0 rows> (or 0-length row.names)

, so that's why I thought a function would be better to handle large datasets.

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
user3816990
  • 247
  • 1
  • 2
  • 10
  • 2
    May be this link helps http://stackoverflow.com/questions/27619381/data-frame-lookup-value-in-range-and-return-different-column or http://stackoverflow.com/questions/19748535/finding-overlapping-ranges-between-two-interval-data/19752720#19752720 – akrun Jan 25 '15 at 17:09
  • And this: http://stackoverflow.com/questions/24480031/roll-join-with-start-end-window/25655497#25655497 – Henrik Jan 25 '15 at 18:07

1 Answers1

2

I would use the data table library:

library("data.table")

# Read the data from file
genes <- data.table(read.csv("genes.csv"))
bases <- data.table(read.csv("bases.csv"))

# Define the columns that are used to join the two tables
setkey(genes, chr)
setkey(bases, Chr)

# The first line joins the two tables, and the second line
# filters the result to keep only those that match and defines
# the columns that are to be shown in the output. Note the need
# to take care with capitalisation and the reserved word 'stop'
genes[bases, allow.cartesian = TRUE][
  Bases >= start & Bases <= `stop`, list(Chr = chr, Bases)]

Which produces this result:

    Chr Bases
1: chr1   160
2: chr1   157
  • You can use `fread()` to read in the file. And `foverlaps()` to perform overlapping range joins. – Arun Jan 26 '15 at 22:25