2

I would like to filter a data frame based on the values present in a second data frame.

For example, match the rows from the first data frame that, in the column "BP", are higher than the first value of the "start_pos" column and smaller than "end_pos" column or just smaller than "end_pos" in the second data frame.

I need to repeat this procedure for all the values in the second data frame. Currently, I am performing these using a for loop. However, I would like to do it in a single command.

Data frame 1

CHR       BP
29   836019
29  4417047
29  7589996
29 11052921
29 14009294
29 33174196

Data frame 2

start_pos end_pos            gene_id
19774   19899 ENSBTAG00000046619
34627   35558 ENSBTAG00000006858
69695   71121 ENSBTAG00000039257
83323   84281 ENSBTAG00000035349
124849  179713 ENSBTAG00000001753
264298  264843 ENSBTAG00000005540
for(j in 1:nrow(tmp_markers)){

      temp_out_markers<- tmp_markers[j,]
      tmp_search<-tmp_gene[which((tmp_markers[j,"BP"]>=tmp_gene[,"start_pos"] & tmp_markers[j,"BP"]<= tmp_gene[,"end_pos"]) | (tmp_markers[j,"BP"]+interval>=tmp_gene[,"start_pos"] & tmp_markers[j,"BP"]+interval <=tmp_gene[,"end_pos"]) | (tmp_markers[j,"BP"]+interval>=tmp_gene[,"start_pos"] & tmp_markers[j,"BP"]+interval <=tmp_gene[,"end_pos"]) | (tmp_markers[j,"BP"]+interval>=tmp_gene[,"start_pos"] & tmp_markers[j,"BP"]+interval >=tmp_gene[,"end_pos"]& tmp_markers[j,"BP"]<=tmp_gene[,"start_pos"])| (tmp_markers[j,"BP"]-interval<=tmp_gene[,"end_pos"] & tmp_markers[j,"BP"]-interval >=tmp_gene[,"start_pos"])|(tmp_markers[j,"BP"]-interval<=tmp_gene[,"end_pos"]  &  tmp_markers[j,"BP"]-interval<=tmp_gene[,"start_pos"] &  tmp_markers[j,"BP"]>=tmp_gene[,"end_pos"])),]


      if(nrow(tmp_search)>0){                     
        temp_out<-cbind(temp_out_markers[rep(seq_len(nrow(tmp_search))),],tmp_search)
        temp_out[,"Distance_from_gene_start"]<-temp_out[,"BP"]-temp_out[,"start_pos"]
        temp_out[,"Distance_from_gene_end"]<-temp_out[,"BP"]-temp_out[,"end_pos"]
        output_genes<-rbind(temp_out,output_genes)
      }
    }

At the end, I want a data frame with all the rows that are within my tested intervals.

PavoDive
  • 6,322
  • 2
  • 29
  • 55
  • 1
    Could you include code for a minimal dataset to illustrate an example? – Gregory Sep 10 '19 at 18:35
  • 1
    Away from my computer at the moment, but take a look at `data.table` non-equi joins, it seems it's what you want. https://github.com/Rdatatable/data.table/wiki or initiate a chat if you want a Spanish version of it – PavoDive Sep 10 '19 at 18:35
  • how you data frames look? paste few line headers – kashiff007 Sep 10 '19 at 18:37
  • 1
    Adding to @PavoDive's comment, `data.table::foverlaps` is one (of a couple) within that package that do joins on inequality. I don't know of any other easy mechanism or package within R that does inequality joins (they might exist), so my guess is that that is going to be in any solution. (And like Gregory said, we can't help much more without seeing sample data, perhaps from `dput(head(x))`. See https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info.) – r2evans Sep 10 '19 at 18:49
  • Thank you all! I provided some few lines as an example from both datasets. I will take a look at data.table in this meantime. – Pablo Fonseca Sep 10 '19 at 18:55
  • Let me see if I understand this well: you want to match to a row in `df1` the corresponding row in `df2` for which `BP >= start_pos & BP <= end_pos`? If so, none of the values in your example will match, and it will return an empty data.frame (the smallest `BP`, 836019 is larger than the largest `end_pos`m 264843) – PavoDive Sep 11 '19 at 01:11

2 Answers2

0

As I stated in a comment, your mock data won't result in a match, as the smallest BP value (836019) is larger than the largest end_pos (264843).

It could be also that I misunderstood altogether your problem!

I understand that you want to match the rows in df1 to those in df2 for which BP >= start_pos and BP <= end_pos. If it's so, we can achieve that using the non-equi joins provided by package data.table.

library(data.table)

result <- dt1[dt2, 
              .(BP, CHR, gene_id), 
              on = .(BP >= start_pos, BP <= end_pos), 
              nomatch = NULL, 
              by = .EACHI]

setnames(result, 1:2, names(dt2)[1:2])

result
   start_pos   end_pos BP CHR gene_id
1:  0.000000  2.000000  0  29  ABCD01
2:  4.571429  6.571429  6  30  ABCD03
3: 11.428571 13.428571 12  31  ABCD06
4: 16.000000 18.000000 18  32  ABCD08
5: 22.857143 24.857143 24  33 ABCD011
6: 29.714286 31.714286 30  34 ABCD014

In case you need the full 15 rows of dt2, simply omit the nomatch = NULL part.

DATA USED:

dt1 <- data.table(CHR = 29:34, 
                  BP = seq(0, 30, length.out = 6), 
                  key = "BP")

dt2 <- data.table(start_pos = seq(0, 32, length.out = 15), 
                  gene_id = paste0("ABCD", rep(0, 3), 1:15))

dt2[, end_pos := start_pos + 2]

setcolorder(dt2, c(1, 3, 2))

Alternative with foverlaps

As @r2evans mentioned in a comment, data.table has another function, foverlaps than can be useful here. It checks if a range overlaps with one in another table, so we need to do a small trick to create a 0-width range in dt1:

dt1[, BP2 := BP]

We also need to have keyed data.tables:

setkey(dt1, "BP", "BP2")
setkey(dt2, "start_pos", "end_pos")

And then pass it to the working horse:

foverlaps(dt1, dt2)

   start_pos   end_pos gene_id CHR BP BP2
1:  0.000000  2.000000  ABCD01  29  0   0
2:  4.571429  6.571429  ABCD03  30  6   6
3: 11.428571 13.428571  ABCD06  31 12  12
4: 16.000000 18.000000  ABCD08  32 18  18
5: 22.857143 24.857143 ABCD011  33 24  24
6: 29.714286 31.714286 ABCD014  34 30  30

Of course we can get rid of BP2 later on by BP2 := NULL.

If we want the full 15 rows of dt2, the it's just inverting the order of the objects in the call:

foverlaps(dt2, dt1)
PavoDive
  • 6,322
  • 2
  • 29
  • 55
0

Thank you very much!

I ended with this solution and it is working very well.

foverlaps(tmp_gene, tmp_markers, by.x = c("start_pos","end_pos"), by.y = 
key(tmp_markers),nomatch = 0)

Cheers.