
I have two dataframes such as:

gene_bacteriadf

 seqnames    ranges strand
  [1] scaffold_1      1-50      -
  [2] scaffold_1    60-100      -
  [3] scaffold_1   200-350      -
  [4] scaffold_2 1550-1650      +
  [5] scaffold_2 1900-2300      -
  [6] scaffold_5   250-255      +

and overlapdf

seqnames    ranges strand hit with_busco with_bacteria Overlap_with 
scaffold_2 1550-1650      + |      TRUE       101        201        101 0.502487562189055  

The idea is simply to remove from gene_bacteriadf every row that matches a row of overlapdf on the columns seqnames, ranges and strand. I tried:

gene_bacteriadf[!(alist(gene_bacteriadf$seqnames, gene_bacteriadf$start, gene_bacteriadf$end, gene_bacteriadf$width) %in% alist(overlapdf$seqnames, overlapdf$start, overlapdf$end, overlapdf$width)), ]

But it does not work.

Here, in the example, the row scaffold_2 1550-1650 + matches, so I should get a new data frame such as:

seqnames    ranges strand

  [1] scaffold_1      1-50      -
  [2] scaffold_1    60-100      -
  [3] scaffold_1   200-350      -
  [5] scaffold_2 1900-2300      -
  [6] scaffold_5   250-255      +

Does someone have an idea?

s_baldur
bewolf

1 Answer


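This calls for dplyr's anti_join, which keeps only the rows of the first data frame that have no match in the second. It is especially convenient here because the key columns have the same names in both data frames: with no by argument, anti_join joins on all columns the two data frames have in common.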

library(dplyr)

gene_bacteriadf %>% 
  anti_join(overlapdf)

Joining, by = c("seqnames", "ranges", "strand")
    seqnames    ranges strand
1 scaffold_1      1-50      -
2 scaffold_1    60-100      -
3 scaffold_1   200-350      -
4 scaffold_2 1900-2300      -
5 scaffold_5   250-255      +
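
If adding dplyr is not an option, the same filtering can be sketched in base R by pasting the key columns into one composite key per row and testing membership with %in%. This is only a minimal sketch: the toy data below is retyped from the question, it assumes both objects are plain data frames with character columns (the originals may well come from GRanges objects), and key_cols and make_key are hypothetical helper names, not part of any package.

```r
# Toy data retyped from the question; plain character columns are an assumption.
gene_bacteriadf <- data.frame(
  seqnames = c("scaffold_1", "scaffold_1", "scaffold_1",
               "scaffold_2", "scaffold_2", "scaffold_5"),
  ranges   = c("1-50", "60-100", "200-350",
               "1550-1650", "1900-2300", "250-255"),
  strand   = c("-", "-", "-", "+", "-", "+"),
  stringsAsFactors = FALSE
)
overlapdf <- data.frame(
  seqnames = "scaffold_2", ranges = "1550-1650", strand = "+",
  stringsAsFactors = FALSE
)

# Columns that must all agree for two rows to count as a match.
key_cols <- c("seqnames", "ranges", "strand")

# Hypothetical helper: one composite key string per row.
# "\r" is used as a separator that should not occur in the data.
make_key <- function(df) do.call(paste, c(df[key_cols], sep = "\r"))

# Keep only the rows whose key does not appear in overlapdf.
filtered <- gene_bacteriadf[!(make_key(gene_bacteriadf) %in% make_key(overlapdf)), ]
filtered
```

Unlike anti_join, this keeps the original row names, so the remaining rows are still labelled 1, 2, 3, 5 and 6. With dplyr, passing by = c("seqnames", "ranges", "strand") explicitly to anti_join has the same effect as the default here and silences the "Joining, by = ..." message.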
phiver