0

I have a question similar to this one:

Add column(Annotate) one data frame based on column from other data frame in R

Where I have two files as such:

head(geno)

chr     Pos             Val
X       129271111       10
X       129271112       10
X       129271113       10
X       129271114       10
X       129271115       10

And

head(anno)

chr Annot  start       end         
X   promo  129271053   129271110   
X   Fx12   129270618   129270706   
X   Txw    129270020   129270160   
X   Tx3    129267288   129267430   
X   Tx5    129265650   129265774  

And I would like to add a 4th column (Annot from the second file) to the first file based on whether the Pos variable lies between the start and end of the second file.

The only thing I can think of is using a loop to do this and that seems like a real waste of time as the files are so large.

Can someone please provide some advise on how to do this in a tidyverse/dplyr way?

Thank you,

h3ab74
  • 318
  • 3
  • 16

2 Answers2

2

Here is one way - assuming that the start and end ranges don't overlap.

library(dplyr)
library(purrr)

geno %>%
  mutate(Annot = anno$Annot[map_dbl(Pos, ~ which(.x >= anno$start & .x <= anno$end) %>% 
                                      replace(length(.) == 0, NA))])
Ritchie Sacramento
  • 29,890
  • 4
  • 48
  • 56
1

There is a way to do this in base R which doesn't require loops (including the loops hidden away inside apply type functions). That is to use a combination of outer and which. Again, it depends on there being a unique answer for each row of geno.

I have altered your example slightly, since none of the Pos values actually match any of the start - end ranges in the data in the initial question.

Data:

geno
#>   chr       Pos Val
#> 1   X 129265661  10
#> 2   X 129267289  10
#> 3   X 129270628  10
#> 4   X 129270033  10
#> 5   X 129271073  10

anno
#>   chr Annot     start       end
#> 1   X promo 129271053 129271110
#> 2   X  Fx12 129270618 129270706
#> 3   X   Txw 129270020 129270160
#> 4   X   Tx3 129267288 129267430
#> 5   X   Tx5 129265650 129265774

Code:

geno$Annot <- anno$Annot[which(outer(anno$end, geno$Pos, `>`) & 
                               outer(anno$start, geno$Pos, `<`), 
                               arr.ind = TRUE)[, 1]]

Result

geno
#>   chr       Pos Val Annot
#> 1   X 129265661  10   Tx5
#> 2   X 129267289  10   Tx3
#> 3   X 129270628  10  Fx12
#> 4   X 129270033  10   Txw
#> 5   X 129271073  10 promo
Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • @h3ab74 I guess this means that geno$Pos contains values that are within the range of multiple rows of your lookup. Have you checked this? – Allan Cameron Mar 18 '20 at 10:35