How to create dummy variable for intervals

Question

I want to add a column of gwas table that indicate which gene it is based on the start and end position of the gene. How to do this in dplyr?

> gwas
# A tibble: 1,220,764 x 13
   CHROM   POS ID             REF   ALT   A1    TEST  OBS_CT     BETA     SE  T_STAT      P ERRCODE
   <int> <int> <chr>          <chr> <chr> <chr> <chr>  <dbl>    <dbl>  <dbl>   <dbl>  <dbl> <chr>  
 1     1 10511 rs534229142    A     G     A     ADD   461822  0.0825  0.0439  1.88   0.0603 .      
 2     1 15031 rs568188357    A     G     A     ADD   461225  0.0410  0.0803  0.511  0.610  .      
 3     1 15245 rs576044687    T     C     T     ADD   461842  0.0369  0.0444  0.831  0.406  .      
 4     1 46285 1:46285_ATAT_A A     ATAT  A     ADD   461698  0.0205  0.0307  0.666  0.505  .      
 5     1 49315 rs567788405    A     T     A     ADD   460193  0.0576  0.0552  1.04   0.297  .      
 6     1 49318 rs536836601    G     A     G     ADD   460428  0.0297  0.0420  0.708  0.479  .      
 7     1 49343 rs553572247    C     T     C     ADD   461927 -0.00305 0.0360 -0.0846 0.933  .      
 8     1 51047 rs559500163    T     A     T     ADD   462466  0.0459  0.0402  1.14   0.254  .      
 9     1 51049 rs528344458    C     A     C     ADD   462466  0.0459  0.0402  1.14   0.254  .      
10     1 51050 rs551668143    T     A     T     ADD   462466  0.0459  0.0402  1.14   0.254  .      
# … with 1,220,754 more rows
> glist
# A tibble: 2,566 x 4
     chr     start       end gene_name
   <int>     <int>     <int> <chr>    
 1     1  33772366  33786699 A3GALT2  
 2     1  12776117  12788726 AADACL3  
 3     1  12704565  12727097 AADACL4  
 4     1  94458393  94586705 ABCA4    
 5     1 229652328 229694442 ABCB10   
 6     1  94883932  94984219 ABCD3    
 7     1 179068461 179198819 ABL2     
 8     1  76190031  76229363 ACADM    
 9     1   1227763   1243269 ACAP3    
10     1 226332379 226374423 ACBD3    
# … with 2,556 more rows

I defined function

find_gene_name <- function(POS) {
  interval <- glist %>% filter(start<=POS, POS<=end)
  name <- interval$gene_name
  if(length(name)) {
    return(NA)
  } else {
    return(name) 
  }
}

But got the error when using rowwise.

> gwas_annotated <- gwas %>% 
+   rowwise() %>%
+   mutate(gene_name=find_gene_name(POS))
Error: Problem with `mutate()` input `gene_name`.
x Input `gene_name` can't be recycled to size 1.
ℹ Input `gene_name` is `find_gene_name(POS)`.
ℹ Input `gene_name` must be size 1, not 0.
ℹ Did you mean: `gene_name = list(find_gene_name(POS))` ?
ℹ The error occurred in row 1.
Run `rlang::last_error()` to see where the error occurred.

Just an additional column of gwas with name gene_name if the POS is in that interval — monotonic, Dec 17 '20 at 23:55
Fastest way would be data.table's non-equi join. You can also try fuzzyjoins. See this example https://stackoverflow.com/questions/24480031/overlap-join-with-start-and-end-positions — Ronak Shah, Dec 18 '20 at 01:20
won't you also run into the problem that there can be more than one gene interval that the `POS` value in the `gwas` table falls into? — brian avery, Dec 18 '20 at 17:23
one of the solutions in this question should work for u . https://stackoverflow.com/questions/64725878/how-to-group-rows-in-a-range-and-consider-a-3rd-column — StupidWolf, Dec 19 '20 at 04:01
otherwise i would say it's better for you to provide a good input and expected output — StupidWolf, Dec 19 '20 at 04:01

brian avery · Answer 1 · 2020-12-21T23:13:04.957

here is a dummy dataset to make things easier (helpful to solve problems, also good to let people know the possible outcomes, e.g. multiple genes, no genes, etc.):

gwas <- data.frame(chr=rep(1,8), 
                   pos=c(10511,15031,15245,30123,46285,49315,49318,51047),
                   ID=LETTERS[1:8])
glist <- data.frame(chr=rep(1,9), 
                    start=c(12,10250,11237,15000,45500,49010,51001,67000,81000),
                    end=c(900,11113,12545,16208,47123,50097,51987,69000,83000),
                    name=c("kitty","tabby","scratch","spot","princess",
                           "buddy","tiger","rocky","peep"))

> gwas
  chr   pos ID
1   1 10511  A
2   1 15031  B
3   1 15245  C
4   1 30123  D
5   1 46285  E
6   1 49315  F
7   1 49318  G
8   1 51047  H
> glist
  chr start   end     name
1   1    12   900    kitty
2   1 10250 11113    tabby
3   1 11237 12545  scratch
4   1 15000 16208     spot
5   1 45500 47123 princess
6   1 49010 50097    buddy
7   1 51001 51987    tiger
8   1 67000 69000    rocky
9   1 81000 83000     peep

here is a way to solve it using nested for loops which lots of R people hate:

res_names <- vector(mode="character")
for(i in 1:nrow(gwas)){
  for(j in 1:nrow(glist)){
    if(gwas$pos[i] <= glist$end[j] & gwas$pos[i] >= glist$start[j]){
      res_names[i] <- glist$name[j]
    }
  }
}
gwas$loop_gene <- res_names

here is a version of your function to find the gene name:

library(dplyr)
find_gene_name <- function(pos) {
  interval <- glist %>% filter(start<=pos, pos<=end)
  if(nrow(interval)<1){
    gname <- NA # or "none" etc. 
  } else {
    gname <- interval$name
  }
  return(gname)
}

then it can be used in an apply call:

gwas$apply_gene <- sapply(gwas$pos, find_gene_name)

or, as you attempted using rowwise and mutate from dplyr:

gwas_annotated <- gwas %>% 
  rowwise() %>%
  mutate(dplyr_gene=find_gene_name(pos))
gwas_annotated

the way I have it set up in a script, I get all 3 gene name cols with the same names:

> gwas_annotated
# A tibble: 8 x 6
# Rowwise: 
    chr   pos ID    loop_gene apply_gene dplyr_gene
  <dbl> <dbl> <chr> <chr>     <chr>      <chr>     
1     1 10511 A     tabby     tabby      tabby     
2     1 15031 B     spot      spot       spot      
3     1 15245 C     spot      spot       spot      
4     1 30123 D     NA        NA         NA        
5     1 46285 E     princess  princess   princess  
6     1 49315 F     buddy     buddy      buddy     
7     1 49318 G     buddy     buddy      buddy     
8     1 51047 H     tiger     tiger      tiger

How to create dummy variable for intervals

1 Answers1