Assign gene names from another dataframe based on SNP position and gene start/end coordinates

Question

I have two dataframes: one with a list of SNPs and their positions, and another with a list of genes and their start and end coordinates. Using dplyr, I'd like to add a column to the SNP dataframe that has the name of the gene that each SNP falls within (i.e. the position of the SNP is on the same chromosome, and falls between the start/end coordinates, inclusive, of the gene).

If a SNP doesn't fall within any gene coordinates, it should get "NA" in the gene column. The chromosome number between SNP and gene must match. For example, even though the position of the second SNP falls within the start/end coordinates of Gene4, that is not a match because they are on different chromosomes.

SNP dataframe:

CHR  POS  REF  ALT
01   5    C    T
01   10   G    A
02   5    G    T
02   15   C    A
02   20   T    C
03   10   A    G
03   20   C    T

GENE dataframe:

CHR  START  END  GENE_NAME
01   2      8    Gene1
01   12     20   Gene2
01   25     30   Gene3
02   10     18   Gene4
02   25     35   Gene5
03   5      15   Gene6

Desired Output:

CHR  POS  REF  ALT  GENE_NAME
01   5    C    T    Gene1
01   10   G    A    NA
02   5    G    T    NA
02   15   C    A    Gene4
02   20   T    C    NA
03   10   A    G    Gene6
03   20   C    T    NA

Again, I'd like to accomplish this using dplyr. Thanks in advance for any help!

I think the desired output should contain `Gene6` for the SNP at CHR03:10, not `Gene5`. — neilfws, Jun 05 '19 at 04:38
@akrun thanks, I added one, too, and there are many dplyr solution below. — zx8754, Jun 05 '19 at 06:14
Oops, you're right. It should be Gene6. I've edited the question. — nad7wf, Jun 05 '19 at 13:26

Ronak Shah · Accepted Answer · 2019-06-05T04:22:48.753

One option using map2_chr from purrr is to filter the GENE dataframe based on POS and CHR from SNP and select the corresponding GENE_NAME.

library(dplyr)
library(purrr)

SNP %>%
   mutate(GENE_NAME = map2_chr(POS, CHR, function(x, y) {
       inds = x >= GENE$START & x <= GENE$END & y == GENE$CHR
      if (any(inds)) GENE$GENE_NAME[which.max(inds)] else NA
 }))

#  CHR POS REF ALT GENE_NAME
#1   1   5   C   T     Gene1
#2   1  10   G   A      <NA>
#3   2   5   G   T      <NA>
#4   2  15   C   A     Gene4
#5   2  20   T   C      <NA>
#6   3  10   A   G     Gene6
#7   3  20   C   T      <NA>

In base R, this can be written using mapply

mapply(function(x, y) {
   inds = x >= GENE$START & x <= GENE$END & y == GENE$CHR
   if (any(inds)) GENE$GENE_NAME[which.max(inds)] else NA
}, SNP$POS, SNP$CHR)

#[1] "Gene1" NA      NA      "Gene4" NA      "Gene6" NA

data

SNP <- structure(list(CHR = c(1L, 1L, 2L, 2L, 2L, 3L, 3L), POS = c(5L, 
10L, 5L, 15L, 20L, 10L, 20L), REF = c("C", "G", "G", "C", "T", 
"A", "C"), ALT = c("T", "A", "T", "A", "C", "G", "T")), class = 
"data.frame", row.names = c(NA, -7L))

GENE <- structure(list(CHR = c(1L, 1L, 1L, 2L, 2L, 3L), START = c(2L, 
12L, 25L, 10L, 25L, 5L), END = c(8L, 20L, 30L, 18L, 35L, 15L), 
GENE_NAME = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", 
"Gene6")), class = "data.frame", row.names = c(NA, -6L))

Thanks to all that helped answering this problem. I accepted this answer because I think it will be the most scalable to the large dataframes I'm working with. — nad7wf, Jun 05 '19 at 17:16

score 3 · Answer 2 · answered Jun 05 '19 at 04:21

Here's one way with dplyr. You simply expand gene dataframe based on START and END and then left_join with snp -

snp %>% 
  left_join(
    gene %>% 
      group_by(CHR, START, END) %>% 
      mutate(
        POS = list(seq(START, END))
      ) %>% 
      unnest(),
    by = c("CHR", "POS")
  ) %>% 
  select(CHR, POS, REF, ALT, GENE_NAME)

  CHR POS REF ALT GENE_NAME
1   1   5   C   T     Gene1
2   1  10   G   A      <NA>
3   2   5   G   T      <NA>
4   2  15   C   A     Gene4
5   2  20   T   C      <NA>
6   3  10   A   G     Gene6
7   3  20   C   T      <NA>

akrun · Answer 3 · 2019-06-05T04:58:44.310

Here is one option using non-equi join with data.table

library(data.table)
setDT(snp)[gene, GENE_NAME := GENE_NAME, on = .(CHR, POS >= START, POS <= END)]
snp
#   CHR POS REF ALT GENE_NAME
#1:   1   5   C   T     Gene1
#2:   1  10   G   A      <NA>
#3:   2   5   G   T      <NA>
#4:   2  15   C   A     Gene4
#5:   2  20   T   C      <NA>
#6:   3  10   A   G     Gene6
#7:   3  20   C   T      <NA>

Or with fuzzyjoin

library(fuzzyjoin)
library(dplyr)
fuzzy_left_join(snp, gene, by = c("CHR", "POS" = "START",
   "POS" = "END"), match_fun = list(`==`, `>=`, `<=`)) %>% 
   select(CHR = CHR.x, POS, REF, ALT, GENE_NAME)
#  CHR POS REF ALT GENE_NAME
#1   1   5   C   T     Gene1
#2   1  10   G   A      <NA>
#3   2   5   G   T      <NA>
#4   2  15   C   A     Gene4
#5   2  20   T   C      <NA>
#6   3  10   A   G     Gene6
#7   3  20   C   T      <NA>

Or an option with rap

library(rap)
snp %>% 
   rap(GENE_NAME = ~ filter(gene, CHR ==  !!CHR, START <= POS, END >= POS) %>% 
          pull(GENE_NAME)) %>% 
   mutate(GENE_NAME = replace(GENE_NAME, !lengths(GENE_NAME), NA)) %>% 
   unnest
#  CHR POS REF ALT GENE_NAME
#1   1   5   C   T     Gene1
#2   1  10   G   A      <NA>
#3   2   5   G   T      <NA>
#4   2  15   C   A     Gene4
#5   2  20   T   C      <NA>
#6   3  10   A   G     Gene6
#7   3  20   C   T      <NA>

data

snp <- structure(list(CHR = c(1L, 1L, 2L, 2L, 2L, 3L, 3L), POS = c(5L, 
10L, 5L, 15L, 20L, 10L, 20L), REF = c("C", "G", "G", "C", "T", 
"A", "C"), ALT = c("T", "A", "T", "A", "C", "G", "T")),
  class = "data.frame", row.names = c(NA, 
-7L))

gene <- structure(list(CHR = c(1L, 1L, 1L, 2L, 2L, 3L), START = c(2L, 
12L, 25L, 10L, 25L, 5L), END = c(8L, 20L, 30L, 18L, 35L, 15L), 
    GENE_NAME = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", 
    "Gene6")), class = "data.frame", row.names = c(NA, -6L))

Assign gene names from another dataframe based on SNP position and gene start/end coordinates

3 Answers3

data