Adding a new column of gene names to a dataframe based on searching from a lookup table

Question

I am very new to working within R and spent most of yesterday trying to figure this out from other questions and guides here without getting very far.

I have a table of RNAseq expression values with their transcript_id (eg TraesCS5B02G127900.1). I want to add a new column to this table from searching a lookup table for the gene_id within the transcript_id (eg TraesCS5B02G127900) and adds the respective Gene.name (eg ENA1) and gene_id in new columns.

Here is part of my expression table (contains many more rows):

          transcript_id sigma_sq_pmax   iqr failed_ise 
1  TraesCS1B02G306500.1  9.259216e-01  TRUE      FALSE
2  TraesCS1B02G433800.1  0.000000e+00 FALSE      FALSE
3  TraesCS2A02G284700.1  0.000000e+00 FALSE      FALSE
4  TraesCS2A02G336600.1  1.582308e-02 FALSE      FALSE
5  TraesCS2A02G390800.1  2.236139e-01 FALSE      FALSE
6  TraesCS2A02G391000.1  4.609996e-01  TRUE      FALSE
7  TraesCS2A02G391000.2  4.378863e+00 FALSE      FALSE
8  TraesCS2A02G391000.3  0.000000e+00 FALSE      FALSE
9  TraesCS2A02G410400.1  0.000000e+00 FALSE      FALSE
10 TraesCS2A02G424200.2  3.831211e-01  TRUE      FALSE

Here is part of my look up table which has my gene_id and gene.names (contains many more rows again):

           Gene.name            gene_id
1               ENA1 TraesCS5B02G127900
2               ENA1 TraesCS5D02G136800
3               ENA1 TraesCS5A02G129300
4  ENA2/OS10G0434900 TraesCS7A02G427700
5  ENA2/OS10G0434900 TraesCS4A02G414400
6  ENA2/OS10G0434900 TraesCS4B02G315100
7  ENA2/OS10G0434900 TraesCS4D02G332100
8  ENA2/OS10G0434900 TraesCS4B02G336400
9  ENA2/OS10G0434900 TraesCS7D02G420100
10 ENA2/OS10G0434900 TraesCS7B02G327900

I want to make a new table that adds the Gene.name and gene_id column onto the expression table by searching the transcript_id which includes the gene_id inside it.

For example this is what I want it to look like:

   Gene.name            gene_id            target_id sigma_sq_pmax   iqr failed_ise
1     OsZIP5 TraesCS1B02G306500 TraesCS1B02G306500.1    0.92592155  TRUE      FALSE
2    OsABCC1 TraesCS1B02G433800 TraesCS1B02G433800.1    0.00000000 FALSE      FALSE
3     OsYSL6 TraesCS2A02G284700 TraesCS2A02G284700.1    0.00000000 FALSE      FALSE
4     OsVIT1 TraesCS2A02G336600 TraesCS2A02G336600.1    0.01582308 FALSE      FALSE
5    OsYSL16 TraesCS2A02G390800 TraesCS2A02G390800.1    0.22361394 FALSE      FALSE
6     OsYSL9 TraesCS2A02G391000 TraesCS2A02G391000.1    0.46099961  TRUE      FALSE
7     OsYSL9 TraesCS2A02G391000 TraesCS2A02G391000.2    4.37886316 FALSE      FALSE
8     OsYSL9 TraesCS2A02G391000 TraesCS2A02G391000.3    0.00000000 FALSE      FALSE
9     OsHMA5 TraesCS2A02G410400 TraesCS2A02G410400.1    0.00000000 FALSE      FALSE
10    OsZIP3 TraesCS2A02G424200 TraesCS2A02G424200.2    0.38312110  TRUE      FALSE

I'd really appreciate any help on this, thank you :)

If I am not wrong, there is no value as `OsZIP5` in either of the lookup or expression table! So, where does the column `Gene.name` come from? — massisenergy, Apr 18 '20 at 16:01

score 1 · Accepted Answer · answered Apr 18 '20 at 17:36

A tidyverse approach may look like so. First step involves separating transcript_id in a column containing the gene_id and the no following the . using tidyr::separate. In the second step you can join the expression and look up tables by gene_id using dplyr::left_join. Not however, that the Gene.name column contains only NAs as there where not matches in the given example data.

library(dplyr)
library(tidyr)

expression_table %>% 
  tidyr::separate(transcript_id, into = c("gene_id", "no"), sep = "\\.", remove = FALSE) %>% 
  dplyr::left_join(lookup_table)
#> Joining, by = c("row", "gene_id")
#> Warning: Column `gene_id` joining character vector and factor, coercing into
#> character vector
#>    row        transcript_id            gene_id no sigma_sq_pmax   iqr
#> 1    1 TraesCS1B02G306500.1 TraesCS1B02G306500  1    0.92592160  TRUE
#> 2    2 TraesCS1B02G433800.1 TraesCS1B02G433800  1    0.00000000 FALSE
#> 3    3 TraesCS2A02G284700.1 TraesCS2A02G284700  1    0.00000000 FALSE
#> 4    4 TraesCS2A02G336600.1 TraesCS2A02G336600  1    0.01582308 FALSE
#> 5    5 TraesCS2A02G390800.1 TraesCS2A02G390800  1    0.22361390 FALSE
#> 6    6 TraesCS2A02G391000.1 TraesCS2A02G391000  1    0.46099960  TRUE
#> 7    7 TraesCS2A02G391000.2 TraesCS2A02G391000  2    4.37886300 FALSE
#> 8    8 TraesCS2A02G391000.3 TraesCS2A02G391000  3    0.00000000 FALSE
#> 9    9 TraesCS2A02G410400.1 TraesCS2A02G410400  1    0.00000000 FALSE
#> 10  10 TraesCS2A02G424200.2 TraesCS2A02G424200  2    0.38312110  TRUE
#>    failed_ise Gene.name
#> 1       FALSE      <NA>
#> 2       FALSE      <NA>
#> 3       FALSE      <NA>
#> 4       FALSE      <NA>
#> 5       FALSE      <NA>
#> 6       FALSE      <NA>
#> 7       FALSE      <NA>
#> 8       FALSE      <NA>
#> 9       FALSE      <NA>
#> 10      FALSE      <NA>

^{Created on 2020-04-18 by the reprex package (v0.3.0)}

This is exactly what I was looking for, thank you so much! I will go and read up more about tidyr and dplyr :) — can-ask, Apr 20 '20 at 11:31

Adding a new column of gene names to a dataframe based on searching from a lookup table

1 Answers1