-4

I have a dataset of genes associated with drugs derived from DrugBank. I wish to simply translate all the drugbank IDs to drug names readable by a human. As you can see my main problem is that some genes are linked to multiple or even hundreds of drugs. the multiple Drug IDs are in the same comma-delimited "column" The R studio "match" or "merge function" only work for the first identifier in each column, thus effectively deleting the remainder in the same column "cell". I have found ways to do this manually on excel for my top candidates but it is not realistic for my dataset of 3000 genes.

Ideally, I would like to do something like "text to columns" but in rows, so every row would keep all its other values but only one of the multiple drugbank IDs in the cell, and then can just use the match function to replace them.

The drugbank vocabulary (.csv) looks like this: [DBvocabulary.csv]

DrugBank.ID Common.name

DB00001 Lepirudin

DB00002 Cetuximab

DB00003 Dornase alfa

DB00004 Denileukin diftitox

DB00005 Etanercept

DB00006 Bivalirudin

My dataset (.csv) has 15 columns but the important ones are:

[all_ph_active.csv]

Gene.Name DrugBank.ID

F8 DB09130

TCN2 DB00200

LDLR DB09270; DB11251; DB14003

ALB DB00070; DB00137; DB00159; DB00162; DB00214;

Any advice is welcome, thanks in advance!

Aris77
  • 1
  • 2
  • can you update with the expected output – akrun Apr 03 '21 at 20:12
  • 1
    Please include a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of your data using ``dput()``. – user438383 Apr 03 '21 at 20:28
  • 1
    Please see below how Joel has provided data for `Translation` and `df` which can be copied and used by anybody. Try to provide data in such reproducible format so that it is easier and faster to help. Read about [how to give a reproducible example](http://stackoverflow.com/questions/5963269) . – Ronak Shah Apr 04 '21 at 02:53
  • Thank you Ronak, yes I saw and managed to use it. Will format my question better next time thanks – Aris77 Apr 04 '21 at 11:37

1 Answers1

0

One way to do this would be to join the name column to the original dataframe.

I have provided a small example below

library(tidyverse)

Translation <- tribble(~"ID", ~"Name",
                 "I001", "name1",
                 "I002", "name2",
                 "I003", "name3",
                 "I004", "name4",)


df <- tribble(~"ID",
              "I001",
              "I001",
              "I004",
              "I004",
              "I002",
              "I002",
              "I001",
              "I002",
              "I003",
              "I003",
              "I004",
              "I002",
              "I001"
              )
                  
right_join(df, Translation, by=c("ID" ="ID"))
#> # A tibble: 13 x 2
#>    ID    Name 
#>    <chr> <chr>
#>  1 I001  name1
#>  2 I001  name1
#>  3 I004  name4
#>  4 I004  name4
#>  5 I002  name2
#>  6 I002  name2
#>  7 I001  name1
#>  8 I002  name2
#>  9 I003  name3
#> 10 I003  name3
#> 11 I004  name4
#> 12 I002  name2
#> 13 I001  name1

Created on 2021-04-03 by the reprex package (v2.0.0)

This example doesn't however account for the multiple potential names provided. One way to get around this would be to create multiple entries for each drug temporarily as in the following example and then format the names in the original format.

One assumption I have made is that the drugs are listed within a character array with a semicolon followed by a space as the delimiter every time. Please correct me on this and I will update the code accordingly:

library(tidyverse)

Translation <- tribble(~"ID", ~"Name",
                 "I001", "name1",
                 "I002", "name2",
                 "I003", "name3",
                 "I004", "name4",)


df <- tribble(~"ID",
              "I001",
              "I001",
              "I004; I002",
              "I004",
              "I002",
              "I002",
              "I001; I003",
              "I002",
              "I003",
              "I003",
              "I004",
              "I002; I001",
              "I001"
              )
                  
df_with_uniqueID <- df %>% 
  #Creates unique identifier for each row
  mutate(uniqueNum = 1: length(df$ID)) 

# Replace IDs in characters with array of IDs
df_with_uniqueID$ID <- strsplit(df_with_uniqueID$ID, split = "; ")

# Give each ID its own column
unnest(df_with_uniqueID, cols = c(ID)) %>% 
  #right_join the results
  right_join(Translation, by = c("ID" = "ID")) %>% 
  #reduce the additional columns
  nest(cols = c(ID, Name)) %>% 
  # Convert the array of names to a single string
  mutate(names = map(cols, function(x) paste(x$Name, collapse = "; "))) %>% 
  # Unnest our strings to a column
  unnest(names) %>% 
  # Remove the column we no longer need
  select(-cols)
#> # A tibble: 13 x 2
#>    uniqueNum names       
#>        <int> <chr>       
#>  1         1 name1       
#>  2         2 name1       
#>  3         3 name4; name2
#>  4         4 name4       
#>  5         5 name2       
#>  6         6 name2       
#>  7         7 name1; name3
#>  8         8 name2       
#>  9         9 name3       
#> 10        10 name3       
#> 11        11 name4       
#> 12        12 name2; name1
#> 13        13 name1

Created on 2021-04-03 by the reprex package (v2.0.0)

Joel Kandiah
  • 1,465
  • 5
  • 15
  • Thank you @Joel for taking the time to answer this, that was EXACTLY what I needed along with tips along the way. I just had to convert my data to characters with 'DB %>% mutate_if(is.factor, as.character)' before the 'strsplit' – Aris77 Apr 04 '21 at 00:53