I have a dataframe that contains genes coded as P95, P104, etc. The number is the gene's position in a gene name list.

The gene names list contains 2000 entries.

How can I replace the P## codes in the dataframe with the corresponding gene names?

UPD: here is an example dataframe and a gene names list:

gene <- c("(P10->UP)", "(P2->UP, P9->UP)", "(P10->UP, P3->UP)", "(P5->NORM, P7->UP)")
support <- c(0.95, 0.94, 0.93, 0.92)
df <- data.frame(gene, support)

gene_list <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")

P10 corresponds to the 10th gene "J", P2 is "B", etc.

In the result I want, each P## code should be replaced by the corresponding gene name.

Yulia Kentieva
  • Hi. A question: should P96 from the first line be the 96th element of your gene names list? – pbraeutigm Jan 04 '22 at 14:32
  • Can you post code to make this dataframe? I'm curious what the type of the `gene` column is. Is it just character, i.e. strings like "(P96->UP)", or is it something different? – gss Jan 04 '22 at 15:39
  • Another question: what shape / file format is the raw data that you read in? Although I don't work with DNA data myself, I believe there are specific file formats typically used for this kind of data and, more importantly, packages that might read them more efficiently. – tjebo Jan 04 '22 at 21:25
  • related: https://stackoverflow.com/questions/4350440/split-data-frame-string-column-into-multiple-columns, and here I particularly recommend this answer https://stackoverflow.com/a/47060452/7941188 which allows splitting a column into n (unknown) columns – tjebo Jan 04 '22 at 21:26

1 Answer

One way might be to use mutate (dplyr) and separate (tidyr), both of which are loaded with the tidyverse package.

separate can't split a column into an unknown number of columns, so I first had to calculate the maximum number of genes in any row of the gene column (max_genes).
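As a minimal sketch of that counting step on its own: each gene entry contains exactly one "->", so counting arrows per row gives the number of genes per row. This uses stringr's str_count rather than the str_extract_all call in the code below; the variable name n_per_row is only illustrative.

```r
library(stringr)

gene <- c("(P10->UP)", "(P2->UP, P9->UP)", "(P10->UP, P3->UP)", "(P5->NORM, P7->UP)")

# one "->" per gene entry, so this is the number of genes in each row
n_per_row <- str_count(gene, "->")

# the widest row determines how many name/expression column pairs are needed
max_genes <- max(n_per_row)
max_genes
```

With the example data this should give the same max_genes as the approach used in the code below.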

Data

gene <- c("(P10->UP)", "(P2->UP, P9->UP)", "(P10->UP, P3->UP)", "(P5->NORM, P7->UP)")
support <- c(0.95, 0.94, 0.93, 0.92)
df <- data.frame(gene, support)

gene_list <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")

Code

library(tidyverse)  # loads dplyr, tidyr and stringr

# calculate the max number of genes per row in column gene (one "->" per gene)
max_genes <- ncol(str_extract_all(df$gene, "->", simplify = TRUE))

df %>%
  # remove parentheses and whitespace in column gene
  mutate(gene = str_remove_all(gene, "[()\\s]")) %>%
  # separate gene into name and expression
  separate(col = gene,
           sep = "->|,",
           into = paste0(c("gene_name", "exp"),
                         rep(1:max_genes, each = 2)),
           fill = "right") %>%
  # substitute gene number with gene name
  mutate(across(starts_with("gene_name"),
                ~ gene_list[as.numeric(str_remove(.x, "P"))]))

Output

  gene_name1 exp1 gene_name2 exp2 support
1          J   UP       <NA> <NA>    0.95
2          B   UP          I   UP    0.94
3          J   UP          C   UP    0.93
4          E NORM          G   UP    0.92
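
If the original string layout of the gene column should be kept instead of splitting it into columns, the codes could also be replaced in place. The following is only a sketch in base R, assuming every code has the form "P" followed by digits and that every number is a valid index into gene_list; the names m and to_name are illustrative.

```r
gene <- c("(P10->UP)", "(P2->UP, P9->UP)", "(P10->UP, P3->UP)", "(P5->NORM, P7->UP)")
support <- c(0.95, 0.94, 0.93, 0.92)
df <- data.frame(gene, support)

gene_list <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")

# locate every P## code in every row
m <- gregexpr("P[0-9]+", df$gene)

# turn a vector of codes such as c("P2", "P9") into the matching gene names
to_name <- function(codes) gene_list[as.integer(sub("^P", "", codes))]

# overwrite each matched code with its gene name, keeping the rest of the string
regmatches(df$gene, m) <- lapply(regmatches(df$gene, m), to_name)

df$gene
```

For the example data this turns "(P2->UP, P9->UP)" into "(B->UP, I->UP)". Codes whose number is larger than the length of gene_list are not handled here and would need a separate check.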
tamtam