I have a large list that contains the expressed genes from many cell lines. Ensembl gene IDs often come with version suffixes, and I need to remove them. I've found several references that describe this here or here, but they do not work for me, likely because of my data structure (I think it's a nested array within a list?). Can someone help me with the particulars of the code and with my understanding of my own data structures?

Here's some example data

>listOfGenes_version <- list("cellLine1" = c("ENSG001.1", "ENSG002.1", "ENSG003.1"), "cellLine2" = c("ENSG003.1", "ENSG004.1"))

>listOfGenes_version
$cellLine1
[1] "ENSG001.1" "ENSG002.1" "ENSG003.1"

$cellLine2
[1] "ENSG003.1" "ENSG004.1"

And what I would like to see is

>listOfGenes_trimmed
$cellLine1
[1] "ENSG001" "ENSG002" "ENSG003"

$cellLine2
[1] "ENSG003" "ENSG004"

Here are some things I tried that did not work:

>listOfGenes_trimmed <- str_replace(listOfGenes_version, pattern = ".[0-9]+$", replacement = "")      
Warning message:
In stri_replace_first_regex(string, pattern, fix_replacement(replacement),  :
  argument is not an atomic vector; coercing  

>listOfGenes_trimmed <- lapply(listOfGenes_version, gsub('\\..*', '', listOfGenes_version))
Error in match.fun(FUN) : 
  'gsub("\\..*", "", listOfGenes_version)' is not a function, character or symbol

Thanks so much!

strugglebus

1 Answer

One option is to match a literal . (a regex metacharacter, so it must be escaped as \\.) followed by one or more digits (\\d+) at the end of the string ($), and replace the match with an empty string (""):

lapply(listOfGenes_version,  sub, pattern = "\\.\\d+$", replacement = "")
#$cellLine1
#[1] "ENSG001" "ENSG002" "ENSG003"

#$cellLine2
#[1] "ENSG003" "ENSG004"
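
The call above works because lapply passes each list element to sub as its first unnamed argument, which ends up as x once pattern and replacement are supplied by name. As a rough sketch, the same thing can be written with an explicit anonymous function. This also shows what the second attempt in the question was missing: the second argument to lapply has to be a function, not the result of calling one. The example data is copied from the question.

```r
# Example data from the question: a named list of character vectors
listOfGenes_version <- list(
  cellLine1 = c("ENSG001.1", "ENSG002.1", "ENSG003.1"),
  cellLine2 = c("ENSG003.1", "ENSG004.1")
)

# lapply() calls the function once per list element; x is one character vector
listOfGenes_trimmed <- lapply(
  listOfGenes_version,
  function(x) sub("\\.\\d+$", "", x)
)

# Inspect the structure: still a list of character vectors, names preserved
str(listOfGenes_trimmed)
```

So the data is not a nested array; it is a plain named list in which every element is a character vector, and sub is vectorised over each of those vectors.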

The . is a regex metacharacter that matches any character. Because sub treats the pattern as a regular expression by default, the dot has to be escaped to match a literal ".".
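
To see why the escape matters, here is a small sketch comparing the two patterns. The second ID, which has no version suffix, is a made-up example and is not from the question's data.

```r
# Second ID has no version suffix (hypothetical, for illustration only)
ids <- c("ENSG001.1", "ENSG002")

# Unescaped: "." matches any character, so an ID without a suffix gets damaged
unescaped <- sub(".[0-9]+$", "", ids)

# Escaped: a literal dot is required, so an ID without a suffix is left alone
escaped <- sub("\\.\\d+$", "", ids)

print(unescaped)  # "ENSG001" "ENS"
print(escaped)    # "ENSG001" "ENSG002"
```

On the versioned IDs both patterns give the same result, which is why the unescaped form can look correct until an unversioned ID shows up.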

akrun