I want to strip part of the row names in my data frame. From each row name I want to keep only the RNA type (one of the strings matched by the grepl pattern below) and remove everything else. How can I do this? My attempt so far:

df[grepl(".*lncRNA.*|.*snRNA.*|.*snoRNA.*|.*precursor_RNA.*", rownames(df))] <- c("lncRNA","snRNA","snoRNA","precursor_RNA")



head(rownames(df))

[3208] "URS000075AF9C-snoRNA_GTATGTGTGGACAGCACTGAGACTGAGTCT"                
[3209] "URS000075B029-snRNA_AACTCTGAGTCTTAAGCTAATTTTTTGAGGCCTTGTTCCGACA"    
[3210] "URS000075B029-snRNA_ATTTCCGTGGAGAGGAACAACTCTGAGTCTTAAGCTAATTT"      
[3211] "URS000075B0E3-lncRNA_GTAAGGGGCAGTAAG"                               
[3212] "URS000075B261-precursor_RNA_CTTTCTATGCTCCTGTTCTGC"                  
[3213] "URS000075B2ED-lncRNA_CACTCAGGACCCACC"

Desired output:

[3208] "snoRNA"                
[3209] "snRNA"    
[3210] "snRNA"      
[3211] "lncRNA"                               
[3212] "precursor_RNA"                  
[3213] "lncRNA" 
user2300940

2 Answers

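We can use gsub with a pattern that has two alternatives. The first matches one or more characters that are not a - ([^-]+) from the start (^) of the string, followed by a -. The second, after the or (|), matches an underscore followed by one or more characters that are not an underscore (_[^_]+) up to the end of the string ($). Whatever matches is replaced with an empty string ("").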

gsub("^[^-]+-|_[^_]+$", "", v1)
#[1] "snoRNA"        "snRNA"         "snRNA"         "lncRNA"       
#[5] "precursor_RNA" "lncRNA"  
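To see what each half of the alternation removes, the two branches can be applied one after the other. This is only an illustrative sketch: `ids`, `step1` and `step2` are names made up here, and `ids` holds two of the row names from the question so the block runs on its own.

```r
# Two of the example names from the question (copied so the block is self-contained)
ids <- c("URS000075AF9C-snoRNA_GTATGTGTGGACAGCACTGAGACTGAGTCT",
         "URS000075B261-precursor_RNA_CTTTCTATGCTCCTGTTCTGC")

# Branch 1: drop everything up to and including the first "-"
step1 <- sub("^[^-]+-", "", ids)

# Branch 2: drop the last "_" and everything after it.
# The underscore inside "precursor_RNA" survives, because "[^_]+$" must reach
# the end of the string without crossing another "_".
step2 <- sub("_[^_]+$", "", step1)

print(step2)
```

Doing both in a single gsub call, as above, gives the same result because gsub replaces every match of either branch.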

If we are doing this on the rownames

gsub("^[^-]+-|_[^_]+$", "", rownames(df))

data

v1 <- c("URS000075AF9C-snoRNA_GTATGTGTGGACAGCACTGAGACTGAGTCT",
        "URS000075B029-snRNA_AACTCTGAGTCTTAAGCTAATTTTTTGAGGCCTTGTTCCGACA",
        "URS000075B029-snRNA_ATTTCCGTGGAGAGGAACAACTCTGAGTCTTAAGCTAATTT",
        "URS000075B0E3-lncRNA_GTAAGGGGCAGTAAG",
        "URS000075B261-precursor_RNA_CTTTCTATGCTCCTGTTCTGC",
        "URS000075B2ED-lncRNA_CACTCAGGACCCACC")
akrun
    @csgillespie Thanks, I noticed that you and the other poster mentioned that already. So, I guess that would be sufficient. – akrun Oct 06 '16 at 13:38
Welcome to StackOverflow! You've done well with giving us some example input and output, but please consider providing a reproducible example to make it easier for us to help you.

In your case, I think you may be able to use sub or gsub, capture the RNA type in the middle of the string as a group, and refer to that group with \\1 in the replacement.

x <- c("URS000075AF9C-snoRNA_GTATGTGTGGACAGCACTGAGACTGAGTCT",                
"URS000075B029-snRNA_AACTCTGAGTCTTAAGCTAATTTTTTGAGGCCTTGTTCCGACA",    
"URS000075B029-snRNA_ATTTCCGTGGAGAGGAACAACTCTGAGTCTTAAGCTAATTT",      
"URS000075B0E3-lncRNA_GTAAGGGGCAGTAAG",                               
"URS000075B261-precursor_RNA_CTTTCTATGCTCCTGTTCTGC",                  
"URS000075B2ED-lncRNA_CACTCAGGACCCACC")

# replace the whole string with the captured group (i.e. the part of the regex in parentheses)
gsub("^.*(lncRNA|snRNA|snoRNA|precursor_RNA).*$", "\\1", x)
# [1] "snoRNA"        "snRNA"         "snRNA"         "lncRNA"       
# [5] "precursor_RNA" "lncRNA"   
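One thing to watch for: if a name contains none of the listed types, sub and gsub return it unchanged rather than failing. A rough way to spot such cases is sketched below; `x2`, `pattern` and `res` are names invented for this example, and the tRNA entry is a hypothetical name that is not in your data.

```r
x2 <- c("URS000075B0E3-lncRNA_GTAAGGGGCAGTAAG",
        "URS0000000000-tRNA_ACGT")  # hypothetical name with a type not in the pattern

pattern <- "^.*(lncRNA|snRNA|snoRNA|precursor_RNA).*$"
res <- sub(pattern, "\\1", x2)

# Elements the pattern did not match are left as-is by sub(), so mark them as NA
res[!grepl(pattern, x2)] <- NA
print(res)
```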

Rownames have to be unique though, so you may need to store the result in a column of your dataframe instead (or you could use make.unique() to make them unique, but I think saving the result as a column in your dataframe would make more sense).
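Both options could look roughly like this. The data frame here is a made-up stand-in (a hypothetical `count` column and shortened sequences), since I don't have your real `df`; `types` and `df2` are also just names for this sketch.

```r
# Hypothetical data frame; two of the row names reduce to the same type
df <- data.frame(count = c(10, 20, 30),
                 row.names = c("URS000075B029-snRNA_AACTCTGAG",
                               "URS000075B029-snRNA_ATTTCCGTG",
                               "URS000075B0E3-lncRNA_GTAAGGGGC"))

types <- gsub("^.*(lncRNA|snRNA|snoRNA|precursor_RNA).*$", "\\1", rownames(df))

# Option 1: keep the original row names and store the type in a column
df$type <- types

# Option 2: make the names unique first; repeated values get a numeric suffix
df2 <- df
rownames(df2) <- make.unique(types)
print(rownames(df2))
```

Assigning `types` directly with `rownames(df) <- types` would fail here with a duplicate row names error, which is why the column approach is usually the simpler choice.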

mathematical.coffee