3

How can I remove all characters and digits up to and including the last "_"? For example, given the vector char below, I want the result shown after it:

> char <- c("SRR04_d3_GCTCGGTAAGCACCTCGCCACATA","SRR04_d1_ACTCGGTAAGCACCTCGCCACATA",
+           "JH-HL_GCTCGGTAAGCATGTCGCCACATA","HZ04_d5_GCTCGGTAAGCACCTCGCCACATA")
> c("GCTCGGTAAGCACCTCGCCACATA","ACTCGGTAAGCACCTCGCCACATA",
+           "GCTCGGTAAGCATGTCGCCACATA","GCTCGGTAAGCACCTCGCCACATA")
[1] "GCTCGGTAAGCACCTCGCCACATA" "ACTCGGTAAGCACCTCGCCACATA" "GCTCGGTAAGCATGTCGCCACATA"
[4] "GCTCGGTAAGCACCTCGCCACATA"

Can I do this with the str_replace function from the tidyverse?

Z. Zhang

6 Answers

2

You can do this with sub. Because .* is greedy, the pattern matches everything up to and including the last underscore:

sub('.*_', '', char)

#[1] "GCTCGGTAAGCACCTCGCCACATA" "ACTCGGTAAGCACCTCGCCACATA"
#[3] "GCTCGGTAAGCATGTCGCCACATA" "GCTCGGTAAGCACCTCGCCACATA"

Or, if you prefer stringr functions:

stringr::str_remove(char, '.*_')
stringr::str_replace(char, '.*_', '')
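Since the sample data contains strings with two underscores, it may help to see how the greedy pattern compares with one that stops at the first underscore. The following is a minimal sketch, not part of the original answer; the names after_last and after_first are chosen here for illustration only.

```r
# Sample input from the question.
char <- c("SRR04_d3_GCTCGGTAAGCACCTCGCCACATA", "SRR04_d1_ACTCGGTAAGCACCTCGCCACATA",
          "JH-HL_GCTCGGTAAGCATGTCGCCACATA", "HZ04_d5_GCTCGGTAAGCACCTCGCCACATA")

# Greedy: '.*_' runs to the LAST underscore, so only the final field remains.
after_last <- sub(".*_", "", char)

# '^[^_]*_' cannot cross an underscore, so it stops at the FIRST one.
after_first <- sub("^[^_]*_", "", char)

after_last[1]   # "GCTCGGTAAGCACCTCGCCACATA"
after_first[1]  # "d3_GCTCGGTAAGCACCTCGCCACATA"
```

For strings with a single underscore, such as the third element, both patterns give the same result.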
Ronak Shah
  • 1
    This also removes the _ itself, and I'm not 100% sure the OP wanted that, since he said "before". If not, the '' in sub or str_replace can be '_'. The sample data also contains strings of the form "sometext_moretext_lasttext": does the OP want moretext_lasttext or just lasttext? Check that you get what you expect. – CALUM Polwart Oct 10 '21 at 07:11
2

In base R, you can also use strsplit and sapply to split on the underscore and keep the last piece:

sapply(strsplit(char, '_'), tail, n = 1)
#[1] "GCTCGGTAAGCACCTCGCCACATA" "ACTCGGTAAGCACCTCGCCACATA"
#[3] "GCTCGGTAAGCATGTCGCCACATA" "GCTCGGTAAGCACCTCGCCACATA"
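A variant of the split-and-take-last idea, sketched here with vapply so that the result type is checked. This is an illustration rather than the answerer's code; parts and last_field are names introduced for the example, and fixed = TRUE is used only because the separator is a literal character.

```r
# Sample input from the question.
char <- c("SRR04_d3_GCTCGGTAAGCACCTCGCCACATA", "SRR04_d1_ACTCGGTAAGCACCTCGCCACATA",
          "JH-HL_GCTCGGTAAGCATGTCGCCACATA", "HZ04_d5_GCTCGGTAAGCACCTCGCCACATA")

# Split each string on the literal underscore; fixed = TRUE skips regex parsing.
parts <- strsplit(char, "_", fixed = TRUE)

# Keep the last piece of each split. vapply stops with an error if any
# element does not yield exactly one character value (e.g. an empty string).
last_field <- vapply(parts, function(p) p[length(p)], character(1))
last_field
```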
U13-Forward
2

Here is an alternative way:

library(stringr)
str_replace_all(char, ".*_(?=[^:]+$)", "")

output:

[1] "GCTCGGTAAGCACCTCGCCACATA" "ACTCGGTAAGCACCTCGCCACATA" "GCTCGGTAAGCATGTCGCCACATA"
[4] "GCTCGGTAAGCACCTCGCCACATA"
TarJae
2

We can use trimws from base R (the whitespace argument requires R 3.6.0 or later):

trimws(char, whitespace = ".*_")
[1] "GCTCGGTAAGCACCTCGCCACATA" "ACTCGGTAAGCACCTCGCCACATA" 
[3] "GCTCGGTAAGCATGTCGCCACATA" "GCTCGGTAAGCACCTCGCCACATA"
akrun
1

The package stringr can be used to extract all the letters at the end of the string with:

library(stringr)
str_extract(char, "[[:alpha:]]*$")
# [1] "GCTCGGTAAGCACCTCGCCACATA" "ACTCGGTAAGCACCTCGCCACATA" "GCTCGGTAAGCATGTCGCCACATA"
# [4] "GCTCGGTAAGCACCTCGCCACATA"
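The same "trailing letters" extraction can be sketched in base R with regexpr and regmatches. This assumes, as in the sample data, that every string ends in at least one letter; regmatches silently drops elements that have no match. The name tail_letters is introduced here for illustration.

```r
# Sample input from the question.
char <- c("SRR04_d3_GCTCGGTAAGCACCTCGCCACATA", "SRR04_d1_ACTCGGTAAGCACCTCGCCACATA",
          "JH-HL_GCTCGGTAAGCATGTCGCCACATA", "HZ04_d5_GCTCGGTAAGCACCTCGCCACATA")

# Locate the run of letters that ends each string, then pull out the matches.
m <- regexpr("[[:alpha:]]+$", char)
tail_letters <- regmatches(char, m)
tail_letters
```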
Miff
1

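I would phrase your problem using gsub with the pattern [^\W_]+_. This targets one or more alphanumeric characters followed by an underscore, and gsub removes every such match. Note that a hyphen is not alphanumeric, so check the result for inputs such as "JH-HL_...".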

char <- c("SRR04_d3_GCTCGGTAAGCACCTCGCCACATA","SRR04_d1_ACTCGGTAAGCACCTCGCCACATA",
      "JH-HL_GCTCGGTAAGCATGTCGCCACATA","HZ04_d5_GCTCGGTAAGCACCTCGCCACATA")
output <- gsub("[^\\W_]+_", "", char)
output

[1] "GCTCGGTAAGCACCTCGCCACATA" "ACTCGGTAAGCACCTCGCCACATA"
[3] "GCTCGGTAAGCATGTCGCCACATA" "GCTCGGTAAGCACCTCGCCACATA"
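The third input contains a hyphen, so it is worth checking how the character class treats it. The sketch below is an illustration, not the answerer's code: it runs the pattern with perl = TRUE, where [^\W_] means "alphanumeric", and compares it with the simpler class [^_], which is an alternative suggested here. The names perl_third and output_any are hypothetical.

```r
# Sample input from the question.
char <- c("SRR04_d3_GCTCGGTAAGCACCTCGCCACATA", "SRR04_d1_ACTCGGTAAGCACCTCGCCACATA",
          "JH-HL_GCTCGGTAAGCATGTCGCCACATA", "HZ04_d5_GCTCGGTAAGCACCTCGCCACATA")

# Under PCRE, [^\W_] matches letters and digits only, so the hyphen in
# "JH-HL_" blocks the match and only "HL_" is removed.
perl_third <- gsub("[^\\W_]+_", "", char[3], perl = TRUE)
perl_third  # "JH-GCTCGGTAAGCATGTCGCCACATA"

# [^_]+_ removes every run of non-underscore characters followed by "_",
# hyphens included.
output_any <- gsub("[^_]+_", "", char)
output_any
```

If prefixes may contain punctuation other than underscores, the [^_] class is the more forgiving choice.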
Tim Biegeleisen