The column names in my data frame look like "S156 B1-1 U500 (HTA-1 0).SST RMA gene.sst-rma-gene-full-Signal". I want to remove everything from the opening parenthesis onward (including the parenthesis itself).

I have read the topics "extract a substring in R according to a pattern" and "Getting a sub string from a vector of strings", but I am still stuck.

I have tried sub('(HTA-1 0).*','', colnames(data)), but the output is S156 B1-1 U500 (. How can I remove the parenthesis as well? Thanks

Seymoo

2 Answers

A good regular expression will handle this.

String =  "S156 B1-1 U500 (HTA-1 0).SST RMA gene.sst-rma-gene-full-Signal"
sub("(.*?)\\(.*", "\\1", String)
[1] "S156 B1-1 U500 "

Some detail:
The \\( part matches a literal open parenthesis. The period . matches any character, so .* means zero or more characters, as many as it takes to reach the parenthesis that follows. I used .*? because matching is "greedy" by default: it takes as much as possible, so .* would run up to the last open parenthesis in the string. Adding ? turns off the greediness, so the match stops at the first open parenthesis. The whole .*? part is wrapped in parentheses, (.*?), which makes it a capture group: whatever matches this part is stored and can be referred to as \1.
The .* after the parenthesis matches the rest of the string. The pattern therefore matches the entire string while saving the part before the parenthesis. Inside sub, the second argument is what replaces the matched text; I used \\1 to insert the captured group \1, so the whole string is replaced by the part before the parenthesis. The extra backslash is needed because backslash is the escape character in R strings, so a literal backslash has to be written as \\. Note that the result keeps the space before the parenthesis.
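
The difference between greedy and lazy matching only shows up when a string has more than one open parenthesis. A minimal sketch, using a made-up string x that is not from the question:

```r
# Hypothetical input with two opening parentheses, for illustration only
x <- "a (b) c (d) e"

# Lazy: the capture group stops at the first "("
lazy <- sub("(.*?)\\(.*", "\\1", x)

# Greedy: the capture group runs up to the last "("
greedy <- sub("(.*)\\(.*", "\\1", x)

print(lazy)    # [1] "a "
print(greedy)  # [1] "a (b) c "
```

For the column name in the question, which has a single open parenthesis, both patterns give the same result.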

G5W

The expected output is not clear. If we want to remove the substring after the ), then match the ) followed by any characters (.*) and replace the match with )

sub("\\).*", ")", str1)
#[1] "S156 B1-1 U500 (HTA-1 0)"

Or, if we want to remove everything starting from the (, match zero or more whitespace characters (\\s*) followed by ( and any remaining characters, and replace the match with an empty string ("")

sub("\\s*\\(.*", "", str1)
#[1] "S156 B1-1 U500"
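
Since sub is vectorized, the same call can probably be applied directly to the column names, as the question asks. A minimal sketch, assuming a small stand-in data frame; its contents and the second column name are invented for illustration:

```r
# Hypothetical data frame standing in for the asker's `data`
data <- data.frame(a = 1:2, b = 3:4)
colnames(data) <- c(
  "S156 B1-1 U500 (HTA-1 0).SST RMA gene.sst-rma-gene-full-Signal",
  "S157 B1-2 U500 (HTA-1 0).SST RMA gene.sst-rma-gene-full-Signal"  # invented name
)

# Drop optional whitespace, the "(" and everything after it in every name
colnames(data) <- sub("\\s*\\(.*", "", colnames(data))

print(colnames(data))
# [1] "S156 B1-1 U500" "S157 B1-2 U500"
```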

A faster alternative to the regex above is stri_replace from stringi

library(stringi)
stri_replace(str1, regex = "\\s*\\(.*", "")
#[1] "S156 B1-1 U500"

data

str1 <- "S156 B1-1 U500 (HTA-1 0).SST RMA gene.sst-rma-gene-full-Signal"
akrun