Remove parentheses and text within from strings in R

Question

In R, I have a list of companies such as:

companies  <-  data.frame(Name=c("Company A Inc (COMPA)","Company B (BEELINE)", "Company C Inc. (Coco)", "Company D Inc.", "Company E"))

I want to remove the parentheses and the text within them, ending up with the following list:

                  Name
1        Company A Inc 
2            Company B
3       Company C Inc.
4       Company D Inc.
5            Company E

One approach I tried was to split the string and then use ldply:

companies$Name <- as.character(companies$Name)
c<-strsplit(companies$Name, "\\(")
ldply(c)

But because not all company names have parentheses portions, it fails:

Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) : 
  Results do not have equal lengths

I'm not married to the strsplit solution. Whatever removes that text and the parentheses would be fine.

Also see `bracketX` in the `qdap` package. – Tyler Rinker Jun 11 '14 at 23:24

MrFlick · Accepted Answer · 2022-04-19T17:52:48.113

95

A gsub should work here

gsub("\\s*\\([^\\)]+\\)","",as.character(companies$Name))
# or using "raw" strings as of R 4.0
gsub(r"{\s*\([^\)]+\)}","",as.character(companies$Name))

# [1] "Company A Inc"  "Company B"      "Company C Inc."
# [4] "Company D Inc." "Company E"

Here we just replace occurrences of "(...)" with nothing (also removing any leading space). R makes it look worse than it is because of all the escaping we have to do for the parentheses, since they are special characters in regular expressions.
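To store the cleaned names back in the data frame, the same call can be assigned to the column. A minimal sketch using the data from the question; `stringsAsFactors = TRUE` is an assumption made here only so that the column starts out as a factor, as it would have in R versions before 4.0:

```r
companies <- data.frame(
  Name = c("Company A Inc (COMPA)", "Company B (BEELINE)",
           "Company C Inc. (Coco)", "Company D Inc.", "Company E"),
  stringsAsFactors = TRUE  # assumption: mimic a factor column
)

# \\s*      optional whitespace before the opening parenthesis
# \\(       a literal "("
# [^\\)]+   one or more characters that are not ")"
# \\)       a literal ")"
companies$Name <- gsub("\\s*\\([^\\)]+\\)", "", as.character(companies$Name))

companies$Name
```

After the assignment the column is a plain character vector rather than a factor.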

edited Apr 19 '22 at 17:52

answered Jun 11 '14 at 21:56 by MrFlick

Why did you use `[^\\)]+` between the parentheses? – rrs Jun 11 '14 at 22:08
@rrs I wanted to match all non closing parenthesis characters. I think a non-greedy `.*?` would work as well but if I know the only thing that can end my match block I like to use that explicitly. – MrFlick Jun 11 '14 at 22:19
**NOTE**: To make sure only those parentheses that are at the end of string are removed, use `gsub("\\s*\\([^\\)]+\\)\\s*$","",as.character(companies$Name))` – Wiktor Stribiżew Sep 01 '17 at 12:11
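The differences discussed in these comments only show up when a string contains more than one parenthesised part. A small sketch with a made-up name (the input string is hypothetical, and `perl = TRUE` is used throughout just to keep the regex flavour the same across the four calls):

```r
x <- "Acme (US) Holdings (ACME)"  # hypothetical name with two parenthesised parts

# Greedy .* runs from the first "(" to the last ")"
greedy   <- gsub("\\s*\\(.*\\)", "", x, perl = TRUE)        # "Acme"

# Non-greedy .*? stops at the first ")" it can
lazy     <- gsub("\\s*\\(.*?\\)", "", x, perl = TRUE)       # "Acme Holdings"

# Negated class: cannot run past a ")" at all
negated  <- gsub("\\s*\\([^)]+\\)", "", x, perl = TRUE)     # "Acme Holdings"

# Anchored with $: only a trailing "(...)" is removed
end_only <- gsub("\\s*\\([^)]+\\)\\s*$", "", x, perl = TRUE) # "Acme (US) Holdings"
```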

score 22 · Answer 2 · edited Nov 12 '20 at 04:48

You could use stringr::str_replace. It's nice because it accepts factor variables.

companies <- data.frame(Name=c("Company A Inc (COMPA)","Company B (BEELINE)", 
                               "Company C Inc. (Coco)", "Company D Inc.", 
                               "Company E"))

library(stringr)
str_replace(companies$Name, " \\s*\\([^\\)]+\\)", "")
# [1] "Company A Inc"  "Company B"      "Company C Inc." 
# [4] "Company D Inc." "Company E"

And if you still want to use strsplit, you could do

companies$Name <- as.character(companies$Name)
unlist(strsplit(companies$Name, " \\(.*\\)"))
# [1] "Company A Inc"  "Company B"      "Company C Inc."
# [4] "Company D Inc." "Company E"
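Note that `unlist(strsplit(...))` only lines up with the input when every string yields exactly one piece, which is the same unequal-lengths issue the question ran into with `ldply`. A sketch of the problem and one possible workaround, using a hypothetical name where text follows the parentheses:

```r
# Hypothetical input: the first name has text after the parentheses
x <- c("Company A (COMPA) Ltd", "Company B")

parts <- strsplit(x, " \\(.*\\)")

# The first string is split into two pieces, so unlist() returns 3 elements
n_pieces <- length(unlist(parts))

# Keeping only the first piece of each element preserves the input length
first_piece <- vapply(parts, function(p) p[1], character(1))
first_piece
```

This keeps only the text before the parentheses; if the trailing text matters, the `gsub` approaches are a better fit.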

edited Nov 12 '20 at 04:48 by Gregor Thomas

answered Jun 12 '14 at 00:15 by Rich Scriven

same idea using `stringi`: `stringi::stri_replace(companies$Name, regex = " \\s*\\([^\\)]+\\)", "")` – user63230 Sep 01 '22 at 16:55
just to make it a bit shorter, `str_remove(companies$Name, " \\s*\\([^\\)]+\\)")` – Steve Powell Jul 13 '23 at 03:57

akrun · Answer 3 · 2014-07-16T11:28:22.120

10

You could also use:

library(qdap)
companies$Name <-  genX(companies$Name, " (", ")")

companies
        Name
1  Company A Inc
2       CompanyB
3 Company C Inc.
4 Company D Inc.
5       CompanyE

edited Jul 16 '14 at 11:28

answered Jun 12 '14 at 01:45 by akrun

This code does not leave any space after value in () is removed. May I know if you have any solution for that? I'd like to leave single space – Zahra Hnn Oct 28 '18 at 13:54
@ZahraHnn If you check the code it is `" ("` Try with `"("` Not sure about your case though without a reproducible example – akrun Oct 28 '18 at 17:28
that won't work, actually I want to remove emojis which are like ; using: genX(mytext$text, "<", ">"), in cases which there's no space between text and emoji, the result will be unsatisfactory. eg,considering this text "I was soto see you" , the user used emoji with no space and when I removed the <><><> results is like "... soto see …" , but I was expecting "... so to see … " – Zahra Hnn Oct 29 '18 at 00:46

score 8 · Answer 4 · answered Dec 09 '20 at 22:13

If the parentheses are paired and balanced, you can use

gsub("\\s*(\\([^()]*(?:(?1)[^()]*)*\\))", "", x, perl=TRUE)

Applied to the data from the question:

companies  <-  data.frame(Name=c("Company A Inc (COMPA)","Company B (BEELINE)", "Company C Inc. (Coco)", "Company D Inc.", "Company E"))
gsub("\\s*(\\([^()]*(?:(?1)[^()]*)*\\))", "", companies$Name, perl=TRUE)

Output:

[1] "Company A Inc"  "Company B"      "Company C Inc." "Company D Inc."
[5] "Company E"

Regex details

\s* - zero or more whitespaces
(\([^()]*(?:(?1)[^()]*)*\)) - Capturing group 1 (required to recurse the pattern part between parentheses):
- \( - a ( char
- [^()]* - zero or more chars other than ( and )
- (?:(?1)[^()]*)* - zero or more occurrences of the whole Group 1 pattern ((?1) is a regex subroutine recursing Group 1 pattern) and then zero or more chars other than ( and )
- \) - a ) char.
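The recursion only makes a difference when parentheses are nested. A short sketch comparing it with the simpler pattern from the accepted answer, on a hypothetical name that is not part of the question's data:

```r
x <- "Company A (Holdings (UK)) Ltd"  # hypothetical name with nested parentheses

# Simple pattern: stops at the first ")", leaving a stray ")" behind
simple <- gsub("\\s*\\([^\\)]+\\)", "", x)

# Recursive pattern: (?1) re-enters group 1 for each nested "(...)"
recursive <- gsub("\\s*(\\([^()]*(?:(?1)[^()]*)*\\))", "", x, perl = TRUE)

simple     # "Company A) Ltd"
recursive  # "Company A Ltd"
```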

GKi · Answer 5 · 2021-09-13T06:59:57.143

In your case you get the desired result when you remove everything starting from the first (.

sub(" \\(.*", "", companies$Name)
#[1] "Company A Inc"  "Company B"      "Company C Inc." "Company D Inc." "Company E"

To remove parentheses and the text within them from a string, you can use:

sub("\\(.*)", "", c("ab (cd) ef", "(ij) kl"))
#[1] "ab  ef" " kl"

If there is more than one pair of parentheses:

gsub("\\(.*?)", "", c("ab (cd) ef (gh)", "(ij) kl"))
#[1] "ab  ef " " kl"

( needs to be escaped as \\(, . matches any character, * means repeated 0 to n times, and ? makes the match non-greedy, so that not everything from the first ( to the last ) is removed.

As an alternative, you can use [^)], which means any character except ).

sub("\\([^)]*)", "", c("ab (cd) ef", "(ij) kl"))
#[1] "ab  ef" " kl"   

gsub("\\([^)]*)", "", c("ab (cd) ef (gh)", "(ij) kl"))
#[1] "ab  ef " " kl"
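The results above keep the whitespace that surrounded the removed text (a double space in the middle, or a leading or trailing space). One possible way to tidy that up, sketched here as an extra step that is not part of the original answer, is to collapse runs of whitespace and trim the ends:

```r
x <- c("ab (cd) ef (gh)", "(ij) kl")

# Step 1: remove every "(...)" group
removed <- gsub("\\([^)]*\\)", "", x)

# Step 2: collapse repeated whitespace to one space, then trim both ends
tidy <- trimws(gsub("\\s+", " ", removed))

tidy  # "ab ef" "kl"
```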

If there are nested parentheses:

gsub("\\(([^()]|(?R))*\\)", "", c("ab ((cd) ef) gh (ij)", "(ij) kl"), perl=TRUE)
#[1] "ab  gh " " kl"

Here (?R) recurses into the whole pattern. For example, a(?R)?z is a recursion that matches one or more letters a followed by exactly the same number of letters z.
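A small sketch of the counting behaviour described above. It uses `(?1)` (recurse into group 1) rather than `(?R)` so that the `^` and `$` anchors stay outside the recursion, and it makes the recursive call optional with `?` so that the recursion can end. Both are choices made for this illustration, and the test strings are made up:

```r
x <- c("az", "aazz", "aaazzz", "aaz", "azz")

# Group 1 is: an "a", optionally group 1 again, then a "z"
balanced <- grepl("^(a(?1)?z)$", x, perl = TRUE)

balanced  # TRUE TRUE TRUE FALSE FALSE
```

The parentheses pattern works the same way: every opening ( consumed inside the recursion must be closed by its own ) before the outer match can finish.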

score 4 · Answer 6 · answered Feb 17 '19 at 01:46

library(qdap)
bracketX(companies$Name) -> companies$Name

answered Feb 17 '19 at 01:46 by Thushara Dulam

Could you please explain your answer? – tshimkus Feb 17 '19 at 02:52
bracketX - Apply bracket removal to character vectors. – SanMelkote Jul 06 '23 at 09:59

score 0 · Answer 7 · answered Jun 18 '20 at 08:31

Another gsub solution: replace the term in the parens preceded by an optional space by "", i.e. empty string

gsub("(\\s*\\(\\w+\\))", "", companies$Name)

[1] "Company A Inc"  "Company B"      "Company C Inc." "Company D Inc."
[5] "Company E"
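One caveat with `\\w+`: it only matches letters, digits and underscores, so parentheses containing spaces or punctuation are left in place. A brief sketch, where the second name is a hypothetical addition to the question's data:

```r
x <- c("Company B (BEELINE)", "Company F (F & Co)")  # second name is hypothetical

# \\w+ cannot cross the spaces or the "&", so the second string is unchanged
word_only <- gsub("(\\s*\\(\\w+\\))", "", x)

# A negated class accepts anything up to the closing parenthesis
any_text <- gsub("\\s*\\([^)]+\\)", "", x)

word_only  # "Company B" "Company F (F & Co)"
any_text   # "Company B" "Company F"
```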

answered Jun 18 '20 at 08:31 by Eyayaw
