61

In R, I have a list of companies such as:

companies  <-  data.frame(Name=c("Company A Inc (COMPA)","Company B (BEELINE)", "Company C Inc. (Coco)", "Company D Inc.", "Company E"))

I want to remove the parenthesized text, ending up with the following list:

                  Name
1        Company A Inc 
2            Company B
3       Company C Inc.
4       Company D Inc.
5            Company E

One approach I tried was to split the string and then use ldply:

companies$Name <- as.character(companies$Name)
c<-strsplit(companies$Name, "\\(")
ldply(c)

But because not all company names have a parenthesized portion, it fails:

Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) : 
  Results do not have equal lengths

I'm not married to the strsplit solution. Whatever removes that text and the parentheses would be fine.

Gregor Thomas
aiolias

7 Answers

95

A gsub should work here

gsub("\\s*\\([^\\)]+\\)","",as.character(companies$Name))
# or using "raw" strings as of R 4.0
gsub(r"{\s*\([^\)]+\)}","",as.character(companies$Name))

# [1] "Company A Inc"  "Company B"      "Company C Inc."
# [4] "Company D Inc." "Company E" 

Here we just replace occurrences of "(...)" with nothing (also removing any leading space). R makes it look worse than it is because of all the escaping we have to do for the parentheses, since they are special characters in regular expressions.
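As a minimal sketch of how the pattern behaves (the vector `x` and its second entry are made up for illustration, not taken from the question): the unanchored pattern strips every parenthesized group, while a variant anchored with `$` strips only a trailing one.

```r
# Hypothetical input; the second name has a parenthesized part mid-string.
x <- c("Company A Inc (COMPA)", "Company (Holdings) B (BEELINE)", "Company E")

# Remove every "(...)" group together with any whitespace before it.
all_removed <- gsub("\\s*\\([^)]+\\)", "", x)

# Remove a "(...)" group only when it ends the string.
trailing_removed <- gsub("\\s*\\([^)]+\\)\\s*$", "", x)

print(all_removed)
print(trailing_removed)
```

Which of the two is appropriate depends on whether parentheses can legitimately appear inside a name.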

MrFlick
  • 1
    Why did you use `[^\\)]+` between the parentheses? – rrs Jun 11 '14 at 22:08
  • @rrs I wanted to match all non closing parenthesis characters. I think a non-greedy `.*?` would work as well but if I know the only thing that can end my match block I like to use that explicitly. – MrFlick Jun 11 '14 at 22:19
  • 2
    **NOTE**: To make sure only those parentheses that are at the end of string are removed, use `gsub("\\s*\\([^\\)]+\\)\\s*$","",as.character(companies$Name))` – Wiktor Stribiżew Sep 01 '17 at 12:11
22

You could use stringr::str_replace. It's nice because it accepts factor variables.

companies <- data.frame(Name=c("Company A Inc (COMPA)","Company B (BEELINE)", 
                               "Company C Inc. (Coco)", "Company D Inc.", 
                               "Company E"))

library(stringr)
str_replace(companies$Name, " \\s*\\([^\\)]+\\)", "")
# [1] "Company A Inc"  "Company B"      "Company C Inc." 
# [4] "Company D Inc." "Company E"

And if you still want to use strsplit, you could do

companies$Name <- as.character(companies$Name)
unlist(strsplit(companies$Name, " \\(.*\\)"))
# [1] "Company A Inc"  "Company B"      "Company C Inc."
# [4] "Company D Inc." "Company E" 
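One caveat with the strsplit route, sketched here with a made-up name that has text after the parentheses: a split can return more than one piece per string, so unlist() may yield more elements than there are rows. Taking only the first piece per element keeps the lengths aligned, at the cost of dropping anything after the opening parenthesis. The names `pieces` and `first_piece` are illustrative.

```r
# Hypothetical input; the second name continues after the parenthesized part.
x <- c("Company B (BEELINE)", "Company F (EFF) Ltd", "Company E")

# Splitting on the whole "(...)" group gives two pieces for the second name.
pieces <- strsplit(x, " \\(.*\\)")
print(lengths(pieces))

# Split on the opening " (" and keep only the first piece of each element,
# so the result always has exactly one entry per input string.
first_piece <- vapply(strsplit(x, " \\("), `[`, character(1), 1,
                      USE.NAMES = FALSE)
print(first_piece)
```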
Gregor Thomas
Rich Scriven
10

You could also use:

library(qdap)
companies$Name <-  genX(companies$Name, " (", ")")

companies
        Name
1  Company A Inc
2       CompanyB
3 Company C Inc.
4 Company D Inc.
5       CompanyE
akrun
  • This code does not leave any space after value in () is removed. May I know if you have any solution for that? I'd like to leave single space – Zahra Hnn Oct 28 '18 at 13:54
  • @ZahraHnn If you check the code it is `" ("` Try with `"("` Not sure about your case though without a reproducible example – akrun Oct 28 '18 at 17:28
  • that won't work, actually I want to remove emojis which are like ; using: genX(mytext$text, "<", ">"), in cases which there's no space between text and emoji, the result will be unsatisfactory. eg,considering this text "I was soto see you" , the user used emoji with no space and when I removed the <><><> results is like "... soto see …" , but I was expecting "... so to see … " – Zahra Hnn Oct 29 '18 at 00:46
8

If the parentheses are paired and balanced, you can use

gsub("\\s*(\\([^()]*(?:(?1)[^()]*)*\\))", "", x, perl=TRUE)

R demo:

companies  <-  data.frame(Name=c("Company A Inc (COMPA)","Company B (BEELINE)", "Company C Inc. (Coco)", "Company D Inc.", "Company E"))
gsub("\\s*(\\([^()]*(?:(?1)[^()]*)*\\))", "", companies$Name, perl=TRUE)

Output:

[1] "Company A Inc"  "Company B"      "Company C Inc." "Company D Inc."
[5] "Company E"     

Regex details

  • \s* - zero or more whitespaces
  • (\([^()]*(?:(?1)[^()]*)*\)) - Capturing group 1 (required to recurse the pattern part between parentheses):
    • \( - a ( char
    • [^()]* - zero or more chars other than ( and )
    • (?:(?1)[^()]*)* - zero or more occurrences of the whole Group 1 pattern ((?1) is a regex subroutine recursing Group 1 pattern) and then zero or more chars other than ( and )
    • \) - a ) char.
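To see why the recursion matters, here is a small comparison on a made-up name with nested parentheses (the input vector is an assumption for illustration). The simple negated-class pattern stops at the first closing parenthesis, while the recursive pattern removes the whole balanced group.

```r
# Hypothetical input; the first name contains nested parentheses.
x <- c("Company A (Holdings (UK)) Ltd", "Company B (BEELINE)", "Company E")

# Non-recursive pattern: ends at the first ")" and leaves a stray ")".
simple <- gsub("\\s*\\([^)]+\\)", "", x)

# Recursive pattern from this answer: (?1) re-enters group 1 for inner pairs.
recursive <- gsub("\\s*(\\([^()]*(?:(?1)[^()]*)*\\))", "", x, perl = TRUE)

print(simple)
print(recursive)
```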
Wiktor Stribiżew
5

In your case, you get the desired result if you remove everything starting with (.

sub(" \\(.*", "", companies$Name)
#[1] "Company A Inc"  "Company B"      "Company C Inc." "Company D Inc." "Company E"     

To remove the parentheses and the text within them from a string, you can use:

sub("\\(.*)", "", c("ab (cd) ef", "(ij) kl"))
#[1] "ab  ef" " kl"   

If there is more than one pair of parentheses:

gsub("\\(.*?)", "", c("ab (cd) ef (gh)", "(ij) kl"))
#[1] "ab  ef " " kl"    

( needs to be escaped as \\(, . matches any character, * means repeated 0 to n times, and ? makes the match non-greedy, so that it stops at the first ) instead of running from the first ( to the last ).
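A quick sketch of the greedy versus non-greedy difference on the same kind of input, plus one possible way to tidy the leftover whitespace (the whitespace cleanup is an addition for illustration, not part of the approach above):

```r
x <- "ab (cd) ef (gh)"

# Greedy: .* runs from the first "(" to the last ")".
greedy <- gsub("\\(.*\\)", "", x)

# Non-greedy: .*? stops at the nearest ")", so each pair is removed separately.
lazy <- gsub("\\(.*?\\)", "", x)

# Optional cleanup: collapse runs of whitespace and trim the ends.
tidy <- trimws(gsub("\\s+", " ", lazy))

print(greedy)
print(lazy)
print(tidy)
```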

As an alternative, you can use [^)], which matches any character except ).

sub("\\([^)]*)", "", c("ab (cd) ef", "(ij) kl"))
#[1] "ab  ef" " kl"   

gsub("\\([^)]*)", "", c("ab (cd) ef (gh)", "(ij) kl"))
#[1] "ab  ef " " kl"    

If there are nested parentheses:

gsub("\\(([^()]|(?R))*\\)", "", c("ab ((cd) ef) gh (ij)", "(ij) kl"), perl=TRUE)
#[1] "ab  gh " " kl"

Here (?R) recurses the whole pattern. For example, a(?R)?z is a recursion that matches one or more letters a followed by exactly the same number of letters z.
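A small check of the "equal number of a and z" idea, as a sketch with made-up test strings. It uses group recursion (?1) inside anchors instead of (?R), so that ^ and $ are not themselves part of the recursed pattern; the optional ? after the recursion is what lets the innermost level stop.

```r
# Hypothetical test strings: the first three are balanced, the last two are not.
s <- c("az", "aazz", "aaazzz", "aaz", "azz")

# Group 1 is "a", optionally group 1 again, then "z"; anchors force a full match.
balanced <- grepl("^(a(?1)?z)$", s, perl = TRUE)

print(setNames(balanced, s))
```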

GKi
4
library(qdap)
bracketX(companies$Name) -> companies$Name
0

Another gsub solution: replace the parenthesized term, together with any whitespace before it, with "" (the empty string):

gsub("(\\s*\\(\\w+\\))", "", companies$Name)

[1] "Company A Inc"  "Company B"      "Company C Inc." "Company D Inc."
[5] "Company E" 
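One thing to keep in mind with this pattern: \\w+ only matches letters, digits and underscores, so a parenthesized part containing spaces or punctuation is left alone. A short sketch with a made-up second name, compared against a negated character class:

```r
# Hypothetical input; the second name has spaces and punctuation in parentheses.
x <- c("Company B (BEELINE)", "Company F (F & Co.)")

# \\w+ requires the content to be word characters only.
word_only <- gsub("\\s*\\(\\w+\\)", "", x)

# [^)]+ accepts anything up to the closing parenthesis.
any_chars <- gsub("\\s*\\([^)]+\\)", "", x)

print(word_only)
print(any_chars)
```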
Eyayaw