
I have the following strings:

string <- c("Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)", "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"  )

In each string, I want to remove everything except the "middle" part, i.e. the region name between the – and the final parenthesis.

My expected result should look like this:

expected_string <- c("Latin America & Caribbean", "North America")

Can someone show me how to do this using gsub?

msh855
  • I think what you are looking for might be regular expressions. – RmbRT Mar 10 '19 at 22:13
  • Possible duplicate of [Remove parentheses and text within from strings in R](https://stackoverflow.com/questions/24173194/remove-parentheses-and-text-within-from-strings-in-r) – divibisan Mar 12 '19 at 00:52

2 Answers


You can do this with a regular expression. Based on the two examples, I identified two patterns: 1) remove everything up to and including the –, and 2) remove everything within parentheses ().

Here's one solution to do that:

string <- c("Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)", "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"  )
gsub("^.*\\s–\\s|\\s*\\([^\\)]+\\)", "", string)
#> [1] "Latin America & Caribbean" "North America"

Created on 2019-03-10 by the reprex package (v0.2.1)

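The first part of the regex, ^.*\\s–\\s, says "match all the characters from the start of the string up to and including the – and the whitespace on either side of it".

In regex, the | means OR, so the second part, \\s*\\([^\\)]+\\), matches any parenthesized text together with the whitespace in front of it. Because gsub replaces every match with the empty string, both pieces are removed. Credit to this question for that regex.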
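To make the two halves of the alternation easier to follow, here is a minimal sketch that applies them in two separate passes. The variable names step1 and step2 are only illustrative and are not part of the answer above.

```r
string <- c(
  "Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)",
  "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"
)

# Pass 1: drop everything from the start of the string through " – ".
step1 <- sub("^.*\\s–\\s", "", string)

# Pass 2: drop any "(...)" group and the whitespace in front of it.
step2 <- gsub("\\s*\\([^\\)]+\\)", "", step1)
step2
#> [1] "Latin America & Caribbean" "North America"

# The single gsub call with | gives the same result in one pass.
one_pass <- gsub("^.*\\s–\\s|\\s*\\([^\\)]+\\)", "", string)
identical(step2, one_pass)
#> [1] TRUE
```

Splitting the pattern like this is mainly useful for debugging: if the result is wrong, inspecting step1 shows which half failed to match.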

Chase
  • For some reason unknown to me, I re-ran the code today and the string `Trade` was not removed. I checked for unintentional changes in my code and found none. Do you think the problem has to do with R? – msh855 Mar 12 '19 at 23:07
  • @msh855 - very unlikely anything would have corrupted within R itself (as in I've used R for 10+ years and never experienced anything like that). Did your input `string` change? It's more probable that the format of `string` changed and the regex is no longer capturing the parts of the string. Can you re-upload `string`? – Chase Mar 13 '19 at 16:17
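
One way the input format could change is the separator itself, for example an en dash (–) turning into a plain hyphen (-). That is only an assumption; the thread does not say what actually changed. The sketch below uses a hypothetical changed input to show the symptom and a slightly more tolerant pattern that accepts either character.

```r
# Hypothetical input: the separator is a plain hyphen instead of an en dash.
changed <- "Trade (% of GDP) - North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"

# Check whether the en dash the original pattern relies on is present at all.
has_en_dash <- grepl("–", changed, fixed = TRUE)
has_en_dash
#> [1] FALSE

# The original pattern now only strips the parenthesized parts,
# so "Trade" survives.
original_result <- gsub("^.*\\s–\\s|\\s*\\([^\\)]+\\)", "", changed)
original_result
#> [1] "Trade - North America"

# A more tolerant variant: accept a hyphen or an en dash, as long as it
# has whitespace on both sides (so the hyphen in "ZS-XU" is not affected).
tolerant_result <- gsub("^.*\\s(-|–)\\s|\\s*\\([^\\)]+\\)", "", changed)
tolerant_result
#> [1] "North America"
```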

Another idea

trimws(sub(".*–([^\\(]+).*", "\\1", string))
# [1] "Latin America & Caribbean" "North America" 

This removes everything up to and including the –, as well as everything from the next opening bracket ( onward. A capture group isolates the desired output, and trimws removes leading and trailing whitespace.
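
As an illustration of why trimws is needed, this sketch shows what the capture group holds before trimming. The names captured, trimmed and no_match are only illustrative, and the last input is a hypothetical string without a separator.

```r
string <- c(
  "Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)",
  "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"
)

# The group ([^\\(]+) captures everything between the – and the next "(",
# including the spaces on either side of the region name.
captured <- sub(".*–([^\\(]+).*", "\\1", string)
captured
#> [1] " Latin America & Caribbean " " North America "

# trimws strips that leading and trailing whitespace.
trimmed <- trimws(captured)
trimmed
#> [1] "Latin America & Caribbean" "North America"

# If the pattern does not match, sub returns the input unchanged.
no_match <- sub(".*–([^\\(]+).*", "\\1", "No separator here")
no_match
#> [1] "No separator here"
```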

markus