Remove a string except words in specific position in R

Question

I have the following strings

string <- c("Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)", "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"  )

In these strings, I want to remove everything except the "middle" part.

My expected result should look like this:

expected_string <- c("Latin America & Caribbean", "North America")

Can someone show me how to do this using gsub?

I think what you are looking for might be regular expressions. — RmbRT, Mar 10 '19 at 22:13
Possible duplicate of [Remove parentheses and text within from strings in R](https://stackoverflow.com/questions/24173194/remove-parentheses-and-text-within-from-strings-in-r) — divibisan, Mar 12 '19 at 00:52

score 1 · Accepted Answer · answered Mar 10 '19 at 22:22

You can do this with a regular expression. Based on the two examples, I identified two patterns: 1) remove everything up to and including the en dash (–), and 2) remove everything within parentheses ().

Here's one solution to do that:

string <- c("Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)", "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"  )
gsub("^.*\\s–\\s|\\s*\\([^\\)]+\\)", "", string)
#> [1] "Latin America & Caribbean" "North America"

Created on 2019-03-10 by the reprex package (v0.2.1)

The first part of the regex, ^.*\\s–\\s, says "match every character from the start of the string through the en dash (–) and the whitespace on either side of it".

In regex, | means OR, so the second part, \\s*\\([^\\)]+\\), matches any text inside parentheses, together with the parentheses themselves and any whitespace before the opening one. Credit to this question for that regex.
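To see what each half of the alternation contributes, the two patterns can be run one after the other. This is only an illustrative sketch; the names step1 and step2 are made up for this example and do not come from the answer.

```r
string <- c(
  "Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)",
  "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"
)

# First alternative only: drop everything from the start of the string
# through the en dash and the whitespace on either side of it.
step1 <- gsub("^.*\\s–\\s", "", string)
step1
#> [1] "Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)"
#> [2] "North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"

# Second alternative only: drop each parenthesized group and the
# whitespace in front of it.
step2 <- gsub("\\s*\\([^\\)]+\\)", "", step1)
step2
#> [1] "Latin America & Caribbean" "North America"
```

Joining the two patterns with | lets a single gsub call do both removals in one pass.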

For some reason unknown to me, I re-ran the code today and the string `Trade` was not removed. I checked for unintentional changes in my code and found none. Do you think the problem has to do with R? — msh855, Mar 12 '19 at 23:07
@msh855 - very unlikely anything would have corrupted within R itself (as in I've used R for 10+ years and never experienced anything like that). Did your input `string` change? It's more probable that the format of `string` changed and the regex is no longer capturing the parts of the string. Can you re-upload `string`? — Chase, Mar 13 '19 at 16:17
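One way the input format could change is the en dash (–) arriving as a plain hyphen (-). This is an assumption for illustration only; the thread never confirms the cause. The sketch below uses hypothetical inputs (labels) and a hypothetical, more tolerant pattern (tolerant) to show how that would leave Trade behind, and one way to guard against it.

```r
# Hypothetical inputs: the same label with an en dash ("\u2013") and with a
# plain hyphen. The escape keeps this script ASCII-only.
labels <- c(
  "Trade (% of GDP) \u2013 North America (WB/WDI/NE.TRD.GNFS.ZS-XU)",
  "Trade (% of GDP) - North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"
)

# Which elements contain the separator the original regex expects?
has_en_dash <- grepl("\\s\u2013\\s", labels)
has_en_dash
#> [1]  TRUE FALSE

# The original pattern only strips the parenthesized parts of the second label.
original <- "^.*\\s\u2013\\s|\\s*\\([^\\)]+\\)"
gsub(original, "", labels)
#> [1] "North America"         "Trade - North America"

# Assumed fix: accept a hyphen, en dash, or em dash ("\u2014") as separator.
tolerant <- "^.*\\s[-\u2013\u2014]\\s|\\s*\\([^\\)]+\\)"
cleaned <- gsub(tolerant, "", labels)
cleaned
#> [1] "North America" "North America"
```

Running grepl on the real input first is a quick way to confirm or rule out this explanation before changing the regex.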

markus · Answer 2 · 2019-03-10T22:28:23.617

1

Another idea

trimws(sub(".*–([^\\(]+).*", "\\1", string))
# [1] "Latin America & Caribbean" "North America"

Removes everything up to and including – as well as everything from the next opening parenthesis ( onward. We use a capture group to isolate the desired output. trimws removes leading and trailing whitespace.
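To make the role of trimws visible, the capture group's result can be inspected before trimming. This is a small sketch; the variable name raw is made up for this example.

```r
string <- c(
  "Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)",
  "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"
)

# "\\1" keeps only what the capture group matched: the text between the
# en dash and the next opening parenthesis, including the spaces around it.
raw <- sub(".*–([^\\(]+).*", "\\1", string)
raw
#> [1] " Latin America & Caribbean " " North America "

# trimws() strips those leading and trailing spaces.
trimws(raw)
#> [1] "Latin America & Caribbean" "North America"
```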

edited Mar 10 '19 at 22:28

answered Mar 10 '19 at 22:23

