How can I remove extra spaces, special characters, and unwanted text from a list of country names in R?

Question

I have a list that I extracted from a table so that I can clean up the data and add it back as a new, clean column. The column originally contained country names and codes, along with some special characters ("*"):

> dput(head(country.names, 10))
c(" United States (USA)", " China (CHN)", " Japan (JPN)*", " Great Britain (GBR)", 
" ROC (ROC)", " Australia (AUS)", " Netherlands (NED)", " France (FRA)", 
" Germany (GER)", " Italy (ITA)")

So far, I have this code working to remove the codes in parentheses and the special characters (which might not be the easiest way to do it), however the last line isn't removing the spaces:

> name <- gsub("\\([^\\)]*\\)", "", country.names) %>% 
+   gsub("\\*", "", .) %>%
+   gsub("^[[:space:]]+|$[[space:]]+", "", .)

(I also tried gsub("^ | $", "", .) and trimws(name, which = "both") to remove spaces without luck)

This is a sample of the output I have using this code:

 [1] " United States " " China " " Japan " " Great Britain " " ROC " " Australia " " Netherlands "
 [8] " France " " Germany " " Italy " " Canada " " Brazil " " New Zealand " " Cuba "

I had previously tried `trimws` as well. I've updated my question to reflect that. Thank you for your comment! — data_life, Nov 08 '21 at 05:42
the input is more important, add `dput(country.names)` to your question — rawr, Nov 08 '21 at 06:04
both the second gsub and the trimws versions work as expected for me. note that the second gsub removes only one white space from the start and end of the string while the `trimws` removes any amount. based on your output, running `trimws` or `gsub` again would get all the white space, but since that is your last step anyway, i'm not sure why you still see the whitespace based on the example you provided — rawr, Nov 08 '21 at 06:31
i would do something like this `trimws(gsub('\\(.*', '', country.names))` but could be too brute force for some of your strings — rawr, Nov 08 '21 at 06:33
Thanks for your help! I actually just used `trimws` and added in the `whitespace = "[\\h\\v]"` argument and it worked! — data_life, Nov 08 '21 at 06:41

Wiktor Stribiżew · Accepted Answer · 2021-11-08T08:46:32.577

You must be having issues due to the Unicode whitespace chars in your input.
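One way to check this, as a minimal sketch: inspect the code points of a suspect string with utf8ToInt(). The sample string below is hypothetical and assumes the stray character is a no-break space (U+00A0); the actual data may contain a different one.

```r
# Hypothetical sample: "China" wrapped in no-break spaces (U+00A0),
# which print like ordinary spaces but are a different character.
x <- "\u00a0China\u00a0"

# Code point of the first character: 160 (U+00A0), not 32 (ASCII space).
first_cp <- utf8ToInt(x)[1]

# The default trimws() class is "[ \t\r\n]", so U+00A0 is left in place.
default_trim <- trimws(x)

# "[\\h\\v]" (mentioned in ?trimws) also covers Unicode horizontal and
# vertical whitespace, so the no-break spaces are stripped.
unicode_trim <- trimws(x, whitespace = "[\\h\\v]")

print(first_cp)
print(nchar(default_trim))  # still 7 characters
print(unicode_trim)
```

If the first code point is anything other than 32, the default trimws() call will not touch it, which would match the behaviour described in the question.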

You can use

trimws(gsub("\\([^()]*\\)|[^[:alpha:][:space:]]", "", country.names))
# => [1] "United States" "China"         "Japan"     
#    [4] "Great Britain" "ROC"           "Australia"   
#    [7] "Netherlands"   "France"        "Germany"       "Italy"

The regex matches

\([^()]*\) - any substring between the closest pair of parentheses
| - or
[^[:alpha:][:space:]] - any char other than a letter or whitespace (this is not fully Unicode-aware, which is why it also removes all unusual whitespace).

Hence, only regular ASCII whitespace is kept, so trimws without any additional arguments works fine.
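To see what each alternative contributes, the pattern can be split into two passes. This is only an illustrative sketch on two values taken from the question's sample input; the single combined call above does the same work in one pass.

```r
# Two values from the question's sample input (plain ASCII spaces).
x <- c(" Japan (JPN)*", " Great Britain (GBR)")

# First alternative: drop each "(...)" group.
no_codes <- gsub("\\([^()]*\\)", "", x)

# Second alternative: drop anything that is not a letter or whitespace,
# which removes the "*".
letters_only <- gsub("[^[:alpha:][:space:]]", "", no_codes)

# Only ordinary leading/trailing spaces remain, so plain trimws() is enough.
cleaned <- trimws(letters_only)
print(cleaned)
```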

If the country names can contain accented letters, you will have to use a Unicode-aware PCRE regex:

trimws(gsub("(*UCP)\\([^()]*\\)|[^\\p{L}\\s]", "", country.names, perl=TRUE), whitespace="[\\p{Z}\t]")

Here, [^\p{L}\s] (with (*UCP) PCRE flag) matches any char but a Unicode letter or whitespace and [\p{Z}\t] matches any Unicode whitespace.
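As a rough check of the Unicode-aware version, here is a sketch on two made-up inputs (they are not from the question's data) that combine accented letters with no-break spaces (U+00A0). It assumes R's PCRE build has Unicode property support, which is the usual case.

```r
# Hypothetical inputs: accented letters plus no-break spaces (U+00A0).
x <- c("\u00a0Cura\u00e7ao (CUW)*\u00a0", " S\u00e3o Tom\u00e9 (STP)")

# (*UCP) makes \s Unicode-aware; \p{L} matches any Unicode letter,
# so the accented letters survive while "(...)" and "*" are removed.
stripped <- gsub("(*UCP)\\([^()]*\\)|[^\\p{L}\\s]", "", x, perl = TRUE)

# Trim Unicode separators (and tabs) from both ends.
cleaned <- trimws(stripped, whitespace = "[\\p{Z}\t]")
print(cleaned)
```

Note that this pattern removes every non-letter, so punctuation inside a name (an apostrophe or a hyphen, for example) would be dropped as well.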
