I have a data frame that includes a column of messy strings. Each messy string includes the name of a single country somewhere in it. Here's a toy version:
df <- data.frame(string = c("Russia is cool (2015) ",
                            "I like - China",
                            "Stuff happens in North Korea"),
                 stringsAsFactors = FALSE)
Thanks to the countrycode package, I also have a second data set that includes two useful columns: one with regexes for country names (regex) and another with the associated country name (country.name). We can load this data set like this:
library(countrycode)
data(countrycode_data)
I would like to write code that uses the regular expressions in countrycode_data$regex to spot the country name in each row of df$string; associates that regex with the proper country name in countrycode_data$country.name; and, finally, writes that name to the relevant position in a new column, df$country. After performing this TBD operation, df would look like this:
                        string                                country
1       Russia is cool (2015)                      Russian Federation
2               I like - China                                  China
3 Stuff happens in North Korea Korea, Democratic People's Republic of
I can't quite wrap my head around how to do this. I have tried using various combinations of grepl, which, tolower, and %in%, but I'm getting the direction or dimensions (or both) wrong.
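To make the intended direction concrete, here is a rough sketch of the shape I think the operation needs: for each string, test every regex, then look up the name that belongs to the first regex that hits. This is only a sketch under assumptions, not a working solution against the real data. The lookup table below is a hypothetical three-row stand-in for countrycode_data, and its regexes are simplified guesses rather than the package's actual patterns. I am also assuming case-insensitive matching and that taking the first hit is acceptable when more than one regex matches.

```r
# Hypothetical stand-in for countrycode_data: same two column names,
# but the regexes here are simplified guesses, not the package's patterns.
lookup <- data.frame(
  regex = c("russia", "china", "north.?korea"),
  country.name = c("Russian Federation",
                   "China",
                   "Korea, Democratic People's Republic of"),
  stringsAsFactors = FALSE
)

df <- data.frame(string = c("Russia is cool (2015) ",
                            "I like - China",
                            "Stuff happens in North Korea"),
                 stringsAsFactors = FALSE)

# Outer loop: one string at a time. Inner loop: every regex against that string.
df$country <- vapply(df$string, function(s) {
  # Logical vector, one element per regex, TRUE where the regex matches s.
  matched <- vapply(lookup$regex, grepl, logical(1),
                    x = s, ignore.case = TRUE, perl = TRUE)
  hits <- which(matched)
  # Assumption: keep the first match; return NA when nothing matches.
  if (length(hits) > 0) lookup$country.name[hits[1]] else NA_character_
}, character(1), USE.NAMES = FALSE)
```

If this is the right direction, swapping lookup for countrycode_data should be the only change needed, though I have not confirmed how the real regexes behave with these options.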