3

I have a data frame that includes a column of messy strings. Each messy string includes the name of a single country somewhere in it. Here's a toy version:

df <- data.frame(string = c("Russia is cool (2015) ",
                            "I like - China",
                            "Stuff happens in North Korea"),
                 stringsAsFactors = FALSE)

Thanks to the countrycode package, I also have a second data set that includes two useful columns: one with regexes for country names (regex) and another with the associated country names (country.name). We can load this data set like this:

library(countrycode)
data(countrycode_data)

I would like to write code that uses the regular expressions in countrycode_data$regex to spot the country name in each row of df$string; associates that regex with the proper country name in countrycode_data$country.name; and, finally, writes that name to the relevant position in a new column, df$country. After performing this TBD operation, df would look like this:

                        string                                country
1       Russia is cool (2015)                      Russian Federation
2               I like - China                                  China
3 Stuff happens in North Korea Korea, Democratic People's Republic of

I can't quite wrap my head around how to do this. I have tried using various combinations of grepl, which, tolower, and %in%, but I'm getting the direction or dimensions (or both) wrong.

David Arenburg
ulfelder
  • I'm not seeing a `regex` column in the `countrycode_data` data frame?... EDIT, nevermind, I think I found it, called `country.name.en.regex`? – rosscova Feb 14 '17 at 20:38
  • The relevant column in `countrycode_data` should just be called `regex`. The associated column with proper names is `country.name`. – ulfelder Feb 14 '17 at 20:43
  • possibly something like this can help: http://stackoverflow.com/questions/21165256/r-merge-data-frames-allow-inexact-id-matching-e-g-with-additional-characters – Bulat Feb 14 '17 at 21:05
  • @ulfelder The regex column was renamed country.name.en.regex in version 0.19 of the package. I'm the countrycode author and cjyetman gives the correct answer below. countrycode should work out of the box for your use case, but you just ran into a known regex issue for North Korea. It should work for most other countries. – Vincent Feb 19 '17 at 11:58

5 Answers

2

This is exactly the purpose of the countrycode package, so there's no reason to recode this yourself. Just use it like this...

library(countrycode)
df <- data.frame(string = c("Russia is cool (2015) ", "I like - China",
                            "Stuff happens in North Korea"), stringsAsFactors = FALSE)

df$country.name <- countrycode(df$string, 'country.name', 'country.name')

Specifically, in this case it will not find an unambiguous match for "Stuff happens in North Korea", but that's actually a problem with the regexes for North Korea and South Korea (I opened an issue for that here: https://github.com/vincentarelbundock/countrycode/issues/139). Otherwise, what you want to do should work in principle.
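
Until that regex issue is resolved upstream, one crude stop-gap (just a sketch, not a countrycode feature; the hard-coded label follows the spelling shown in Vincent's answer below) is to overwrite the affected rows by hand:

# hand-patch the rows that mention North Korea, which the package's regex
# currently misses (a workaround, not part of countrycode)
nk <- grepl("north korea", df$string, ignore.case = TRUE)
df$country.name[nk] <- "Democratic People's Republic of Korea"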

(side note specifically to @ulfelder: a new version of countrycode was just released on CRAN, v0.19. The column names have changed a bit since we added new languages, so country.name is now country.name.en, and regex is now country.name.en.regex)
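
If you're not sure which column names your installed version carries, a quick way to check is:

# list the country-name-related columns in the installed countrycode version
grep("country.name", names(countrycode_data), value = TRUE)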

CJ Yetman
2

I am the countrycode maintainer. @cj-yetman gave the correct answer. The specific North Korea problem you encountered has now been fixed in the development version of countrycode on Github.

You can use countrycode directly to convert sentences to country names or codes:

> library(devtools)
> install_github('vincentarelbundock/countrycode')
> library(countrycode)
> df <- data.frame(string = c("Russia is cool (2015) ",
+                             "I like - China",
+                             "Stuff happens in North Korea"),
+                  stringsAsFactors = FALSE)
> df$iso3c = countrycode(df$string, 'country.name', 'country.name')
> df
                        string                                 iso3c
1       Russia is cool (2015)                     Russian Federation
2               I like - China                                 China
3 Stuff happens in North Korea Democratic People's Republic of Korea
Vincent
  • Thanks, @Vincent! In a way, I'm glad I got a more general answer before getting the `countrycode`-specific one, because this might come up for me again in situations where there isn't a package that solves the problem. – ulfelder Feb 20 '17 at 10:31
  • is there an efficient way to use `countrycode` to catch multiple country names in a single string? E.g., if I have the string "Reports of the Secretary-General on the Sudan and South Sudan" and I want to return a string like "Sudan; South Sudan"? I know how to do the collapsing. It's returning more than one match that stumps me. – ulfelder Feb 28 '17 at 15:33
  • Not out of the box with countrycode, but if you look at the internal code, the package already keeps track of multiple matches. You can just use the same code and catch `destination_list`. See here: https://github.com/vincentarelbundock/countrycode/blob/master/R/countrycode.R#L123 – Vincent Mar 01 '17 at 01:21
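
For anyone who would rather not dig into countrycode's internals, here is a rough base-R sketch of that multiple-match idea (the all_matches helper and the "; " separator are arbitrary choices, not countrycode features):

library(countrycode)
data(countrycode_data)

# keep only entries that actually have a regex
keep     <- !is.na(countrycode_data$country.name.en.regex)
regexes  <- countrycode_data$country.name.en.regex[keep]
names_en <- countrycode_data$country.name.en[keep]

# collapse every country whose regex fires into one "; "-separated string
all_matches <- function(x) {
  hit <- sapply(regexes, grepl, x = x, ignore.case = TRUE, perl = TRUE)
  paste(names_en[hit], collapse = "; ")
}

all_matches("Reports of the Secretary-General on the Sudan and South Sudan")
# returns the matching names joined with "; " (exactly which names match
# depends on the regexes shipped with your countrycode version)
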
1

Here's a working solution, but I'm referencing different column names in the countrycode_data frame, because they come up differently on my system. I've also resorted to a few *apply calls, which is probably not ideal. I'm sure you could vectorize a few of those, I'm just not sure how myself.

matches <- sapply( df$string, function( x ) {

    # find matches by running all regex strings (maybe could be vectorised?)
    find.match <- lapply( countrycode_data$country.name.en.regex, grep, x = x, ignore.case = TRUE, perl = TRUE )

    # note down which patterns came up with a match
    matches <- which( sapply( find.match, length ) > 0 )

    # now cull the matches list down to only those with a match
    find.match <- find.match[ sapply( find.match, length ) > 0 ]

    # get rid of NA matches (these come from the NA entries in the regex column)
    matches <- matches[ sapply( find.match, is.na ) == FALSE ]

    # now only return the value (reference to the match) if there is one (otherwise we get empty returns)
    ifelse( length( matches ) == 0, NA_integer_, matches )
} )

# now use the vector of references to match up country names
df$country <- countrycode_data$country.name.en[ matches ]

> df
                        string            country
1       Russia is cool (2015)  Russian Federation
2               I like - China              China
3 Stuff happens in North Korea               <NA>

NOTE: Your third string "Stuff happens in North Korea" should match to row 128 in the countrycode_data set, but it doesn't. I think the reason is that the regex there ( ^(?=.*democrat|people|north|d.*p.*.r).*\bkorea|dprk|korea.*(d.*p.*r) ) effectively requires "north" (and the other alternatives after the first) to appear at the very start of the string: the ^ anchors the lookahead at position 0, and only the .*democrat alternative has a .* in front of it. See what happens to the three text strings below:

grepl( "^(?=.*democrat|people|north|d.*p.*.r).*\\bkorea|dprk|korea.*(d.*p.*r)",
       c( "korea", "north korea", "aaa north korea" ),
       perl = TRUE, ignore.case = TRUE )
# [1] FALSE  TRUE FALSE
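
For what it's worth, one tentative tweak (only checked against these toy strings, and not the official fix tracked in the GitHub issue above) is to group the alternation behind a single .* inside the lookahead, so "north" may appear anywhere before "korea":

grepl( "^(?=.*(democrat|people|north|d.*p.*.r)).*\\bkorea|dprk|korea.*(d.*p.*r)",
       c( "korea", "north korea", "aaa north korea", "I like south korea" ),
       perl = TRUE, ignore.case = TRUE )
# [1] FALSE  TRUE  TRUE FALSE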
rosscova
1

I would go with a for loop in this case, but notably one that loops over the rows of the countrycode_data data.frame, since that only has some 200 rows, whereas the real-world original data might be orders of magnitude larger.

Because of the long column names, I first extract the two relevant columns of the country code data, dropping entries that have no regex:

patt <- countrycode_data$country.name.en.regex[!is.na(countrycode_data$country.name.en.regex)]
name <- countrycode_data$country.name.en[!is.na(countrycode_data$country.name.en.regex)]

Then we can loop to write the new column:

for(i in seq_along(patt)) {
  df$country[grepl(patt[i], df$string, ignore.case=TRUE, perl=TRUE)] <- name[i]
}

As others have pointed out, North Korea doesn't match with the regex specified in the country code data.

talat
  • Elegant, thank you. (And, as it happens, I actually get the desired result for "North Korea", too.) – ulfelder Feb 14 '17 at 21:47
  • Yes, good thinking. I was thinking the same using `stringi`, something like `which(sapply(countrycode_data$country.name.en.regex, stringi::stri_detect_regex, str = tolower(df$string)), arr.ind = TRUE)` (where `col` is the row index within `countrycode_data$country.name.en`) – David Arenburg Feb 14 '17 at 21:51
  • @DavidArenburg also a good alternative. In the end you have to make one (and only one) loop one way or another. stringi might speed up the regex matching noticeably (and could of course also be adopted in my approach). – talat Feb 14 '17 at 21:57
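
For completeness, here is David Arenburg's stringi idea from the comments spelled out as a rough sketch (it reuses patt and name from the answer above and adds stringi as an extra dependency):

library(stringi)

# rows of the matrix = rows of df, columns = regexes in patt
hit.mat <- sapply(patt, stri_detect_regex, str = tolower(df$string))
hits    <- which(hit.mat, arr.ind = TRUE)

# write the matched country names into the new column
df$country <- NA_character_
df$country[hits[, "row"]] <- name[hits[, "col"]]
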
0

Here is a possible solution with a cross-join (which will blow up your data):

library(countrycode)
data(countrycode_data)

library(data.table)
df <- data.table(string = c("Russia is cool (2015) ",
                            "I like - China",
                            "Stuff happens in North Korea"),
                 stringsAsFactors = FALSE)

# adding dummy for full cross-join merge
df$dummy <- 0L
country.dt <- data.table(countrycode_data[, c("country.name.en", "country.name.en.regex")])
country.dt$dummy <- 0L

# merging original data to countries to get all possible combinations
res.dt <- merge(df, country.dt, by ="dummy", all = TRUE, allow.cartesian = TRUE)

# there are cases with NA regex
res.dt <- res.dt[!is.na(country.name.en.regex)]

# find matches
res.dt[, match := grepl(country.name.en.regex, string, perl = T, ignore.case = T), by = 1:nrow(res.dt)]

# keep only the rows that matched
res.dt <- res.dt[match == TRUE, .(string, country.name.en)]
res.dt

#                    string    country.name.en
# 1:  Russia is cool (2015) Russian Federation
# 2:         I like - China              China
Bulat
  • Why cross-join if you're eventually just doing by-row operations? Could just do a simple `sapply` IMO. – David Arenburg Feb 14 '17 at 21:21
  • I agree, in this particular case it is not a very good solution, as the expected number of matches is low. But it can be useful for similar tasks otherwise. – Bulat Feb 14 '17 at 21:35
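
For comparison, the simple sapply that David Arenburg mentions above could look roughly like this (a sketch reusing country.dt from the answer; it returns the first matching name, or NA if nothing matches):

# for each string, return the first country whose regex matches (NA otherwise)
df$country <- sapply(df$string, function(x) {
  hit <- which(sapply(country.dt$country.name.en.regex,
                      grepl, x = x, ignore.case = TRUE, perl = TRUE))
  if (length(hit)) country.dt$country.name.en[hit[1]] else NA_character_
})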