0

I am working on a project in R. I created a data frame for the table of all projects implemented by an Institutions. The data frame table includes a Country column with a name of the country in which the project is implemented

looks something like this with more than 20,000 rows

$ ProjectID                      <chr> "P163945", "P169561", "P171613", "P172627"…
$ Region                         <chr> "Africa West", "Africa East", "Africa West…
$ Country                        <chr> "Western Africa", "United Republic of Tanz…
$ PName                          <chr> "Investments towards Resilient Management …

I have also a second table which I scraped and it has country names as well but in shorter format

$ Rank                         <int> 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
$ `Country/region`             <chr> "Kenya", "Libya", "Dominica", "Ethiopia", "B…
$ `Real GDP growthrate (%)[1]` <chr> "1.9", "-66.7", "-8.8", "1.9", "3.8", "4.5",…

Now, I would like to rename the country names in the table 1 so they look identicaly like country names in table 2 (so ie. United Republic of Tanzania in table 1 becomes Tanzania from the column 2). I tried to use countrycode package but does not look like it can be helpful in my case. I would like to avoid manually renaming of more than 100 names. Once the names are the same in both columns I would like to use the SQL package in R to set the primary and foreign keys and join the data from the tables together. I will appreciate any advice!

  • Especially given what we have (i.e., only one complete `Country`), it's rather difficult to help you. I think the best method is to produce a frame mapping `Country` to `Country/region` in a new frame, and this is likely 100 rows long. Yes, this may need to be done programmatically, From there, you can `merge` your 20,000-row frame with this mapping frame and select the country column you need. (Refs on merge/join: https://stackoverflow.com/q/1299871/3358272, https://stackoverflow.com/a/6188334/3358272.) – r2evans Jun 08 '21 at 13:24
  • If this doesn't make sense, there's nothing I'll do with the samples we see here: please [edit] your question and paste the output from `dput(x)` for each frame, where `x` is just big enough to clearly provide an easy-to-use frame, but not so big as to completely clobber the page. There should be sufficient overlap in countries between the two frames, pls don't provide two samples with nothing in common. Thanks. – r2evans Jun 08 '21 at 13:25

2 Answers2

0

This is a string matching problem. Check out the stringdist package. The stringdistmatrix(a, b) function compares two vectors of strings.

So the strategy could be to calculate pairwise string distances and select the ones that indicate minimum distance.

dmat <- stringdistmatrix(table1$country, table2$country)
matched <- apply(dmat,1,which.min)
new_id <- table2$country[matched]

The new_id can then be added as column to table 1. A one-liner would be

table2$country[apply(stringdistmat(table1$country, table2$country), 1, which.min)]

You will need to check the result because ambiguities are likely (as in most string matching operations). But this method should narrow down the number of cases that need manual tweaking.

jkt
  • 946
  • 1
  • 7
  • 18
0

This is exactly the type of thing the countrycode package is intended for...

library(countrycode)

df1 <- 
  data.frame(
    ProjectID = c("P163945", "P169561", "P171613"),
    Region = c("Africa West", "Africa East", "Africa South"),
    Country = c(" Republic of Guinea-Bissau", "United Republic of Tanzania", "Republic of Madagascar")
  )

df2 <- 
  data.frame(
    Rank = c(1, 3, 4),
    `Country/region` = c("Tanzania", "Guinea-Bissau", "Madagascar"),
    `Real GDP growthrate (%)[1]` = c("1.9", "-66.7", "-8.8")
  )


df1$iso3c <- countrycode(df1$Country, "country.name", "iso3c")
df2$iso3c <- countrycode(df2$Country.region, "country.name", "iso3c")


dplyr::full_join(df1, df2, by = "iso3c")
#>   ProjectID       Region                     Country iso3c Rank Country.region
#> 1   P163945  Africa West   Republic of Guinea-Bissau   GNB    3  Guinea-Bissau
#> 2   P169561  Africa East United Republic of Tanzania   TZA    1       Tanzania
#> 3   P171613 Africa South      Republic of Madagascar   MDG    4     Madagascar
#>   Real.GDP.growthrate.....1.
#> 1                      -66.7
#> 2                        1.9
#> 3                       -8.8
CJ Yetman
  • 8,373
  • 2
  • 24
  • 56