I am using the following two functions to find country names in a string, put the matched name into a new column of the data frame, and then delete the country name from the original string:
library("stringr")
ListofCountries <- read.table(file="https://raw.github.com/umpirsky/country-list/master/country/cldr/en/country.csv", header=TRUE, sep=",")
CoffeeTable <- data.frame(Product=c("Kenya Ndumberi", "Kenya Ndumberi", "Finca Nombre de Dios", "Finca La Providencia", "Las Penidas", "Las Penidas", "Las Penidas", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Progresso", "Progresso", "Progresso", "Progresso", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "\nEl Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Bufcafe Natural Sundried Microlot", "Bufcafe Natural Sundried Microlot", "Bufcafe Natural Sundried Microlot", "Geisha", "Geisha", "Geisha", "Pacamara", "Pacamara", "Pacamara", "Pacamara", "Bolivia", "Pacamara", "Bolivia", "Pacamara", "Bolivia", "Brazil yellow bourbon pea berry", "Finca El Vintilador", "\nWashed Yirgacheffe", "Finca El Vintilador", "Washed Yirgacheffe", "Washed Yirgacheffe", "Washed Yirgacheffe", "Leza", "Finca La Libertad", "Pacamara", "Pacamara", "Pacamara", "Finca La Bolsa", "Thunguri Kenya", "Thunguri Kenya", "Thunguri Kenya", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Pedregal", "Pedregal", "Barrel Aged", "Pedregal", "Barrel Aged", "Toarco Jaya Peaberry Sulawesi", "Amigo de Buesaco", "Amigo de Buesaco", "Amigo de Buesaco", "Barrel Aged", "Toarco Jaya Peaberry Sulawesi", "\nToarco Jaya Peaberry Sulawesi", "El Cypress", "El Cypress", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro"))
CoffeeTable$Country <- str_trim(str_match(tolower(CoffeeTable$Product),
                                          tolower(paste(ListofCountries, collapse="|")))[,1])
CoffeeTable$Product <- str_trim(gsub(tolower(paste(ListofCountries, collapse="|")), replacement="",
                                     CoffeeTable$Product, ignore.case=TRUE))
Problem 1 - this is very slow. How can I make these functions faster?
Problem 2 - this only catches the formal names of countries. Does anyone know a good list of common country names? (For example, 'China' vs. 'People's Republic of China'.)
Thanks!
EDIT: I have added a list of 90 coffee names above to make this a reproducible example. I want to add that in my actual application, CoffeeTable already exists and has ~2,000 rows and 45 columns, so I'm not looking for faster ways to construct the data.frame.
Thank you!
Edit 2: Problem 2 has been answered. Now I'm just trying to optimize the 2 functions so they don't take 5-10 seconds to run!
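To make the goal concrete, here is a minimal sketch of one direction that might help; it has not been benchmarked. The idea is to build the alternation pattern once, run it only over the unique product names, and map the results back to every row. The vectors `countries` and `products` and the data frame `result` are hypothetical stand-ins for the real data, not part of the code above. Note that `str_extract` keeps the original capitalization of the match (the `tolower` approach above returns it lowercased), and that country names containing regex metacharacters such as `.` or `(` would need escaping first.

```r
library(stringr)

# Hypothetical stand-ins for the real data (assumptions for illustration only)
countries <- c("Kenya", "Panama", "Bolivia", "Brazil")
products  <- c("Kenya Ndumberi", "Thunguri Kenya", "Bolivia", "Geisha",
               "Brazil yellow bourbon pea berry")

# Build the alternation once, as a case-insensitive pattern, and reuse it
pattern <- regex(paste(countries, collapse = "|"), ignore_case = TRUE)

# Run the regex on the unique product names only
uniq    <- unique(products)
country <- str_extract(uniq, pattern)               # NA where no country is found
cleaned <- str_trim(str_remove_all(uniq, pattern))  # removes every match, like gsub

# Map the per-unique-name results back to every original row
idx    <- match(products, uniq)
result <- data.frame(Product = cleaned[idx], Country = country[idx],
                     stringsAsFactors = FALSE)
```

With many repeated product names, as in the example data, the regex would run on far fewer strings than there are rows, which is where any saving would come from.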