
I am using the following two functions to find country names in a string, put the matched name into a new column of the dataframe, and then delete the country name from the original string:

library("stringr")

ListofCountries <- read.table(file="https://raw.github.com/umpirsky/country-list/master/country/cldr/en/country.csv",header=T,sep=",")

CoffeeTable <- data.frame(Product=c("Kenya Ndumberi", "Kenya Ndumberi", "Finca Nombre de Dios", "Finca La Providencia", "Las Penidas", "Las Penidas", "Las Penidas", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Progresso", "Progresso", "Progresso", "Progresso", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "\nEl Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Bufcafe Natural Sundried Microlot", "Bufcafe Natural Sundried Microlot", "Bufcafe Natural Sundried Microlot", "Geisha", "Geisha", "Geisha", "Pacamara", "Pacamara", "Pacamara", "Pacamara", "Bolivia", "Pacamara", "Bolivia", "Pacamara", "Bolivia", "Brazil yellow bourbon pea berry", "Finca El Vintilador", "\nWashed Yirgacheffe", "Finca El Vintilador", "Washed Yirgacheffe", "Washed Yirgacheffe", "Washed Yirgacheffe", "Leza", "Finca La Libertad", "Pacamara", "Pacamara", "Pacamara", "Finca La Bolsa", "Thunguri Kenya", "Thunguri Kenya", "Thunguri Kenya", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Pedregal", "Pedregal", "Barrel Aged", "Pedregal", "Barrel Aged", "Toarco Jaya Peaberry Sulawesi", "Amigo de Buesaco", "Amigo de Buesaco", "Amigo de Buesaco", "Barrel Aged", "Toarco Jaya Peaberry Sulawesi", "\nToarco Jaya Peaberry Sulawesi", "El Cypress", "El Cypress", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro"))

CoffeeTable$Country <- str_trim(str_match(tolower(CoffeeTable$Product), 
                                            tolower(paste(ListofCountries$name, collapse="|")))[,1])


CoffeeTable$Product <- str_trim(gsub(tolower(paste(ListofCountries$name, collapse="|")), replacement="", 
                          CoffeeTable$Product, ignore.case=T))

Problem 1 - this is very slow. How can I make these functions faster?

Problem 2 - this only catches formal names of countries. Does anyone know a good list of common country names? (For example, 'China' vs. 'The People's Republic of China'.)

Thanks!


EDIT: Here is a list of 90 coffee names to make this a reproducible example; I want to add that in my actual application, CoffeeTable already exists and has ~2,000 rows and 45 columns. I'm not looking for faster ways to construct the data.frame / etc.

Thank you!

Edit 2: Question 2 has been answered; now I'm just trying to optimize the two functions so they don't take 5-10 seconds to run!

JayCo
  • Hi, can you please provide a reproducible example? In your code above the CoffeeTable structure is undefined. – datawookie Dec 18 '13 at 03:46
  • To provide a reproducible data set, try using `dput` and pasting its output into your question. For example, try this `dput(head(countrycode_data,100))` and `dput(head(CoffeeTable,100))` for the first 100 rows of both tables. – Tommy O'Dell Dec 18 '13 at 05:46
  • Hi @TommyO'Dell, I have edited the question to make this a reproducible example - Thanks! – JayCo Dec 18 '13 at 17:28
  • Hi @exegetic, I have made this a reproducible example - Thank you! – JayCo Dec 18 '13 at 17:30
  • @JayCo - In your CoffeeTable example, you need to enclose the product names in quotes or R treats them like variable names. Did this code run for you?? – jlhoward Dec 18 '13 at 17:41
  • @jlhoward Oops! no, editing now – JayCo Dec 18 '13 at 20:12

2 Answers


For your second problem, there is an extensive list of options here. Try this:

countries <- read.table(file="https://raw.github.com/umpirsky/country-list/master/country/cldr/en/country.csv",header=T,sep=",")
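Once loaded, the table can be collapsed into a single alternation pattern. Here is a minimal sketch with a stand-in table; the display-name column is assumed to be `name`, matching the code further down in this thread, so adjust to whatever header the CSV actually uses:

```r
# Stand-in for the downloaded table; the real one has ~250 rows.
countries <- data.frame(name = c("Kenya", "Panama", "Bolivia"),
                        stringsAsFactors = FALSE)

# One regex alternation over all names, e.g. "kenya|panama|bolivia"
pattern <- tolower(paste(countries$name, collapse = "|"))

grepl(pattern, tolower("Kenya Ndumberi"))  # TRUE
grepl(pattern, tolower("Las Penidas"))     # FALSE
```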

Edit: in response to OP's comment.

Given the sample data you provided, and replicating it 25X to create about the same number of rows as in your actual data, your code runs in about 1.6 sec. It's difficult to believe there is an 8-fold difference between your system and mine, so there must be something else going on.

The only thing I can recommend is to look at strapplyc(...) in the gsubfn package. This is supposed to be extremely efficient, but on my system was actually slower than your code.

See code below for examples and benchmarks. Sorry I could not be of more help...

library(stringr)
df <- CoffeeTable
df$Product <- as.vector(df$Product)
df <- rbind(df,df,df,df,df)    # replicate 5X
df <- rbind(df,df,df,df,df)    # and again: 25X total, 2250 rows

pattern    <- tolower(paste(ListofCountries$name,collapse="|"))

f1 = function(df){
  df$Country <- str_trim(str_match(tolower(df$Product), pattern)[,1])
  df$Product <- str_trim(gsub(pattern, "",df$Product, ignore.case=T))
  return(df)
}

library(gsubfn)
library(tcltk2)
f2 = function(df){
  df$Country <- strapplyc(tolower(df$Product),pattern)
  df$Product <- str_trim(gsub(pattern,"", df$Product, ignore.case=T))
  return(df)
}

library(microbenchmark)
microbenchmark(df1<-f1(df),df2<-f2(df),times=10)
# Unit: seconds
#           expr      min       lq   median       uq      max neval
#  df1 <- f1(df) 1.365222 1.506017 1.611458 1.689611 1.722626    10
#  df2 <- f2(df) 2.006162 2.055963 2.148158 2.249707 2.285955    10
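For completeness, one more variant worth trying (a sketch, not benchmarked here): base `regexpr` with `perl = TRUE` sometimes handles long alternation patterns faster than the default regex engine, though timings will vary by system.

```r
# Sketch: single-pass PCRE match; NA where no country name is found.
# `pattern` and `df` are as defined above; str_trim comes from stringr.
f3 <- function(df) {
  prod_lc <- tolower(df$Product)
  m <- regexpr(pattern, prod_lc, perl = TRUE)
  country <- rep(NA_character_, length(prod_lc))
  country[m > -1] <- regmatches(prod_lc, m)     # extract only the hits
  df$Country <- country
  df$Product <- str_trim(gsub(pattern, "", df$Product,
                              ignore.case = TRUE, perl = TRUE))
  df
}
```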
jlhoward
  • This is the answer to my question 2, Thanks! I have edited the 1st question – JayCo Dec 18 '13 at 17:32
  • I have marked this correct; it seems you are right that the original code is the most efficient way to do this... the ~2 seconds that you mention is probably right as well - this is part of a much larger filtering function that does other similar things, which takes ~7 seconds to run. I do think that this is the least efficient part of it though! – JayCo Dec 20 '13 at 18:14
  • Try using Rprof (`?Rprof`). It will tell you where R is spending its time. – jlhoward Dec 20 '13 at 18:20

Okay, back to the first problem. This might not be the most efficient solution, but it works.

The first thing that I would suggest is that you specify stringsAsFactors = FALSE when you generate the CoffeeTable data frame initially. Otherwise you end up with factors. I have also renamed the column of initial data in this table as Composite so that you can see the results of the separation.

match <- gregexpr(tolower(paste(ListofCountries$name, collapse="|")),
    tolower(CoffeeTable$Composite))
CoffeeTable$Country <- sapply(regmatches(CoffeeTable$Composite, match),
    function(m) {ifelse(length(m), m, "")})
CoffeeTable$Product <- sapply(regmatches(CoffeeTable$Composite, match, invert = TRUE),
    function(m) {paste0(m, collapse = "")})

The results look like this:

> head(CoffeeTable, 10)
              Composite Country              Product
1        Kenya Ndumberi   Kenya             Ndumberi
2        Kenya Ndumberi   Kenya             Ndumberi
3  Finca Nombre de Dios         Finca Nombre de Dios
4  Finca La Providencia         Finca La Providencia
5           Las Penidas                  Las Penidas
6           Las Penidas                  Las Penidas
7           Las Penidas                  Las Penidas
8         Panama Duncan  Panama               Duncan
9         Panama Duncan  Panama               Duncan
10        Panama Duncan  Panama               Duncan
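As a side note, the anonymous function with `ifelse(length(m), m, "")` can be written with a plain `if`/`else` inside `vapply` instead (a sketch; the behavior should be identical, and `vapply` adds a type check that each element yields exactly one string):

```r
# Same extraction as above, but vapply guarantees a character(1) per element.
CoffeeTable$Country <- vapply(regmatches(CoffeeTable$Composite, match),
    function(m) if (length(m)) m[[1]] else "", character(1))
```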
datawookie
  • my code was working so this really didn't answer my question about optimization and efficiency! you didn't include timings with your code, but the `ifelse` statement makes me think it will be much slower as that's not a vectorized function! (am I wrong?) – JayCo Dec 20 '13 at 18:17