Efficient String Search and Replace

Question

I am trying to clean about 2 million entries in a database of job titles. Many contain abbreviations that I want to change to a single, consistent, more easily searchable form. So far I am simply running through the column with individual mapply(gsub, ...) commands, but I have about 80 changes to make this way, so it takes almost 30 minutes to run. There has got to be a better way. I'm new to string searching; I found the *$ trick, which helped. Is there a way to do more than one search in a single mapply? I imagine that may be faster. Any help would be great. Thanks.

Here is some of the code. test is a column of 2 million individual job titles.

test <- mapply(gsub, " Admin ", " Administrator ", test)
test <- mapply(gsub, "Admin ", "Administrator ", test)
test <- mapply(gsub, " Admin*$", " Administrator", test)
test <- mapply(gsub, "Acc ", " Accounting ", test)
test <- mapply(gsub, " Admstr ", " Administrator ", test)
test <- mapply(gsub, " Anlyst ", " Analyst ", test)
test <- mapply(gsub, "Anlyst ", "Analyst ", test)
test <- mapply(gsub, " Asst ", " Assistant ", test)
test <- mapply(gsub, "Asst ", "Assistant ", test)
test <- mapply(gsub, " Assoc ", " Associate ", test)
test <- mapply(gsub, "Assoc ", "Associate ", test)

This might be useful: http://stackoverflow.com/questions/26171318/regex-for-preserving-case-pattern-capitalization/26171700 — thelatemail, Nov 18 '15 at 05:43

score 5 · Answer 1 · answered Nov 18 '15 at 05:12

One option is mgsub() from the qdap package, which takes a vector of patterns and a matching vector of replacements:

library(qdap)
mgsub(patternVec, replaceVec, test)

data

patternVec <- c(" Admin ", "Admin ")
replaceVec <- c(" Administrator ",  "Administrator ")

akrun

@DickMcManus It may be worth trying essentially the same thing using the `stringi` library, which is supposed to be pretty efficient: `stri_replace_first_regex(test, patternVec, replaceVec)`. – Jota Nov 18 '15 at 05:40

@Jota - a quick test suggests that `stri` would be about 1.5x faster than just using a `for` loop, which should be faster than anything else here. – thelatemail Nov 18 '15 at 06:05
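
For anyone trying the stringi route mentioned in these comments, here is a minimal sketch. One thing to watch: stringi functions are vectorized element-wise over the pattern by default, so pairing 80 patterns with 2 million titles would recycle the patterns rather than apply all of them to every title. Passing vectorize_all = FALSE to stri_replace_all_regex() applies each pattern in turn to every string. The sample titles and the use of \b word boundaries are my own assumptions, not something from the question.

```r
library(stringi)

# Hypothetical lookup vectors; the names mirror the mgsub answer above.
# \\b marks a word boundary, assuming each abbreviation is a whole word.
patternVec <- c("\\bAdmin\\b", "\\bAsst\\b")
replaceVec <- c("Administrator", "Assistant")

# Made-up sample data standing in for the 2 million job titles.
titles <- c("Asst Admin", "Admin Asst to CEO", "Administrator")

# vectorize_all = FALSE: every pattern is applied, in order, to every title.
cleaned <- stri_replace_all_regex(titles, patternVec, replaceVec,
                                  vectorize_all = FALSE)
print(cleaned)
```

Whether this beats a plain loop on the real data is something to benchmark on your own column; the timings quoted in this thread are only quick tests.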

score 3 · Accepted Answer · answered Nov 18 '15 at 05:23

Here is a base R solution. Define a data frame holding every pattern and its replacement, then use apply() in row mode to call gsub() on your test vector once per pattern/replacement pair. Because gsub() is vectorized over its input, each call processes the whole vector at once. Here is sample code demonstrating this:

df <- data.frame(pattern=c(" Admin ", "Admin "),
                 replacement=c(" Administrator ", "Administrator "),
                 stringsAsFactors=FALSE)

test <- c(" Admin ", "Admin ")

# For each row, x[1] is the pattern and x[2] is the replacement;
# <<- updates the global test vector. invisible() hides apply()'s return value.
invisible(apply(df, 1, function(x) {
                test <<- gsub(x[1], x[2], test)
             }))

> test
[1] " Administrator " "Administrator "
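
A variation on the same idea, as a rough sketch rather than a drop-in fix: a plain for loop avoids the <<- side effect, and word boundaries (\b) can stand in for the separate " Admin ", "Admin " and end-of-string patterns in the question. Note that in " Admin*$" the * applies to the final n (zero or more of them), so it is not doing what a wildcard would. The lookup vector and sample titles below are made up for illustration, and the approach assumes each abbreviation only ever appears as a whole word.

```r
# Hypothetical lookup: names are abbreviations, values are replacements.
lookup <- c(Admin  = "Administrator",
            Admstr = "Administrator",
            Anlyst = "Analyst",
            Asst   = "Assistant",
            Assoc  = "Associate")

# Made-up sample data standing in for the real column.
titles <- c("Asst Admin", "Sr Anlyst", "Assoc Director", "Administrator")

cleaned <- titles
for (abbr in names(lookup)) {
  # \\b matches at word boundaries, so one pattern covers start,
  # middle, and end of the title without touching longer words.
  pattern <- paste0("\\b", abbr, "\\b")
  # gsub() is vectorized: one call handles the whole vector.
  cleaned <- gsub(pattern, lookup[[abbr]], cleaned, perl = TRUE)
}
print(cleaned)
```

The main saving over the original code is that gsub() runs once per pattern on the whole vector, instead of mapply() calling it once per title.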

Tim Biegeleisen

@thelatemail I have never seen this post, nor did I use SO to generate my answer. – Tim Biegeleisen Nov 18 '15 at 05:45
I wasn't suggesting you did (or that you did anything untoward at all) - just linking an old post so that questions are connected together. – thelatemail Nov 18 '15 at 05:46
Cool...no worries...thanks for adding this information. – Tim Biegeleisen Nov 18 '15 at 05:47
@thelatemail I just noticed your comment once I answered lol. I have a similar answer in the link you provided. – hwnd Nov 18 '15 at 05:49
@hwnd - :-D - I guess the only question now is whether `stri_replace....` functions would speed it up massively. – thelatemail Nov 18 '15 at 05:53
@thelatemail Definitely **stringi** - `STRINGI 56.203 62.0680 87.22181 80.6385 107.2740 271.727 100` – hwnd Nov 18 '15 at 06:13
Thank you, this cut the run time by over a fourth and is along the lines of what I was thinking. – Dick McManus Nov 19 '15 at 07:48
Thanks @Dick and feel free to upvote if you want :-) – Tim Biegeleisen Nov 19 '15 at 07:48
