How to re-code text that includes specific text

Question

I am trying to re-code a large set of text data into either a text or numeric value.

My data set includes names of coffee shops. I would like to re-code these coffee shops into either "corporation" or "small business". The problem is that the names are spelled in several different ways (e.g., "starbucks" vs. "starbcks" vs. "starbucks coffee"). I would like to write code that scans the dataset for the string "star" and re-codes every matching store as "corporation".

Example data:

library(data.table)

customers <- data.table(customer_id = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5), 
                        store = c("starbcks", "peets", "coffee bean", "drnk", "starbucks", "coffee ben", "coffee bean", "coffee bean", "drnk", "starbucks coffee"))

I would like to re-code the "store" column into a new "type" column, which I would then convert to a factor and re-code into a numeric value. The desired result looks like this:

customers <- data.table(customer_id = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5), 
                        store = c("starbcks coffee", "portfolios", "coffee bean", "sharkhead", "starbucks", "coffee ben", "cuppa cuppa", "coffee bean", "drnk", "starbucks coffee"),
                        type = c("corporation", "small business", "corporation", "small business", "corporation", "corporation", "small business", "corporation", "corporation", "corporation"),
                        rc_type = c(1, 2, 1, 2, 1, 1, 2, 1, 1, 1))

I have looked into the stringr package and tried the standard way of re-coding, but to no avail. Any help is appreciated. Thank you!

Likely duplicate: http://stackoverflow.com/questions/26405895/how-can-i-match-fuzzy-match-strings-from-two-datasets. There is no easy answer for fuzzy matching. I mean who's to say that "starbcks" isn't another company vs a misspelling. — MrFlick, Feb 28 '17 at 19:25
`grep("star",store)` would find all the locations of "star" in the store vector, then just need to set those to "corporation" in a new column. — Dan Slone, Feb 28 '17 at 21:05
Like so: `customers$type[grep("star",customers$store)] <- "corporation"`. I suspect that other misspellings, such as "strbucks" and having so many names will cause you grief. — Dan Slone, Feb 28 '17 at 21:12
If regular expressions aren't picking up all your cases you may want to look into edit-distance algorithms. In the RecordLinkage package there is a `jarowinkler()` function that will compute how similar two strings are. — gfgm, Mar 01 '17 at 00:21
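The edit-distance idea in the comment above can be sketched as follows. This is only an illustration under stated assumptions: it uses base R's `adist()` (generalized Levenshtein distance) in place of `jarowinkler()` so that no extra package is needed, and the list of canonical names (`known`) and the cutoff (`max_dist`) are hypothetical values chosen for this example.

```r
# Store names taken from the question's example data
stores <- c("starbcks", "peets", "coffee bean", "drnk", "starbucks",
            "coffee ben", "starbucks coffee")

# Hypothetical list of canonical corporate names (an assumption, not from the question)
known <- c("starbucks", "coffee bean")

# adist() returns a matrix of edit distances: one row per store, one column per known name
d <- adist(stores, known)

# For each store, find the closest known name and the distance to it
nearest <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(stores), nearest)]

# Assumed cutoff: accept a match only if it is at most 2 edits away
max_dist <- 2
matched <- ifelse(best_dist <= max_dist, known[nearest], NA_character_)
```

With this cutoff, "starbcks" and "coffee ben" are mapped to their canonical names, while "starbucks coffee" is not, because the extra word costs seven edits. Edit distance handles misspellings but not added words, so it would probably need to be combined with a pattern match.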
Thank you @DanSlone, this was very helpful! I tried writing more efficient code, i.e., `customers$type[grep("star", "bean", "drnk", customers$store)] <- "corporation"`, instead of writing a separate line for each individual coffee shop. However, this code did not work. Any suggestions? — Tawk_Tomahawk, Mar 01 '17 at 23:56
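One likely reason the call above fails is that `grep()` accepts a single pattern as its first argument; the second argument is the text to search and the later positional arguments are options. A minimal sketch of one way around this is to join the alternatives with `|` into one regular expression. The sketch uses a base `data.frame` so it runs without extra packages (the same column assignment should also work on a `data.table`), and the pattern list and the 1/2 coding are assumptions based on the desired output in the question.

```r
# Example data from the question, as a base data.frame
customers <- data.frame(
  customer_id = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  store = c("starbcks", "peets", "coffee bean", "drnk", "starbucks",
            "coffee ben", "coffee bean", "coffee bean", "drnk",
            "starbucks coffee"),
  stringsAsFactors = FALSE
)

# Hypothetical list of substrings that mark a store as a corporation
corporate_patterns <- c("star", "bean", "drnk")

# Combine them into one regular expression: "star|bean|drnk"
pattern <- paste(corporate_patterns, collapse = "|")

# grepl() returns TRUE/FALSE per row, so every store gets a type
customers$type <- ifelse(grepl(pattern, customers$store),
                         "corporation", "small business")

# Convert to a factor with an explicit level order, then to integer codes
# (1 = corporation, 2 = small business)
customers$rc_type <- as.integer(
  factor(customers$type, levels = c("corporation", "small business"))
)
```

Note that "coffee ben" contains none of the three substrings, so it ends up as "small business" here. Catching misspellings like that needs either a shorter pattern or an approximate match.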
