I am trying to re-code a large set of text data into either a text or numeric value.

My data set includes names of coffee shops. I would like to re-code these coffee shops into either "corporation" or "small business". The problem is that the names are spelled inconsistently (e.g., starbucks vs. starbcks vs. starbucks coffee). I would like to write code that scans the store names for the substring "star" and re-codes every match as "corporation".

Example data:

library(data.table)

customers <- data.table(customer_id = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5), 
                        store = c("starbcks", "peets", "coffee bean", "drnk", "starbucks", "coffee ben", "coffee bean", "coffee bean", "drnk", "starbucks coffee"))

I would like to recode the "store" column into a new "type" column, which I would then convert to a factor and re-code as a numeric value. The desired result looks like this:

customers <- data.table(customer_id = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5), 
                        store = c("starbcks coffee", "portfolios", "coffee bean", "sharkhead", "starbucks", "coffee ben", "cuppa cuppa", "coffee bean", "drnk", "starbucks coffee"),
                        type = c("corporation", "small business", "corporation", "small business", "corporation", "corporation", "small business", "corporation", "corporation", "corporation"),
                        rc_type = c(1, 2, 1, 2, 1, 1, 2, 1, 1, 1)) 

I have looked into the stringr package and tried the standard way of re-coding, but to no avail. Any help is appreciated. Thank you!

Tawk_Tomahawk
  • Likely duplicate: http://stackoverflow.com/questions/26405895/how-can-i-match-fuzzy-match-strings-from-two-datasets. There is no easy answer for fuzzy matching. I mean who's to say that "starbcks" isn't another company vs a misspelling. – MrFlick Feb 28 '17 at 19:25
  • `grep("star",store)` would find all the locations of "star" in the store vector, then just need to set those to "corporation" in a new column. – Dan Slone Feb 28 '17 at 21:05
  • Like so: `customers$type[grep("star",customers$store)] <- "corporation"`. I suspect that other misspellings, such as "strbucks" and having so many names will cause you grief. – Dan Slone Feb 28 '17 at 21:12
  • If regular expressions aren't picking up all your cases you may want to look into edit-distance algorithms. In the RecordLinkage package there is a `jarowinkler()` function that will compute how similar two strings are. – gfgm Mar 01 '17 at 00:21
  • Thank you @DanSlone, this was very helpful! I tried to write more efficient code, i.e., `customers$type[grep("star", "bean", "drnk", customers$store)] <- "corporation"`, instead of writing a separate line for each individual coffee shop. However, this code did not work. Any suggestions? – Tawk_Tomahawk Mar 01 '17 at 23:56
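
The multi-term attempt in the last comment fails because `grep()` takes a single pattern: its second positional argument is `x`, the vector to search, so `"bean"` is searched instead of `customers$store`. A minimal sketch of one way around this, building on the `grep()` suggestion above, is to join the search terms into one regular expression with `|`. The fragment list (including `"ben"` to catch "coffee ben") and the rule that anything unmatched is a "small business" are assumptions for illustration, not part of the original question.

```r
library(data.table)

# Example data from the question
customers <- data.table(
  customer_id = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  store = c("starbcks", "peets", "coffee bean", "drnk", "starbucks",
            "coffee ben", "coffee bean", "coffee bean", "drnk",
            "starbucks coffee")
)

# Hypothetical list of substrings that identify a corporation.
# paste(..., collapse = "|") turns it into the single pattern
# "star|bean|ben|drnk", which matches if any fragment occurs.
corp_fragments <- c("star", "bean", "ben", "drnk")
corp_pattern <- paste(corp_fragments, collapse = "|")

# grepl() returns one TRUE/FALSE per row, so it can drive ifelse().
# Assumption: every store that does not match is a small business.
customers[, type := ifelse(grepl(corp_pattern, store),
                           "corporation", "small business")]

# Setting the factor levels explicitly fixes the numeric codes:
# corporation = 1, small business = 2.
customers[, rc_type := as.integer(
  factor(type, levels = c("corporation", "small business"))
)]

print(customers)
```

Substring matching of this kind still misses spellings that drop the fragment itself (for example "strbucks"), which is where the edit-distance approach mentioned in the comments would come in.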

0 Answers