I'm trying to join a column of data to a lookup table. However, the data will contain the text in the column of the lookup table (but may contain arbitrary text before and after it).
This sounds similar to fuzzyjoin two data frames using data.table , but fuzzy matching won't work here. I need an exact match with a certain string and to ignore everything before and after. For example, I need "Jo Billiam" to match to "Jo", and not to "William".
Example data:
library(data.table)
data <- data.table(full.name=c('Jonathan Parker', 'Mr. Jonathan Smith', 'S. Patrick Jones'))
lookup <- data.table(name.long=c('William', 'Patrick', 'Jonathan'),
name.short=c('Bill', 'Pat', 'John'))
Desired Output:
data.table(full.name=c('Jonathan Parker', 'Mr. Jonathan Smith', 'S. Patrick Jones'),
nickname=c('John', 'John', 'Pat'))
The two ideas I've had so far are:
- Use grepl with sapply to effectively loop through the whole lookup table
- Use grep to strip out anything that doesn't match one of the lookup strings, and then do a normal data.table join on that stripped out version.
Both of these seem quite inelegant though: is there a better way of doing this?
NB. In my data I don't have to worry about matching more than one lookup string