
I'm trying to join a column of data to a lookup table. However, each value in my data column contains the lookup table's text somewhere within it, possibly with arbitrary text before and after.

This sounds similar to "fuzzyjoin two data frames using data.table", but fuzzy matching won't work here. I need an exact match on a certain string, ignoring everything before and after it. For example, I need "Jo Billiam" to match "Jo", and not "William".

Example data:

library(data.table)
data <- data.table(full.name=c('Jonathan Parker', 'Mr. Jonathan Smith', 'S. Patrick Jones'))
lookup <- data.table(name.long=c('William', 'Patrick', 'Jonathan'),
                     name.short=c('Bill', 'Pat', 'John'))

Desired Output:

data.table(full.name=c('Jonathan Parker', 'Mr. Jonathan Smith', 'S. Patrick Jones'),
            nickname=c('John', 'John', 'Pat'))

The two ideas I've had so far are:

  1. Use grepl with sapply to effectively loop through the whole lookup table (a rough sketch of this is included after the note below)
  2. Use grep to strip out anything that doesn't match one of the lookup strings, and then do a normal data.table join on that stripped out version.

Both of these seem quite inelegant, though: is there a better way of doing this?

NB: In my data I don't have to worry about matching more than one lookup string.
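
For reference, here's a rough sketch of idea 1 (just a sketch; it leans on the fact, per the NB above, that each full.name matches at most one lookup string):

# Idea 1 sketched: test each full.name against every lookup string with grepl
# and pull the corresponding name.short (NA if nothing matches).
# Note this is a plain substring match, not a whole-word match.
data[, nickname := sapply(full.name, function(x) {
  hits <- lookup$name.short[sapply(lookup$name.long, grepl, x = x, fixed = TRUE)]
  if (length(hits)) hits[1] else NA_character_
})]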

MattB
  • So what then? A simple merge?? Also, next time you think your Q is not a dupe, just comment under your Q. Don't ask the same thing twice, as you fill the site with 'noise' – Sotos Aug 27 '20 at 11:27
  • Thanks for your advice. The message I got from Stackoverflow was "If this question doesn’t resolve your question, ask a new one." It didn't, so I thought that asking a new question, more clearly phrased, was what was being recommended to me. Since the previous question has already been closed, I didn't think "noise" should be a significant issue either. – MattB Aug 27 '20 at 11:29
  • Key word there: 'New one' :). I'm reading it now. I'll try and help – Sotos Aug 27 '20 at 11:31

2 Answers


Using SQL we can perform a left join on the condition that full.name contains name.long as a substring. The column names must be escaped using [...] because the dot is an SQL operator. The double pipes (||) denote string concatenation, so the expression '%' || L.[name.long] || '%' produces a pattern of the form %...%, where % is a wildcard matching any string.

library(sqldf)

sqldf("select D.*, L.[name.short]
  from data D
  left join lookup L on D.[full.name] like '%' || L.[name.long] || '%'")

giving:

           full.name name.short
1    Jonathan Parker       John
2 Mr. Jonathan Smith       John
3   S. Patrick Jones        Pat

We can alternatively use the instr function, which works like grepl(..., fixed = TRUE). It gives the same output.

sqldf("select D.*, L.[name.short]
  from data D
  left join lookup L on instr(D.[full.name], L.[name.long])")
G. Grothendieck
  • Thanks for the answer - this works perfectly on the test data, so I'll accept it if no better one comes along. Unfortunately, when I applied it to my real data it failed, as my data has a few exotic column classes which sqldf can't handle. I can get around this by getting fewer columns from sqldf and then joining back to my original data (sketched below). Still, I wish there were a more elegant (and faster) solution... – MattB Aug 27 '20 at 13:15
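
For what it's worth, a minimal sketch of that workaround (assuming full.name can serve as the join-back key; the names slim, matched and result are just illustrative):

# Pass only the key column to sqldf, then join the result back
# onto the original data.table
slim <- data[, .(full.name)]
matched <- sqldf("select S.[full.name], L.[name.short]
  from slim S
  left join lookup L on S.[full.name] like '%' || L.[name.long] || '%'")
result <- merge(data, as.data.table(matched), by = "full.name", all.x = TRUE)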

Maybe fuzzyjoin::regex_left_join solves your problem.

  1. Create a regex column: find name.long and ignore everything around it
library(fuzzyjoin)
library(tidyverse)

lookup <- lookup %>% 
    mutate(regex = paste(".*\\b", name.long, "\\b.*", sep = ""))
  2. Join the tables by regex
output <- regex_left_join(data, 
                          lookup, 
                          by = c(full.name = "regex"))
  3. Select columns
output %>% 
  select(full.name, name.short)
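
To get this back into a data.table with the nickname column from the desired output (a small follow-on sketch, not part of the answer itself):

library(data.table)

# Convert the joined result to a data.table and rename name.short to nickname
result <- as.data.table(output)[, .(full.name, nickname = name.short)]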
tamtam