How to join dataframe on multiple columns and a fuzzy match on one?

Question

I am trying to join decoded VIN data from NHTSA with vehicle data from fueleconomy.gov using year, make, and model. Below is an example of the data I am trying to join:

# This is the first dataframe  
make <- c("PORSCHE", "TESLA", "MITSUBISHI")
model <- c("Cayenne", "Model S", "Outlander - PHEV")
year <- c(2017, 2013, 2018)
electrification_level <- (PHEV, BEV, PHEV)
vin_data <- data.frame(make, model, year, electrification_level)

# This is the second dataframe    
make <- c("Porsche", "Tesla", "Mitsubishi")
# There are multiple versions of the models (an average of these would be ideal - e.g. avg. mpg)
model <- c("Cayenne S e-Hybrid", "Model S AWD - P85D", "Outlander 2WD")
year <- c(2017, 2013, 2018)
# These mpg are made up for the example
mpg <- c(75, 120, 80)
fueleconomy_data <- data.frame(make, model, year, mpg)

I am faced with multiple problems trying to complete this join.

I need to join data using make and year, but make needs to be case insensitive.
I need to perform an inexact match on model and perhaps average the mpg values for the model, as there are multiple entries in the fueleconomy.gov data for each model (e.g. 2WD, 4WD, different engine sizes, hybrid, etc.).

I have referenced the following questions to try and solve this riddle:

I also reached out to both fueleconomy.gov and NHTSA to see if they have capabilities to join data based on a vehicle ID, but wanted to ask the community if there might be a straightforward solution as well.

This is difficult to assess with only three points for each dataset. Based on this, I would create a key for `model` that uses the first word. Based on what I see, that should get you close to a true join and you can filter accordingly. — akash87, Dec 17 '19 at 16:57

score 1 · Answer 1 · answered Dec 17 '19 at 16:58

There are a few typos in your reprex, so I'm pasting it again below.

# This is the first dataframe
make <- c("PORSCHE", "TESLA", "MITSUBISHI")
model <- c("Cayenne", "Model S", "Outlander - PHEV")
year <- c(2017, 2013, 2018)
electrification_level <- c("PHEV", "BEV", "PHEV")
vin_data <- data.frame(make, model, year, electrification_level, stringsAsFactors = FALSE)

# This is the second dataframe    
make <- c("Porsche", "Tesla", "Mitsubishi")
# There are multiple versions of the models (an average of these would be ideal - e.g. avg. mpg)
model <- c("Cayenne S e-Hybrid", "Model S AWD - P85D", "Outlander 2WD")
year <- c(2017, 2013, 2018)
# These mpg are made up for the example
mpg <- c(75, 120, 80)
fueleconomy_data <- data.frame(make, model, year, mpg, stringsAsFactors = FALSE)

For your first question, I would just change them all to uppercase with the toupper function and then use a full join.

df_joined <- vin_data %>% 
  full_join(fueleconomy_data %>% 
    dplyr::mutate(make = base::toupper(make)), by = "make")

For #2, you can use some if/else logic. I've had a go but you could tweak that to your hearts content.

df_joined %>%  
  dplyr::mutate(model_same = if_else(condition = word(model.x) == word(model.y), true = TRUE, false = FALSE))

score 0 · Answer 2 · answered Dec 17 '19 at 17:24

You can use RecordLinkae package to get what you want. You can play with the epiwt of 0.6 to increase or decrease the accuracy of text match.

library(RecordLinkage)
pairs <- compare.linkage(fueleconomy_data, vin_data, strcmp = 2, exclude=c(3,4), blockfld = 1) 
epiwt <- epiWeights(pairs) 
epiclass <- epiClassify(epiwt, .6)
getPairs(epiclass, show="links", single.rows=T)

make.1 model.1 make.2 model.2 year.2 electrification_level.2 3 MITSUBISHI Outlander 2WD MITSUBISHI Outlander - PHEV 2018 PHEV 1 PORSCHE Cayenne S e-Hybrid PORSCHE Cayenne 2017 PHEV 2 TESLA Model S AWD - P85D TESLA Model S 2013 BEV

This is to make the columns uppercase. You need to do that before the recordlinkage
vin_data$make <- toupper(vin_data$make) fueleconomy_data$make <- toupper(fueleconomy_data$make)

How to join dataframe on multiple columns and a fuzzy match on one?

2 Answers2