I am trying to join decoded VIN data from NHTSA with vehicle data from fueleconomy.gov using year, make, and model. Below is an example of the data I am trying to join:
# This is the first dataframe
make <- c("PORSCHE", "TESLA", "MITSUBISHI")
model <- c("Cayenne", "Model S", "Outlander - PHEV")
year <- c(2017, 2013, 2018)
electrification_level <- (PHEV, BEV, PHEV)
vin_data <- data.frame(make, model, year, electrification_level)
# This is the second dataframe
make <- c("Porsche", "Tesla", "Mitsubishi")
# There are multiple versions of the models (an average of these would be ideal - e.g. avg. mpg)
model <- c("Cayenne S e-Hybrid", "Model S AWD - P85D", "Outlander 2WD")
year <- c(2017, 2013, 2018)
# These mpg are made up for the example
mpg <- c(75, 120, 80)
fueleconomy_data <- data.frame(make, model, year, mpg)
I am faced with multiple problems trying to complete this join.
- I need to join data using make and year, but make needs to be case insensitive.
- I need to perform an inexact match on model and perhaps average the mpg values for the model, as there are multiple entries in the fueleconomy.gov data for each model (e.g. 2WD, 4WD, different engine sizes, hybrid, etc.).
I have referenced the following questions to try and solve this riddle:
I also reached out to both fueleconomy.gov and NHTSA to see if they have capabilities to join data based on a vehicle ID, but wanted to ask the community if there might be a straightforward solution as well.