Matching multiple rows in one dataset to a value in another dataset to create a new column

Question

I have a dataset that has US cities present in it but the states are currently all labeled as US - the dataset is called us_data. I have found a dataset containing all US cities and states - called us - and I'm trying to create a new state column in my us_data dataset by taking the city name, finding it in us and pulling the state from us to add to the new column in us_data.

I'm using R and not sure how to go about this. I don't believe a regular join will work as there can be multiple observations in us_data with the same city so all of the rows with the same city would need to be matched. I was thinking of using mutate() from dplyr but not sure how to reference two datasets in the function call so any help would be appreciated! I've attached a glimpse of both datasets for reference.

us_data

us

> dput(us_data[1:10,1:7])
structure(list(name = c("Carpenter Rd.", "1515 N. Sheridan - Wilmette", 
"S McCarran & E Greg St - Sparks", "Hwy 20 & Tharp - Yuba City", 
"Greenmount & I-64", "Veterans Blvd & Kingman St", "Hampden & Dayton - Denver", 
"50th and Kipling-Wheatridge, CO", "Higuera & Tank Farm", "Burr Ridge-I-55 & County Line Rd"
), url = c("https://www.starbucks.com/store-locator/store/6323", 
"https://www.starbucks.com/store-locator/store/6325", "https://www.starbucks.com/store-locator/store/6327", 
"https://www.starbucks.com/store-locator/store/6328", "https://www.starbucks.com/store-locator/store/6329", 
"https://www.starbucks.com/store-locator/store/6330", "https://www.starbucks.com/store-locator/store/6334", 
"https://www.starbucks.com/store-locator/store/6333", "https://www.starbucks.com/store-locator/store/6331", 
"https://www.starbucks.com/store-locator/store/6340"), street_address = c("3650 Carpenter Rd.", 
"1515 North Sheridan, Building 4", "1560 S. Stanford Way, Suite A", 
"1615 Colusa Hwy, Ste 100", "1126 Central Park Drive", "4312 Veterans Blvd.", 
"9925 East Hampden Ave", "4975 Kipling St", "3971 S. Higuera Street", 
"515 Village Center Dr."), city = c("Pittsfield", "Wilmette", 
"Sparks", "Yuba City", "OFallon", "Metairie", "Denver", "Wheat Ridge", 
"San Luis Obispo", "Burr Ridge"), state = c("US", "US", "US", 
"US", "US", "US", "US", "US", "US", "US"), zip_code = c("48104", 
"600911822", "894316331", "959939437", "622691769", "70006", 
"802314903", "800332340", "934011580", "605274516"), country = c("US", 
"US", "US", "US", "US", "US", "US", "US", "US", "US")), row.names = c(NA, 
10L), class = "data.frame")

> dput(us[1:20,])
structure(list(city = c("New York", "Los Angeles", "Chicago", 
"Miami", "Dallas", "Philadelphia", "Houston", "Atlanta", "Washington", 
"Boston", "Phoenix", "Seattle", "San Francisco", "Detroit", "San Diego", 
"Minneapolis", "Tampa", "Denver", "Brooklyn", "Queens"), city_ascii = c("New York", 
"Los Angeles", "Chicago", "Miami", "Dallas", "Philadelphia", 
"Houston", "Atlanta", "Washington", "Boston", "Phoenix", "Seattle", 
"San Francisco", "Detroit", "San Diego", "Minneapolis", "Tampa", 
"Denver", "Brooklyn", "Queens"), state_id = c("NY", "CA", "IL", 
"FL", "TX", "PA", "TX", "GA", "DC", "MA", "AZ", "WA", "CA", "MI", 
"CA", "MN", "FL", "CO", "NY", "NY"), state_name = c("New York", 
"California", "Illinois", "Florida", "Texas", "Pennsylvania", 
"Texas", "Georgia", "District of Columbia", "Massachusetts", 
"Arizona", "Washington", "California", "Michigan", "California", 
"Minnesota", "Florida", "Colorado", "New York", "New York")), row.names = c(NA, 
20L), class = "data.frame")

Hello, and welcome! Kindly provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of your datasets, by pasting into your answer the outputs from `dput(us_data)` and `dput(us)`. — Greg, Nov 23 '21 at 22:50
ok. It is not easy without seeint your data, but as I can observe from `us_data` dataset, you also have Zip_codes. So, the easiest way is to create a new column with 5-digit Zip_code (extract first 5 digits from every zip_code in dataset), and then match it State by using package `usa`, which contains zip_codes: `usa::zipcodes`. — Bloxx, Nov 23 '21 at 23:04
@Greg I updated the question with the dput outputs, if I did it incorrectly please let me know! — famato, Nov 23 '21 at 23:15
@famato It works on my end. :) Though you might want to include some cities from `us` whose names appear in `us_data`. — Greg, Nov 23 '21 at 23:16
@famato Have you even read my comment? I wrote you have to shorten your Zip-codes to the first 5 digits. And yes, it is correct! — Bloxx, Nov 23 '21 at 23:20

score 1 · Accepted Answer · answered Nov 23 '21 at 23:21

1

Use match:

us_data$State_NEW <- us$state_name[match(us_data$city, us$city)]

answered Nov 23 '21 at 23:21

Chris Ruehlemann

20,321
4
12
34

this works perfectly, thank you! – famato Nov 23 '21 at 23:29
@famato Just FYI, if either answer solved your question, you should upvote it* and [mark it as accepted](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work/5235#5235). _*You can only upvote once you have 15+ reputation._ – Greg Nov 24 '21 at 03:20

Greg · Answer 2 · 2021-11-23T23:27:32.710

The following solution with dplyr::*_join() should work, and it should help you gain comfort with the dplyr workflow.

I am assuming your goal is to enrich us_data with more detailed info on the state.

library(dplyr)

# ...
# Code to generate 'us_data' and 'us'.
# ...


us_data %>%
  # OPTIONALLY reduce ZIP codes to only their first 5 digits.
  mutate(
    zip_code = substr(zip_code, 1, 5),
  ) %>%
  # Match the US data to the proper states.
  left_join(
    us,
    by = "city"
  ) %>%
  # Remove the unhelpful 'state' column, which only shows "US".
  select(!state)

Matching multiple rows in one dataset to a value in another dataset to create a new column

2 Answers2