Categorizing Data Based on Text Value in New Column

Question

I'm trying to take an existing data frame that has a column for state and add a new column called Region depending on what the row's state is. So for example any row that has "CA" should be categorized "West" and any row that has "IL" should be Midwest. There are 4 regions: West, South, Midwest, and Northeast.

I had tried doing this separately in 4 code chunks like this:

south <- c("FL", "KY", "GA", "TX", "MS", "SC", "NC", "AL", "LA", "AR", "TN", "VA", "DC", "MD", "DE", "WV") #16 states
south.mdata <- mdata %>% filter(state %in% south)       #1832 locations
south.byyear <- south.mdata %>% group_by(Year) %>% summarize(s.total = n())
south.total <- data %>% filter(state %in% south) %>% group_by(Year) %>% summarize(yearly.total = n())

But this seems repetitive and not the most efficient way to do this. Plus I'd like to be able to group_by both Year and Region so I can compare across regions.

I'm having trouble implementing this and the first thing that comes to mind is to do some sort of if/else loop using filter but I know loops aren't really R's style.

The original data looks like this:

 Field.1    ID              title description                  streetaddress           city state
1      74 DE074    Cork 'n' Bottle             Route 14, 1 mile south of town Rehoboth Beach    DE
2      75 DE075    Cork 'n' Bottle             Route 14, 1 mile south of town Rehoboth Beach    DE
3      23 DE023          Dog House                           1200 DuPont Hwy.     Wilmington    DE
4      19 DE019          Dog House                            1200 DuPont Hwy     Wilmington    DE
5      26 DE026          Dog House                                1200 Dupont     Wilmington    DE
6      65 DE065 Henlopen Hotel Bar                           Boardwalk & Surf Rehoboth Beach    DE
  amenityfeatures             type Year notes       lon      lat
1         (M),(R)       Restaurant 1977  <NA> -75.07601 38.72095
2         (M),(R)       Restaurant 1976  <NA> -75.07601 38.72095
3         (M),(R)       Restaurant 1975  <NA> -75.58243 39.68839
4         (M),(R)       Restaurant 1976  <NA> -75.58243 39.68839
5         (M),(R)       Restaurant 1974  <NA> -75.58723 39.76705
6             (M) Bars/Clubs,Hotel 1972  <NA> -75.07712 38.72280
                                                                      status
1 Location could not be verified. General city or location coordinates used.
2 Location could not be verified. General city or location coordinates used.
3                                                   Google Verified Location
4                                                   Google Verified Location
5                                                   Google Verified Location
6                                                          Verified Location

I want to add a new column called "Region" that would loop through each row, look at the state, and then add a value to Region.

Any suggestions on the right syntax to do something like this would be so appreciated! Thanks so much!

Use a lookup table - create a data frame with 2 columns - `state` and `region`. Then use a join to add `region` to your current table: `mydata %>% left_join(region_lookup, by = "state")`. — Gregor Thomas, Nov 02 '20 at 16:14

score 1 · Accepted Answer · answered Nov 02 '20 at 16:22

This is a snippet of what the solution suggested by Gregor’s comment could look like.

library(tidyverse)

orig_data <- 
  tribble(~ID, ~state,
          1,   "CA",
          2,   "FL",
          3,   "DE")

region_lookup <- 
  tribble(~state, ~region,
          "CA",   "west",
          "FL",   "south",
          "DE",   "south")

left_join(orig_data, region_lookup)
#> Joining, by = "state"
#> # A tibble: 3 x 3
#>      ID state region
#>   <dbl> <chr> <chr> 
#> 1     1 CA    west  
#> 2     2 FL    south 
#> 3     3 DE    south

^{Created on 2020-11-02 by the reprex package (v0.3.0)}

Thanks so much - this worked and was so helpful. In retrospect seems so simple but I was having trouble working through how to do this. Will remember this approach in the future so thanks! — Amanda Regan, Nov 02 '20 at 16:33

score 1 · Answer 2 · answered Nov 02 '20 at 16:24

the simplest solution would be a join. Therefore you need a data.frame/tibble that has all the states an regions. Fortunately the data is already in base R:

library(dplyr)
# build the tibble of state abbrevitation and region from base R data
state_region <- dplyr::tibble(state.abb, state.region)
# join it on your data.frame/tibble
ORIGINAL_DATA %>% 
  dplyr::left_join(state_region, by = c("state" = "state.abb"))

Now you should have a new column called "state.region" that you can group by. Be aware that the states have to be in upper case.

Categorizing Data Based on Text Value in New Column

2 Answers2

Linked