1

I have a data set of over 72,000 rows and need to extract one column (display_name) into two different columns (county and state) The data in the column looks like "Butler, (AL)" and I need it to separate.

Current Input

  cnty_fips display_name               Value
   <chr>     <chr>                      <dbl>
 1 02013     "\"Aleutians East, (AK)\""  125.
 2 02016     "\"Aleutians West, (AK)\""  172.
 3 02020     "\"Anchorage, (AK)\""       336.
 4 02050     "\"Bethel, (AK)\""          478.

Desired Output

  County          State
   <chr>           <chr>
1 Aleutians East  AL   
2 Aleutians West  AL 
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
Nichelle
  • 47
  • 6

2 Answers2

2

We can try using sub for a base R option:

df$County <- sub(",.*$", "", df$display_name)
df$State <- sub("^.*\\((.*?)\\)$", "\\1", df$display_name)
df

   cnty_fips         display_name Value         County State
1     02013 Aleutians East, (AK)   125 Aleutians East    AK
2     02016 Aleutians West, (AK)   172 Aleutians West    AK
3     02020      Anchorage, (AK)   336      Anchorage    AK
4     02050         Bethel, (AK)   478         Bethel    AK

Data:

df <- data.frame(cnty_fips=c("02013", "02016", "02020", "02050"),
                 display_name=c("Aleutians East, (AK)", "Aleutians West, (AK)",
                                "Anchorage, (AK)", "Bethel, (AK)"),
                 Value=c(125, 172, 336, 478), stringsAsFactors=FALSE)
Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
1

You can use tidyr::extract and specify the regex that you want to extract :

tidyr::extract(df, display_name, c('County', 'State'), '(.*),\\s\\((.*)\\)')

#  cnty_fips         County State Value
#1      2013 Aleutians East    AK   125
#2      2016 Aleutians West    AK   172
#3      2020      Anchorage    AK   336
#4      2050         Bethel    AK   478

data

df <- structure(list(cnty_fips = c(2013L, 2016L, 2020L, 2050L), 
display_name = c("Aleutians East, (AK)", "Aleutians West, (AK)", 
"Anchorage, (AK)", "Bethel, (AK)"), Value = c(125, 172, 336, 478)), 
class = "data.frame", row.names = c(NA, -4L))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Yes! Thank you this was exactly what I was looking for! I was having trouble getting the right regex. – Nichelle Jul 24 '20 at 06:29