Split a data frame column based on a comma

Question

I have a data frame with the following structure, titled "final_proj_data"

ID          County              Population     Year  
<dbl>       <chr>               <dbl>          <dbl>    
1003    Baldwin County, Alabama 169162         2006     
1015    Calhoun County, Alabama 112903         2006     
1043    Cullman County, Alabama 80187          2006     
1049    DeKalb County, Alabama  68014          2006

I am trying to split the column County into two different columns, County and State, and remove the comma.

I tried a number of permutations of the separate() function but I keep getting back this error:

Error: var must evaluate to a single number or a column name, not a character vector

I've tried (among others)

  final_proj_data %>% 
separate(final_proj_data$County, c("State", "County"), sep = ",", remove = TRUE)
    final_proj_data %>% 
separate(data = final_proj_data, col = County,
 into = c("State", "County"), sep = ",")

I'm not sure what I am doing wrong, or why the "col =" keeps throwing this error. Any help would be appreciated!

Try `final_proj_data %>% separate(County, into = c("County", "State"), sep = ",")` — Ronak Shah, Apr 28 '19 at 03:59
The problem with your code is that you're using `$` which is generally not suitable for use with data chained with the `dplyr` pipe` You therefore should replace `separate(final_proj_data$County,.....)` with just `separate(County,....)`. — NelsonGon, Apr 28 '19 at 05:21
@RonakShah, thanks! That worked! Sorry the question was kind of basic but I am new to R and chaining is not something I've used much in other programming. — FrenchConnections, Apr 28 '19 at 16:54

NelsonGon · Accepted Answer · 2019-04-28T05:19:33.077

Using dplyr and base R:

library(dplyr)
 final_proj_data %>% 
 mutate(State=unlist(lapply(strsplit(County,", "),function(x) x[2])),
       County=gsub(",.*","",County))
    ID         County Population Year   State
1 1003 Baldwin County     169162 2006 Alabama
2 1015 Calhoun County     112903 2006 Alabama
3 1043 Cullman County      80187 2006 Alabama
4 1049  DeKalb County      68014 2006 Alabama

Original:

With dplyr and tidyr(Just seen that @Ronak Shah commented the same above):

library(dplyr)
library(tidyr)
final_proj_data %>% 
   separate(County,c("County","State"),sep=",")
    ID         County    State Population Year
1 1003 Baldwin County  Alabama     169162 2006
2 1015 Calhoun County  Alabama     112903 2006
3 1043 Cullman County  Alabama      80187 2006
4 1049  DeKalb County  Alabama      68014 2006

score 2 · Answer 2 · answered Apr 28 '19 at 03:00

2

We can try using sub here for a base R option:

County <- sub(",.*$", "", final_proj_data$County)
State <- sub("^.*,\\s*", "", final_proj_data$County)
final_proj_data$County <- County
final_proj_data$State <- State

answered Apr 28 '19 at 03:00

Tim Biegeleisen

502,043
27
286
360

akrun · Answer 3 · 2019-04-28T03:07:03.973

We can do this in base R using read.csv

final_proj_data[c("County", "State")] <- read.csv(text = final_proj_data$County, 
              header = FALSE, stringsAsFactors = FALSE, strip.white = TRUE)
final_proj_data
#    ID         County Population Year   State
#1 1003 Baldwin County     169162 2006 Alabama
#2 1015 Calhoun County     112903 2006 Alabama
#3 1043 Cullman County      80187 2006 Alabama
#4 1049  DeKalb County      68014 2006 Alabama

data

final_proj_data <- structure(list(ID = c(1003L, 1015L, 1043L, 1049L), 
   County = c("Baldwin County, Alabama", 
"Calhoun County, Alabama", "Cullman County, Alabama", "DeKalb County, Alabama"
), Population = c(169162L, 112903L, 80187L, 68014L), Year = c(2006L, 
2006L, 2006L, 2006L)), class = "data.frame", row.names = c(NA, 
-4L))

jay.sf · Answer 4 · 2019-04-28T07:05:45.043

We can use strsplit in base R.

cbind(d, `colnames<-`(do.call(rbind, strsplit(d$County, ", ")), c("County", "State")))[-2]
#     ID Population Year         County   State
# 1 1003     169162 2006 Baldwin County Alabama
# 2 1015     112903 2006 Calhoun County Alabama
# 3 1043      80187 2006 Cullman County Alabama
# 4 1049      68014 2006  DeKalb County Alabama

Note: Use strsplit(as.character(d$County), ", ") if County is a factor column.

Data

d <- structure(list(ID = c("1003", "1015", "1043", "1049"), County = c("Baldwin County, Alabama", 
"Calhoun County, Alabama", "Cullman County, Alabama", "DeKalb County, Alabama"
), Population = c("169162", "112903", "80187", "68014"), Year = c("2006", 
"2006", "2006", "2006")), row.names = c(NA, -4L), class = "data.frame")

Split a data frame column based on a comma

4 Answers4

data