separate different combinations of names to first and last using dplyr, tidyr, and regex

Question

Sample data frame:

name <- c("Smith John Michael","Smith, John Michael","Smith John, Michael","Smith-John Michael","Smith-John, Michael")
df <- data.frame(name)

df
                 name
1  Smith John Michael
2 Smith, John Michael
3 Smith John, Michael
4  Smith-John Michael
5 Smith-John, Michael

I need to achieve the following desired output:

                 name first.name  last.name
1  Smith John Michael       John      Smith
2 Smith, John Michael       John      Smith
3 Smith John, Michael    Michael Smith John
4  Smith-John Michael    Michael Smith-John
5 Smith-John, Michael    Michael Smith-John

The rules are: if there is a comma in the string, everything before it is the last name, and the first word following the comma is the first name. If there is no comma in the string, the first word is the last name and the second word is the first name. Hyphenated words count as one word. I would rather achieve this with dplyr and regex, but I'll take any solution. Thanks for the help.

See http://stackoverflow.com/questions/7069076/split-column-at-delimiter-in-data-frame — Wiktor Stribiżew, Nov 01 '16 at 16:09

aichao · Accepted Answer · 2016-11-01T19:03:35.117

You can achieve your desired result using strsplit, switching between splitting on "," and " " depending on whether name contains a comma. Here, we define two functions to make the presentation clearer; you can just as well inline their code.

get.last.name <- function(name) {
  lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,1)
}

The result of strsplit is a list with one character vector per name. The lapply(...,`[[`,1) call loops through this list and extracts the first element of each vector, which is the last name.
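
To make the list structure concrete, here is a small sketch on two of the sample names. The variable names are illustrative only, and vapply is shown as one possible way to get a plain character vector; it is not part of the original answer.

```r
# Two sample names, both split on the comma
parts <- strsplit(c("Smith, John Michael", "Smith-John Michael"), ",")

# lapply keeps the list structure: one list element per name
last_as_list <- lapply(parts, `[[`, 1)

# vapply extracts the same elements but returns a character vector
last_as_chr <- vapply(parts, `[[`, character(1), 1)
```

Used inside mutate, the lapply version yields list columns, which print much like ordinary columns but can behave differently in later steps.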

get.first.name <- function(name) {
  d <- lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,2)
  lapply(strsplit(gsub("^ ","",d), " "),`[[`,1)
}

This function is similar except we extract the second element from each list element returned by strsplit, which contains the first name. We then remove any starting spaces using gsub, and we split again with " " to extract the first element from each list element returned by that strsplit as the first name.

Putting it all together with dplyr:

library(dplyr)
res <- df %>% mutate(first.name=get.first.name(name),
                     last.name=get.last.name(name))

The result is as expected:

print(res)
##                  name first.name  last.name
## 1  Smith John Michael       John      Smith
## 2 Smith, John Michael       John      Smith
## 3 Smith John, Michael    Michael Smith John
## 4  Smith-John Michael    Michael Smith-John
## 5 Smith-John, Michael    Michael Smith-John
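
Since the question asked for regex, the same rules can also be sketched with sub instead of strsplit. This is a minimal sketch, not the answer's method: split_name is a hypothetical helper name, and it assumes every name has at least two words and that words are separated by plain spaces.

```r
# Hypothetical helper (not from the original answer): applies the question's
# rules with regular expressions and returns plain character columns.
split_name <- function(name) {
  name <- as.character(name)  # guard against factor input
  has_comma <- grepl(",", name, fixed = TRUE)

  # Last name: everything before the comma, else the first word
  last <- ifelse(has_comma, sub(",.*$", "", name), sub(" .*$", "", name))

  # Remainder after the last name (and after the comma, if any)
  rest <- ifelse(has_comma,
                 sub("^[^,]*, *", "", name),
                 sub("^[^ ]+ +", "", name))

  # First name: first word of the remainder
  first <- sub(" .*$", "", rest)

  data.frame(first.name = first, last.name = last, stringsAsFactors = FALSE)
}

name <- c("Smith John Michael", "Smith, John Michael", "Smith John, Michael",
          "Smith-John Michael", "Smith-John, Michael")
out <- split_name(name)
```

With dplyr loaded, the result could be attached to the data frame with something like bind_cols(df, split_name(df$name)).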

Data:

df <- structure(list(name = c("Smith John Michael", "Smith, John Michael", 
"Smith John, Michael", "Smith-John Michael", "Smith-John, Michael"
)), .Names = "name", row.names = c(NA, -5L), class = "data.frame")
##                 name
##1  Smith John Michael
##2 Smith, John Michael
##3 Smith John, Michael
##4  Smith-John Michael
##5 Smith-John, Michael

score 0 · Answer 2 · answered Nov 02 '16 at 00:24

I am not sure if this is any better than aichao's answer, but I gave it a shot anyway. It gives the right output.

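library(dplyr)
library(tidyr)

# Names with a comma: everything before the comma is the last name
df1 <- df %>%
  filter(grepl(",", name)) %>%
  separate(name, c("last.name", "first.middle.name"), sep = ",", remove = FALSE) %>%
  mutate(first.middle.name = trimws(first.middle.name)) %>%
  # keep the first word after the comma; fill = "right" avoids a warning when there is no middle name
  separate(first.middle.name, c("first.name", "middle.name"), sep = " ", fill = "right") %>%
  select(-middle.name)

# Names without a comma: first word is the last name, second word is the first name
df2 <- df %>%
  filter(!grepl(",", name)) %>%
  separate(name, c("last.name", "first.name"), sep = " ", extra = "drop", remove = FALSE)

# Recombine; rows end up grouped by case rather than in the original order
df <- rbind(df1, df2)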
