I have a list of names like "Mark M. Owens, M.D., M.P.H." that I would like to split into first names, last names, and titles. In this data, the titles (if there are any) always start after the first comma.

I am trying to split the list into:

FirstName  LastName  Titles
Mark       Owens     M.D.,M.P.H
Lara       Kraft     -
Dale       Good      C.P.A

Thanks in advance.

Here is my sample code:

namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames=sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', namelist)
titles = sub('.*,\\s*', '', namelist)
names <- data.frame(firstnames , lastnames, titles )

You can see that with this code, Mr. Owens is not behaving: his title is taken from after the last comma rather than the first, and his last name is picked up starting from P. You can tell that I referred to "Extract last word in string in R", "Extract 2nd to last word in string", and "Extract last word in a string after comma if there are multiple words else the first word".

user2627717

2 Answers

This should do the trick, at least on test data:

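# Split each entry on commas: the first piece is the name, the rest are titles
x <- strsplit(namelist, split = ",")
# Strip the single leading space left behind after each comma
x <- rapply(object = x, function(x) gsub(pattern = "^ ", replacement = "", x = x), how = "replace")

names <- sapply(x, function(y) y[[1]])
# Join any pieces after the first one back into a single titles string
titles <- sapply(x, function(y) if (length(y) > 1) {
    paste(y[-1], collapse = ",")
} else {
    ""
})
names <- strsplit(names, split = " ")
firstnames <- sapply(names, function(y) y[[1]])
# Take the last word rather than the third, so a name without a middle initial does not fail
lastnames <- sapply(names, function(y) y[[length(y)]])

names <- data.frame(firstnames, lastnames, titles)
names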

In cases like this, where every string has the same structure, it is easier to use functions like strsplit() to extract the desired parts.
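To make that structure visible, here is a minimal sketch of what strsplit() returns for two of the sample names. The variable names sample_names and parts are only illustrative, and trimws() is used here as one way to drop the space after each comma.

```r
# Two entries from the question: one with titles, one without
sample_names <- c("Mark M. Owens, M.D., M.P.H.", "Lara T. Kraft")

# strsplit() returns a list with one character vector per input string
parts <- strsplit(sample_names, split = ",")

# The first element has three pieces (name plus two titles), the second only one
piece_counts <- lengths(parts)

# Remove the whitespace that followed each comma
first_entry <- trimws(parts[[1]])

print(piece_counts)
print(first_entry)
```

The first piece is always the full name, and anything after it is a title, which is why the number of pieces tells you whether a title exists at all.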

Maksim Gayduk

You were off to a good start, so you can pick up from there. The firstnames variable was good as written. For lastnames I used a modified name list: nested inside the outer sub() call is another sub() that removes everything from the first comma onward, so the last name is then the final word in the string. For titles there is a two-step process: first remove everything up to and including the first comma, then replace the entries that did not change (those with no comma, and so no title) with a hyphen, -.

namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames <- sub(".*?(\\w+)$", "\\1", sub(",.*", "", namelist), perl=TRUE)
titles <- sub(".*?,", "", namelist)
titles <- ifelse(titles == namelist, "-", titles)

names <- data.frame(firstnames , lastnames, titles )
  firstnames lastnames        titles
1       Mark     Owens  M.D., M.P.H.
2       Dale      Good         C.P.A
3       Lara     Kraft             -
4     Roland      Bass           III
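
The difference between the titles pattern in the question and the one used here comes down to greedy versus lazy matching. The sketch below isolates that on a single name. The variable names are only illustrative, and perl = TRUE plus the trailing \\s* are my own additions so that both calls use the same regex engine and the leading space is dropped.

```r
one_name <- "Mark M. Owens, M.D., M.P.H."

# Greedy: .* consumes as much as possible, so the match runs to the LAST comma
greedy <- sub(".*,\\s*", "", one_name, perl = TRUE)

# Lazy: .*? consumes as little as possible, so the match stops at the FIRST comma
lazy <- sub(".*?,\\s*", "", one_name, perl = TRUE)

print(greedy)  # "M.P.H."
print(lazy)    # "M.D., M.P.H."
```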
Pierre L