
I have a vector of strings, each containing the last and first names of one or more authors. I would like to extract the last name of every author in each string. What I know is that each string starts with the last name of the first author, and the last name of every other author is whatever lies between a ; and the following ,. For example, in the following string:

tutu <- "goulenok, tiphaine miquel; meune, christophe; gossec, laure; dougados, maxime; kahan, andre; allanore, yannick"

I would like to extract:

"goulenok" "meune" "gossec" "dougados" "kahan" "allanore"

A last name may include punctuation characters such as ' or -, but it is always located between a ; and a , (or, for the first author, at the start of the string).

Any idea?

Waldir Leoncio
jejuba
  • Is your input all in one character string (suggested by your code), or is it a vector of strings (suggested by your question)? – Blue Magister Jan 14 '13 at 21:23
  • It is a vector of strings, actually. Thanks, – jejuba Jan 14 '13 at 21:30
  • It is a vector of strings, actually. The number of author names differs from string to string, and I need to put them in a data.frame in order of appearance. Thanks, – jejuba Jan 14 '13 at 21:40

3 Answers

> sub(",.*$", "", strsplit(tutu, ";[ ]+")[[1]])
[1] "goulenok" "meune"    "gossec"   "dougados" "kahan"    "allanore"
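Since the comments clarify that the real input is a vector of strings with varying numbers of authors, the same split-then-trim idea can be applied to each element. The following is only a sketch under stated assumptions, not part of the original answer: the helper name `extract_last_names`, the sample vector `authors`, and the NA-padded `author1`, `author2`, ... column layout are all hypothetical choices made for illustration.

```r
# Hypothetical sample input: one string per paper, varying author counts
authors <- c(
  "goulenok, tiphaine miquel; meune, christophe; gossec, laure",
  "o'brien, pat; smith-jones, alex",
  "kahan, andre"
)

# Split each string on ";" (plus any following spaces), then drop
# everything from the first comma onward, leaving only the last name
extract_last_names <- function(x) {
  lapply(strsplit(x, ";[ ]*"), function(parts) sub(",.*$", "", parts))
}

last_names <- extract_last_names(authors)

# Pad shorter results with NA so that each input string becomes one row,
# with the authors kept in order of appearance
n_max <- max(lengths(last_names))
padded <- do.call(rbind, lapply(last_names, function(v) {
  c(v, rep(NA_character_, n_max - length(v)))
}))
result <- as.data.frame(padded, stringsAsFactors = FALSE)
names(result) <- paste0("author", seq_len(n_max))
```

Keeping the intermediate result as a list preserves the differing number of authors per string; the padding step is only needed if a rectangular data.frame is required.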
Arun

Here is an approach that uses the gsubfn package:

library(gsubfn)

unlist(strapplyc(tutu, "(?:^|;) *([^,]+)"))
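The pattern above anchors on either the start of the string or a semicolon, skips spaces, and captures everything up to the next comma. If adding a package dependency is not desirable, a similar match can probably be obtained in base R with a PCRE pattern. This is a sketch of an alternative technique, not the gsubfn approach itself: it swaps the capture group for `\K`, which discards the text matched so far from the reported match, and it assumes `perl = TRUE` is acceptable.

```r
tutu <- "goulenok, tiphaine miquel; meune, christophe; gossec, laure; dougados, maxime; kahan, andre; allanore, yannick"

# (?:^|;)  start of the string or a semicolon
# \\s*     optional whitespace after the semicolon
# \\K      drop everything matched so far from the reported match
# [^,]+    the last name: every character up to the next comma
pattern <- "(?:^|;)\\s*\\K[^,]+"

# gregexpr() finds all matches; regmatches() extracts them as a list
# with one element per input string
last_names <- regmatches(tutu, gregexpr(pattern, tutu, perl = TRUE))[[1]]
```

Because `regmatches()` returns one list element per input string, the same call also works on a vector of strings if the trailing `[[1]]` is dropped.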
Greg Snow

This is a bit more blunt but also works:

sapply(unlist(lapply(strsplit(tutu, "; *"), strsplit, ","), recursive = FALSE), "[", 1)
Tyler Rinker