Split and Organize String of Names in R

Question

The Problem:

I am trying to split strings of unorganized baseball player lineups (10 players per lineup) into a data frame of ten organized columns. The problems I'm running into are 1) the order of the positions themselves isn't standard, and 2) the player names have variable formats - some names are "First Last" and others may be "First Last Jr".

The Question:

Is it possible to split these strings (LINEUPS) in a way to get to the organized data frame below (RESULT)?

Input:

LINEUPS <- c('OF Andrew Johnson P Victor Bailey OF Walter Hill 2B Carl Smith 3B Brian Rivera P Joseph Cox 1B Steven Parker SS William Gonzales OF Christopher Taylor C David Washington',
'SS James Roberts P Dennis Flores OF Jason Torres 2B Jack Rodriguez OF Randy Baker P Edward Anderson C David Washington 3B Thomas Wilson OF Ryan Walker 1B Robert Harris Jr',
'1B Howard Allen P Philip Hernandez OF Ryan Walker OF Christopher Taylor 2B Jack Rodriguez C Russell James 3B Brian Rivera P Joseph Cox OF Andrew Johnson SS Ralph Martinez',
'OF Justin Adams P Dennis Flores 1B Jerry Gray P Donald Brooks OF Johnny Lopez 2B Alan Jackson Jr OF Sean Turner C Raymond Stewart SS Ralph Martinez 3B Thomas Wilson',
'SS Arthur Foster 3B Timothy Mitchell P Joshua Watson OF Johnny Lopez P Edward Anderson C David Washington OF Justin Adams 1B Bruce Bell 2B Jack Rodriguez OF Sean Turner',
'OF Willie Davis C David Washington P Philip Hernandez SS Ralph Martinez 3B Thomas Wilson OF Johnny Lopez 1B Howard Allen OF George Perez 2B Alan Jackson Jr P Eric Hall',
'3B Timothy Mitchell P Edward Anderson OF Sean Turner OF Andrew Johnson P Victor Bailey C Paul Robinson SS Ralph Martinez 2B Carl Smith 1B Howard Allen OF Justin Adams',
'3B Brian Rivera SS Mark Green Jr 1B Robert Harris Jr P Joshua Watson OF Christopher Taylor OF Patrick Perry OF John King 2B Peter Phillips C Terry Scott P Joseph Cox',
'OF Lawrence Carter 2B Peter Phillips SS Arthur Foster 1B Matthew Campbell P Fred Nelson 3B Jesse Young OF Louis Powell OF Patrick Perry P Philip Hernandez C Terry Scott',
'1B Jerry Gray OF Willie Davis 2B Alan Jackson Jr 3B Thomas Wilson C Wayne Barnes OF Louis Powell OF Randy Baker P Dennis Flores SS William Gonzales P Fred Nelson')

Desired Result:

P1 <- c('Victor Bailey','Dennis Flores','Philip Hernandez','Dennis Flores','Joshua Watson','Philip Hernandez','Edward Anderson','Joshua Watson','Fred Nelson','Dennis Flores')
P2 <- c('Joseph Cox','Edward Anderson','Joseph Cox','Donald Brooks','Edward Anderson','Eric Hall','Victor Bailey','Joseph Cox','Philip Hernandez','Fred Nelson')
C <- c('David Washington','David Washington','Russell James','Raymond Stewart','David Washington','David Washington','Paul Robinson','Terry Scott','Terry Scott', 'Wayne Barnes')
"1B" <- c('Steven Parker','Robert Harris Jr', 'Howard Allen','Jerry Gray','Bruce Bell', 'Howard Allen', 'Howard Allen','Robert Harris Jr','Matthew Campbell','Jerry Gray')
"2B" <- c('Carl Smith','Jack Rodriguez','Jack Rodriguez','Alan Jackson Jr','Jack Rodriguez','Alan Jackson Jr','Carl Smith','Peter Phillips','Peter Phillips','Alan Jackson Jr')
"3B" <- c('Brian Rivera','Thomas Wilson','Brian Rivera','Thomas Wilson','Timothy Mitchell','Thomas Wilson','Timothy Mitchell','Brian Rivera','Jesse Young','Thomas Wilson')
SS <- c('William Gonzales','James Roberts','Ralph Martinez','Ralph Martinez','Arthur Foster','Ralph Martinez','Ralph Martinez','Mark Green Jr','Arthur Foster','William Gonzales')
OF1 <- c('Andrew Johnson','Jason Torres','Ryan Walker','Justin Adams','Johnny Lopez', 'Willie Davis','Sean Turner','Christopher Taylor','Lawrence Carter','Willie Davis')
OF2 <- c('Walter Hill','Randy Baker','Christopher Taylor','Johnny Lopez','Justin Adams','Johnny Lopez','Andrew Johnson','Patrick Perry','Louis Powell','Louis Powell')
OF3 <- c('Christopher Taylor','Ryan Walker','Andrew Johnson', 'Sean Turner', 'Sean Turner','George Perez','Justin Adams','John King','Patrick Perry','Randy Baker')

RESULT <- data.frame(P1, P2, C, `1B`, `2B`, `3B`, SS, OF1, OF2, OF3, check.names = FALSE)

Any help and guidance is much appreciated. Thank you!

MrFlick · Accepted Answer · 2020-07-26T04:15:29.307

Here's a way to "tidy" up the data into a proper data.frame:

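# Find where each position label (P, C, OF, SS, 1B, 2B, 3B) starts in every lineup
mm <- gregexpr("\\b(P|C|OF|SS|1B|2B|3B)\\b", LINEUPS)
players <- do.call("rbind", unname(Map(function(x, m, i) {
  pstart <- m                                     # first character of each label
  pend <- pstart + attr(m, "match.length") - 1    # last character of each label
  hstart <- pend + 1                              # name text begins right after the label
  hend <- c(tail(pstart, -1) - 1, nchar(x))       # and ends just before the next label
  data.frame(game = i, pos = substring(x, pstart, pend),
             name = substring(x, hstart, hend), stringsAsFactors = FALSE)
}, LINEUPS, mm, seq_along(LINEUPS))))
# trimws() strips both leading and trailing whitespace
players$pos <- trimws(players$pos)
players$name <- trimws(players$name)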

This looks for the special position values and extracts whatever text comes after each one, up to the next position value. We then trim off any surrounding white space, which gives something that looks like this:

# head(players)
  game pos           name
1    1  OF Andrew Johnson
2    1   P  Victor Bailey
3    1  OF    Walter Hill
4    1  2B     Carl Smith
5    1  3B   Brian Rivera
6    1   P     Joseph Cox
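
To make the matching step easier to follow, here is a minimal base R sketch on a single shortened lineup. The string `x` and the names `lens`, `labels`, `names_raw` and `players_demo` are made up for illustration and are not part of the original data or answer.

```r
# Hypothetical shortened lineup, used only for illustration
x <- "OF Andrew Johnson P Victor Bailey 1B Robert Harris Jr"

# gregexpr() returns the start of every match, with the match lengths as an attribute
m <- gregexpr("\\b(P|C|OF|SS|1B|2B|3B)\\b", x)[[1]]
starts <- as.integer(m)
lens <- attr(m, "match.length")

# The label itself spans starts .. starts + lens - 1
labels <- substring(x, starts, starts + lens - 1)

# Each name runs from just after its label to just before the next label
# (or to the end of the string for the last player)
names_raw <- substring(x, starts + lens, c(starts[-1] - 1, nchar(x)))

players_demo <- data.frame(pos = labels, name = trimws(names_raw),
                           stringsAsFactors = FALSE)
print(players_demo)
```

Because the name is simply "everything up to the next label", suffixes such as `Jr` stay attached to the player without any special handling.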

Next we can do some manipulations with dplyr to get it into the shape you want.

library(dplyr)
library(tidyr)

players %>% 
  group_by(game, pos) %>% 
  mutate(pos=if_else(rep(n(),n())>1, paste0(pos, row_number()), pos)) %>% 
  ungroup() %>% 
  pivot_wider(id_cols=game, names_from=pos, values_from=name)

The mutate is a bit of a mess, but basically it just adds an index to the positions that occur more than once within a game (so P becomes P1 and P2, and OF becomes OF1, OF2 and OF3). Then pivot_wider turns it from long format to wide format. It looks more like this now:

    game OF1     P1      OF2     `2B`   `3B`   P2     `1B`   SS     OF3    C     
   <int> <chr>   <chr>   <chr>   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
 1     1 Andrew~ Victor~ Walter~ Carl ~ Brian~ Josep~ Steve~ Willi~ Chris~ David~
 2     2 Jason ~ Dennis~ Randy ~ Jack ~ Thoma~ Edwar~ Rober~ James~ Ryan ~ David~
 3     3 Ryan W~ Philip~ Christ~ Jack ~ Brian~ Josep~ Howar~ Ralph~ Andre~ Russe~
 4     4 Justin~ Dennis~ Johnny~ Alan ~ Thoma~ Donal~ Jerry~ Ralph~ Sean ~ Raymo~
 5     5 Johnny~ Joshua~ Justin~ Jack ~ Timot~ Edwar~ Bruce~ Arthu~ Sean ~ David~
 6     6 Willie~ Philip~ Johnny~ Alan ~ Thoma~ Eric ~ Howar~ Ralph~ Georg~ David~
 7     7 Sean T~ Edward~ Andrew~ Carl ~ Timot~ Victo~ Howar~ Ralph~ Justi~ Paul ~
 8     8 Christ~ Joshua~ Patric~ Peter~ Brian~ Josep~ Rober~ Mark ~ John ~ Terry~
 9     9 Lawren~ Fred N~ Louis ~ Peter~ Jesse~ Phili~ Matth~ Arthu~ Patri~ Terry~
10    10 Willie~ Dennis~ Louis ~ Alan ~ Thoma~ Fred ~ Jerry~ Willi~ Randy~ Wayne~
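
The indexing logic can also be sketched in base R, which may make it easier to see what the `mutate()` step does. This is an alternative illustration using `ave()`, not the code from the answer, and the data frame `toy` (including its game/player combinations) is invented for the example.

```r
# Hypothetical long-format data: game 1 has two pitchers, game 2 has one
toy <- data.frame(
  game = c(1, 1, 1, 2, 2),
  pos  = c("P", "C", "P", "P", "C"),
  name = c("Victor Bailey", "David Washington", "Joseph Cox", "Eric Hall", "Terry Scott"),
  stringsAsFactors = FALSE
)

idx <- seq_len(nrow(toy))
# How many times each position occurs within its game
n_in_group <- ave(idx, toy$game, toy$pos, FUN = length)
# Running index of each occurrence within its game/position group
k_in_group <- ave(idx, toy$game, toy$pos, FUN = seq_along)

# Only positions that occur more than once in a game get a numeric suffix
toy$pos_indexed <- ifelse(n_in_group > 1, paste0(toy$pos, k_in_group), toy$pos)
print(toy)
```

One consequence of suffixing only repeated positions is that a game with a single pitcher would produce a column named `P` rather than `P1`. In the lineups from the question every game has two pitchers and three outfielders, so this does not arise there.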

As far as name order goes, there's no way for a computer to know which is a first name and which is a last name unless you provide some sort of list.

That's awesome! The players data.frame is correctly pivoted in the RStudio console but is still in the long format in the data.frame itself. Is there a way to keep them both in the same wide format? Never mind, answered my own question: a simple `players <- players %>% ...` does the trick. LOL — Eric_Alan, Jul 26 '20 at 03:48
Well, just save the result of the transformation to a variable. Like `wideplayers <- players %>% ...` etc. — MrFlick, Jul 26 '20 at 03:49
I'm running into trouble with players whose initials are the same as one of the positions: `SS J.P. Beard`, for example, returns the name as only `J.`. How do I account for those situations? — Eric_Alan, Jul 26 '20 at 08:22
You could change the regular expression to `"\\b(P|C|OF|SS|1B|2B|3B)\\s"` to make sure there is a space immediately following the position. This should ignore the P in your example since it's followed by a period and not a space. — MrFlick, Jul 26 '20 at 23:14
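
A small base R sketch of the difference between the two patterns is shown here. The lineup fragment `x` is hypothetical, built from the `SS J.P. Beard` example in the comment above.

```r
# Hypothetical lineup fragment with initials that collide with a position label
x <- "SS J.P. Beard OF John King"

# Original pattern: a word boundary after the label also matches the "P" in "J.P."
loose_hits <- regmatches(x, gregexpr("\\b(P|C|OF|SS|1B|2B|3B)\\b", x))[[1]]

# Stricter pattern: the label must be followed by whitespace, so "P." is skipped
strict_hits <- regmatches(x, gregexpr("\\b(P|C|OF|SS|1B|2B|3B)\\s", x))[[1]]

print(loose_hits)
print(trimws(strict_hits))
```

Note that with the stricter pattern each match includes the trailing space, so `match.length` is one character longer than the label. Any code that derives the label or the start of the name from `match.length` needs to account for that extra character, for example by trimming the extracted label.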
