I am going through a dataset containing text values (names) that are formatted like this:

M.Joan (13-2)  
A.Alfred (20-13)  
F.O'Neil (12-231)  
D.Dan Fun (23-3)
T.Collins (51-82) J.Maddon (12-31)

Some strings contain two names, like:

 M.Joan (13-2) A.Alfred (20-13)

I only want to extract the names from each string. Some names are easy to extract because they contain no spaces. Others are harder because they contain a space, such as D.Dan Fun above.

library(stringr)  # provides str_extract_all
name_pattern = "[A-Z][.][^ (]{1,}"
base <- str_extract_all(baseball1$Managers, name_pattern)

When I use this code to extract the names, it works well even for names with spaces or punctuation. However, the extracted names have a trailing space. Is there a way to match the exact sequence " (", that is, a space followed by an opening parenthesis, so that the match stops before it?

Output:

[[1]]
[1] "Z.Taylor "

[[2]]
[1] "Z.Taylor "

[[3]]
[1] "Z.Taylor "

[[4]]
[1] "Z.Taylor "

[[5]]
[1] "Y.Berra "

[[6]]
[1] "Y.Berra "
Wilcar
Maryjoan
  • Isn't it easier to remove final `(...)`? Use `sub("\\s*\\([^()]*\\)\\s*$", "", baseball1$Managers)` – Wiktor Stribiżew Sep 01 '17 at 12:08
  • @WiktorStribiżew That is easier; however, I am required to use str_extract_all. Also, some of the strings contain two names, like "T.Collins (51-82) J.Maddon (12-31)", and using that code would output: "T.Collins (51-82) and J.Maddon" – Maryjoan Sep 01 '17 at 12:11
  • Try `name_pattern = "[A-Z][.][^\\s(]{1,}"` – Christoph Wolk Sep 01 '17 at 12:32

1 Answer

You may use

x <- c("M.Joan (13-2) ", "A.Alfred (20-13)", "F.O'Neil (12-231)", "D.Dan Fun (23-3)", "T.Collins (51-82) J.Maddon (12-31)", "T.Hillman (12-34) and N.Yost (23-45)")
regmatches(x, gregexpr("\\p{Lu}.*?(?=\\s*\\()", x, perl=TRUE))

See the regex demo

Or the str_extract_all version:

str_extract_all(baseball1$Managers, "\\p{Lu}.*?(?=\\s*\\()")

It matches

  • \p{Lu} - an uppercase letter
  • .*? - any chars other than line break chars, as few as possible, up to the first position where the lookahead below succeeds (the text it checks is not included in the match, as (?=...) is a non-consuming construct)
  • (?=\\s*\\() - positive lookahead that, immediately to the right of the current location, requires the presence of:
    • \\s* - 0+ whitespace chars
    • \\( - a literal (.
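
As a quick check of the breakdown above, here is a minimal base R sketch that applies the same pattern to sample strings from the question. The variable names `lookahead_pattern` and `names_found` are illustrative only, and the sample vector stands in for `baseball1$Managers`, which is not available here.

```r
# Sample strings copied from the question; a stand-in for baseball1$Managers.
x <- c("M.Joan (13-2) ",
       "A.Alfred (20-13)",
       "F.O'Neil (12-231)",
       "D.Dan Fun (23-3)",
       "T.Collins (51-82) J.Maddon (12-31)")

# Uppercase letter, then as few chars as possible, stopping right before
# optional whitespace followed by "(" (the lookahead consumes nothing).
lookahead_pattern <- "\\p{Lu}.*?(?=\\s*\\()"

# perl = TRUE is needed for \p{Lu} and lookaheads in base R.
names_found <- regmatches(x, gregexpr(lookahead_pattern, x, perl = TRUE))

print(names_found[[4]])  # "D.Dan Fun"  (inner space kept, no trailing space)
print(names_found[[5]])  # "T.Collins" "J.Maddon"
```

Because the whitespace before the parenthesis is only checked by the lookahead, it never becomes part of the match, so no trimming step is needed afterwards.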
Wiktor Stribiżew
  • Strange: using the str_extract_all call that you provided, some strings output this: "T.Hillman" "and N.Yost" – Maryjoan Sep 01 '17 at 12:23
  • It is not strange. If names start with uppercase letters always, just replace `\p{L}` with `\p{Lu}`. – Wiktor Stribiżew Sep 01 '17 at 12:24
  • Great. If you ever want to experiment with `gregexpr`, and you have Unicode letters in the names, it is safer to add `(*UCP)` at the pattern start: `"(*UCP)\\p{Lu}.*?(?=\\s*\\()"`. You do not need that in *stringr* method as ICU regex is already Unicode aware. To some extent. – Wiktor Stribiżew Sep 01 '17 at 12:27