regular expression to find exact matching containing a space and a punctuation

Question

I am going through a dataset containing text values (names) that are formatted like these examples:

M.Joan (13-2)  
A.Alfred (20-13)  
F.O'Neil (12-231)  
D.Dan Fun (23-3)
T.Collins (51-82) J.Maddon (12-31)

Some strings have two names in them, like

 M.Joan (13-2) A.Alfred (20-13)

I only want to extract the names from each string. Some names are easy to extract because they contain no spaces. Others are harder because the name itself contains a space, like D.Dan Fun above.

name_pattern = "[A-Z][.][^ (]{1,}"
base <- str_extract_all(baseball1$Managers, name_pattern)

When I use this code to extract the names, it works well even for names with spaces or punctuation. However, the extracted names have a space at the end. Can I match up to the exact sequence " (", i.e. a space followed by an opening parenthesis, so that the trailing space is not included?

Output:

[[1]]
[1] "Z.Taylor "

[[2]]
[1] "Z.Taylor "

[[3]]
[1] "Z.Taylor "

[[4]]
[1] "Z.Taylor "

[[5]]
[1] "Y.Berra "

[[6]]
[1] "Y.Berra "

Isn't it easier to remove the final `(...)`? Use `sub("\\s*\\([^()]*\\)\\s*$", "", baseball1$Managers)` – Wiktor Stribiżew, Sep 01 '17 at 12:08
@WiktorStribiżew That is easier; however, I am required to use str_extract_all. Also, some of the strings contain two names, like "T.Collins (51-82) J.Maddon (12-31)", and using that code would output: "T.Collins (51-82) and J.Maddon" – Maryjoan, Sep 01 '17 at 12:11

Wiktor Stribiżew · Accepted Answer · 2017-09-01T12:25:58.247


You may use

x <- c("M.Joan (13-2) ", "A.Alfred (20-13)", "F.O'Neil (12-231)", "D.Dan Fun (23-3)", "T.Collins (51-82) J.Maddon (12-31)", "T.Hillman (12-34) and N.Yost (23-45)")
regmatches(x, gregexpr("\\p{Lu}.*?(?=\\s*\\()", x, perl=TRUE))


Or the str_extract_all version:

str_extract_all(baseball1$Managers, "\\p{Lu}.*?(?=\\s*\\()")


The pattern matches the following (in an R string literal, each backslash is doubled):

\p{Lu} - an uppercase letter
.*? - any characters other than line break characters, as few as possible, up to (but not including) the text required by the lookahead that follows
(?=\s*\() - a positive lookahead, a non-consuming construct that requires, immediately to the right of the current location, the presence of:
- \s* - zero or more whitespace characters
- \( - a literal (
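To see how the lookahead keeps the trailing space out of the match, here is a minimal sketch using stringr. The vector `managers` is a made-up stand-in for `baseball1$Managers`, built from the example strings in the question.

```r
library(stringr)

# Hypothetical sample data modelled on the question's examples; the trailing
# space on the first element is deliberate.
managers <- c("M.Joan (13-2) ",
              "D.Dan Fun (23-3)",
              "T.Collins (51-82) J.Maddon (12-31)",
              "T.Hillman (12-34) and N.Yost (23-45)")

# Uppercase letter, then as few characters as possible, stopping right before
# optional whitespace followed by an opening parenthesis.
name_pattern <- "\\p{Lu}.*?(?=\\s*\\()"

extracted <- str_extract_all(managers, name_pattern)
print(extracted)
# [[1]] "M.Joan"
# [[2]] "D.Dan Fun"
# [[3]] "T.Collins" "J.Maddon"
# [[4]] "T.Hillman" "N.Yost"
```

Because the lookahead does not consume the space or the parenthesis, the name keeps its internal spaces ("D.Dan Fun") but never ends with one.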

edited Sep 01 '17 at 12:25

answered Sep 01 '17 at 12:19


Strange: using the str_extract_all call that you provided, some of the outputs look like this: "T.Hillman" "and N.Yost" – Maryjoan Sep 01 '17 at 12:23
It is not strange. If names start with uppercase letters always, just replace `\p{L}` with `\p{Lu}`. – Wiktor Stribiżew Sep 01 '17 at 12:24
Great. If you ever want to experiment with `gregexpr`, and you have Unicode letters in the names, it is safer to add `(*UCP)` at the pattern start: `"(*UCP)\\p{Lu}.*?(?=\\s*\\()"`. You do not need that in *stringr* method as ICU regex is already Unicode aware. To some extent. – Wiktor Stribiżew Sep 01 '17 at 12:27
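
For reference, the base R variant with the `(*UCP)` prefix mentioned in the last comment could look like the sketch below. The input vector `x` is made up for illustration, and its first element uses a `\u` escape so that the non-ASCII letter does not depend on the encoding of the source file.

```r
# Hypothetical input; "\u0141" is the uppercase letter Ł.
x <- c("\u0141.Nowak (10-5)", "M.Joan (13-2) A.Alfred (20-13)")

# (*UCP) asks PCRE to use Unicode properties; perl = TRUE is required for
# both \p{Lu} and the lookahead.
ucp_pattern <- "(*UCP)\\p{Lu}.*?(?=\\s*\\()"

result <- regmatches(x, gregexpr(ucp_pattern, x, perl = TRUE))
print(result)
```

`regmatches` returns one character vector per input string, so the result has the same list shape as the output of `str_extract_all`.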
