What makes this a bit difficult is that you can't nest a lookahead inside a group that is then repeated with {2,3}. Unfortunately, the best I can do is write the pattern out longhand.
stringr::str_extract_all(test_string,"(?<!([A-Z][^ ]{0,20} ))([A-Z][^ ,.]*)[ ,.]([A-Z][^ ,.]*)([ ,.]([A-Z][^ ,.]*))?(?=([ ,.]|$))(?!( [A-Z]))")
Results:
[[1]]
[1] "Andrew Smith" "Samuel L Jackson" "DEREK JETER" "MIKE NELSON TROUT"
This uses a negative lookbehind, a lookahead, and a negative lookahead to check that the matched words are neither preceded nor followed by another capitalized word. The explanation below spreads the pattern out for legibility.
# Negative lookbehind to make sure the first word in our sequence isn't preceded by a word that
# starts with a capital and has up to 20 more characters.
# Note: Lookbehind needs a bounded quantifier such as {0,20} and won't work with * or +
(?<!([A-Z][^ ]{0,20} ))
# A word starting with a capital, followed by 0 or more characters that aren't a space, period,
# or comma.
([A-Z][^ ,.]*)
# A space, a period, or a comma.
[ ,.]
# A word starting with a capital, followed by 0 or more characters that aren't a space, period, or
# comma.
([A-Z][^ ,.]*)
# Maybe a third word indicated by a space/period/comma followed by a word starting with a
# capital...
([ ,.]([A-Z][^ ,.]*))?
# Lookahead to make sure the last character in the match is followed by a space, comma,
# period, or the end of the string. (Don't cut words in half.)
(?=([ ,.]|$))
# Negative lookahead to make sure there isn't another word starting with a capital after
# our word sequence.
(?!( [A-Z]))
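
If the one-line pattern is hard to maintain, the same pieces can be assembled with paste0() so that each part carries a name. This is only a sketch: the variable names and the sample sentence are made up for illustration, and the check against the original pattern is there to confirm nothing was lost when splitting it up.

```r
library(stringr)

# Named pieces of the pattern (the names are illustrative, not from stringr).
not_after_cap_word <- "(?<!([A-Z][^ ]{0,20} ))"  # no capitalized word directly before
cap_word           <- "([A-Z][^ ,.]*)"           # a word starting with a capital
sep                <- "[ ,.]"                    # a space, comma, or period
optional_third     <- paste0("(", sep, cap_word, ")?")
ends_on_boundary   <- "(?=([ ,.]|$))"            # don't cut a word in half
not_before_cap     <- "(?!( [A-Z]))"             # no capitalized word directly after

pattern <- paste0(
  not_after_cap_word,
  cap_word, sep, cap_word,
  optional_third,
  ends_on_boundary,
  not_before_cap
)

# The assembled pattern should be identical to the longhand one-liner.
original <- "(?<!([A-Z][^ ]{0,20} ))([A-Z][^ ,.]*)[ ,.]([A-Z][^ ,.]*)([ ,.]([A-Z][^ ,.]*))?(?=([ ,.]|$))(?!( [A-Z]))"
stopifnot(identical(pattern, original))

# Hypothetical input, not the test_string from the question.
sample_text <- "we met Andrew Smith at the park"
matches <- str_extract_all(sample_text, pattern)[[1]]
print(matches)
```

Because the result is the same string, it can be passed to stringr::str_extract_all() exactly as before; the only gain is that each piece can be read or changed on its own.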