To save others from thinking they can beat this problem: it can't be done without accepting either false positives or false negatives. To add to what James Curran said, you either declare "Smith" the start of a sentence in "We went to Dr. Smith's office.", or you read "This sentence is English. So is this one." as a single sentence.
On top of those problems, different forms of abbreviations and Overeager Capitalization Of Every Word Can Kill Your Algorithm Or Regex.
That said, I might as well share the regexes I came up with.
The first regex is simple enough:
(?m)(?:^|[.!?][\t ]+)([A-Z]\S*)
It matches either the start of a line, or one of .!? followed by at least one tab or space. After that, a capital letter is matched along with the rest of the word (including dots, so that abbreviations are matched).
The first word of the sentence will be caught in group 1.
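As a quick illustration, here is a minimal Python sketch of the first regex in action. The sample text and the name FIRST_WORD are made up for this example. Note that a line break directly after the ? does not count as a separator; the next line is picked up by the ^ branch instead.

```python
import re

# The first regex, unchanged. (?m) makes ^ match at the start of every line.
FIRST_WORD = re.compile(r"(?m)(?:^|[.!?][\t ]+)([A-Z]\S*)")

# Hypothetical sample text, two lines long.
text = "Hello there! How are you?\nI'm fine. Thanks for asking."

# With a single capturing group, findall returns the contents of group 1.
print(FIRST_WORD.findall(text))  # ['Hello', 'How', "I'm", 'Thanks']
```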
The second regex:
(?m)[A-Z]\S*\.[^\S\r\n]+[A-Z]|(?:^|[.!?][\t ]+)([A-Z]\S*)
This is the previous regex, prepended with [A-Z]\S*\.[^\S\r\n]+[A-Z]|. This part matches a word starting with a capital and ending in a dot, followed by some whitespace (but no line break) and another capital letter. Because the first part consumes that text without capturing anything, the second part no longer gets a chance to match it (explained in-depth here). The first word of the sentence will again be caught in group 1, but only for matches made by the second part.
The first regex has false positives: it will wrongly match "Smith's" in the second half of the sentence "We went to Dr. Smith's office."
The second regex has false negatives: it will fail to match "So" in "This sentence is English. So is this one."
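The trade-off can be reproduced with a short Python sketch. The helper sentence_starts and the variable names are made up for this example; the two patterns are the ones given above. Matches made by the "discard" branch of the second regex leave group 1 empty, so they are filtered out.

```python
import re

FIRST = re.compile(r"(?m)(?:^|[.!?][\t ]+)([A-Z]\S*)")
SECOND = re.compile(
    r"(?m)[A-Z]\S*\.[^\S\r\n]+[A-Z]|(?:^|[.!?][\t ]+)([A-Z]\S*)"
)

def sentence_starts(pattern, text):
    """Return every group 1 capture, skipping matches where group 1 is unset."""
    return [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]

abbreviation = "We went to Dr. Smith's office."
two_sentences = "This sentence is English. So is this one."

# First regex: finds both real sentence starts, plus a false positive after "Dr."
print(sentence_starts(FIRST, abbreviation))    # ['We', "Smith's"]
print(sentence_starts(FIRST, two_sentences))   # ['This', 'So']

# Second regex: no false positive, but "English. S" is swallowed as well
print(sentence_starts(SECOND, abbreviation))   # ['We']
print(sentence_starts(SECOND, two_sentences))  # ['This']
```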
Test the regexes here.