Extract first sentence in string

Question

I want to extract the first sentence from the following string with a regex. The rule I want to implement (which I know won't be a universal solution) is to extract from the string start ^ up to and including the first period, exclamation mark, or question mark that is preceded by a lowercase letter or a number.

require(stringr)

x = "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."

My best attempt so far is a non-greedy string-before-match approach, but it fails in this case:

str_extract(x, '.+?(?=[a-z0-9][.?!] )')
[1] NA

Any tips much appreciated.

Try [`sub("([a-z0-9][?!.]).*", "\\1", x)`](https://regex101.com/r/VxgRek/1) — Wiktor Stribiżew, Feb 20 '18 at 12:02
Thanks Wiktor. That certainly works with this example. Why is the last `1` of `October 11.` not also matched? — geotheory, Feb 20 '18 at 12:05
Not sure what you mean, you only need to check for a single digit or lowercase letter, right? — Wiktor Stribiżew, Feb 20 '18 at 12:06

score 6 · Accepted Answer · answered Feb 20 '18 at 12:08


You put [a-z0-9][.?!] inside a non-consuming lookahead, so it is checked but never becomes part of the match. Make it consuming if you plan to use str_extract:

> str_extract(x, '.*?[a-z0-9][.?!](?= )')
[1] "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11."


Details

.*? - any 0+ chars other than line break chars, as few as possible
[a-z0-9] - an ASCII lowercase letter or a digit
[.?!] - a ., ? or !
(?= ) - a positive lookahead that requires a literal space immediately to the right, without adding it to the match.
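
The difference between a lookahead-only pattern and the consuming pattern is easier to see on a short string. The following is a minimal sketch, not part of the original answer: it uses base R's regexpr() and regmatches() with perl = TRUE in place of stringr::str_extract, and both the sample string s and the helper name first_match are made up for illustration.

```r
# Hypothetical helper: return the first regex match in s, or NA if there is none.
first_match <- function(pattern, s) {
  m <- regexpr(pattern, s, perl = TRUE)
  if (m == -1) NA_character_ else regmatches(s, m)
}

# Made-up sample string
s <- "It ended on day 11. Next one."

# Lookahead only: the letter/digit and the punctuation are checked but not
# consumed, so the match stops just before them.
lookahead_only <- first_match(".+?(?=[a-z0-9][.?!] )", s)

# Consuming: the letter/digit and the punctuation are part of the match;
# only the trailing space stays inside the lookahead.
consuming <- first_match(".*?[a-z0-9][.?!](?= )", s)

print(lookahead_only)
print(consuming)
```

On this sample the lookahead-only version loses the final digit and the period, while the consuming version keeps both.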

Alternatively, you may use sub:

sub("([a-z0-9][?!.])\\s.*", "\\1", x)


Details

([a-z0-9][?!.]) - Group 1 (referred to with \1 from the replacement pattern): an ASCII lowercase letter or digit and then a ?, ! or .
\s - a whitespace character (written as \\s inside an R string literal)
.* - any 0+ chars, as many as possible (up to the end of string).
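
One behaviour worth keeping in mind with the sub approach: when the pattern does not match, sub returns its input unchanged rather than NA. A small sketch on made-up strings (the names samples and first_sentences are only for illustration):

```r
# Made-up inputs; the last one has no lowercase letter/digit + terminator + whitespace.
samples <- c(
  "U.S. President Bush spoke today. More followed.",
  "It ended on day 11. Next one.",
  "no terminator here"
)

# sub is vectorised: keep Group 1 and drop the whitespace and everything after it.
first_sentences <- sub("([a-z0-9][?!.])\\s.*", "\\1", samples)
print(first_sentences)
```

If a missing match needs to be detected, checking the input with grepl and the same pattern before calling sub is one option.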

answered Feb 20 '18 at 12:08

Wiktor Stribiżew


Doesn't work with sentences with `Mr.` and `Dr.` in the first sentence - eg. `Mr. Mahendra Prasad, a politician from Janata Dal (United) party, is a Member of the Parliament of India representing Bihar in the Rajya Sabha, the upper house of the Parliament . Second sentence etc.` – jaggi Apr 15 '18 at 18:23
I think it should handle titles so that they are not matched as the end of a sentence - https://en.wikipedia.org/wiki/Title – jaggi Apr 15 '18 at 19:07
@jaggi Why do you add `{2}` to my solution? It is not in line with the current OP requirements. See *The rule I want to implement (**which I know won't be universal solution**) is to extract from string start `^` up to (including) the first period/exclamation/question mark that is preceded by a lowercase letter or number.* The sentences above are not the ones the OP is trying to match, so yours is a different question. – Wiktor Stribiżew Apr 15 '18 at 19:23
Ah, I see. I don't want to open a new question just for this, so I'll leave it here in the comments for reference, in case anybody else needs to handle 2-char titles to get a `valid` first sentence: `([a-z0-9]{2}[?!.])\s.*` – jaggi Apr 15 '18 at 19:32
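
To illustrate the comment thread above, here is a hedged comparison of the answer's pattern with the `{2}` variant on made-up sentences. The `{2}` version skips `Mr.` and `Dr.` because it demands two lowercase letters or digits before the punctuation. A caveat that is not raised in the thread: it would still stop at a title such as `Prof.`, which ends in two lowercase letters.

```r
# Made-up sample sentence starting with a two-character title
s <- "Mr. Smith arrived at noon. He left early."

# Answer's pattern: stops at the first lowercase letter + period, i.e. inside "Mr."
one_char <- sub("([a-z0-9][?!.])\\s.*", "\\1", s)

# Comment's variant: needs two lowercase letters/digits before the punctuation.
two_char <- sub("([a-z0-9]{2}[?!.])\\s.*", "\\1", s)

# Still too eager for titles that end in two lowercase letters.
prof <- sub("([a-z0-9]{2}[?!.])\\s.*", "\\1", "Prof. Lee spoke. Then she left.")

print(c(one_char, two_char, prof))
```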

score 3 · Answer 2 · answered Feb 20 '18 at 13:19

The corpus package has special handling for abbreviations when determining sentence boundaries:

library(corpus)
text_split(x, "sentences")
#>   parent index text
#> 1 1          1 Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of Oct…
#> 2 1          2 The death toll has now risen to at least 187.

There's also a useful dataset of common abbreviations for many languages, including English. See corpus::abbreviations_en, which can be used to disambiguate sentence boundaries.
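
A minimal sketch of pulling out just the first sentence with corpus, assuming the package is installed and that the text column returned by text_split can be converted with as.character. The trimws call is a precaution in case the sentence keeps its trailing whitespace, and the shortened string x2 is made up for this example.

```r
# Made-up, shortened variant of the question's string
x2 <- "U.S. President George W. Bush condemned the bombing of October 11. The death toll has risen."

# Guard so the sketch is skipped cleanly when corpus is not installed.
if (requireNamespace("corpus", quietly = TRUE)) {
  # Split into sentences; the result has parent, index and text columns.
  sents <- corpus::text_split(x2, "sentences")

  # Take the first sentence as a plain character string.
  first_sentence <- trimws(as.character(sents$text)[1])
  print(first_sentence)

  # Peek at the English abbreviation list mentioned above.
  print(head(corpus::abbreviations_en))
}
```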

Very useful. I've been hacking through horrible regex to deal with `Dr.` etc — geotheory, Feb 20 '18 at 14:23
