
I want to extract the first sentence from the following text with a regex. The rule I want to implement (which I know won't be a universal solution) is to extract from the string start ^ up to and including the first period, exclamation mark, or question mark that is preceded by a lowercase letter or a digit.

require(stringr)

x = "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."

My best guess so far has been a non-greedy string-before-match approach, but it fails in this case:

str_extract(x, '.+?(?=[a-z0-9][.?!] )')
[1] NA

Any tips much appreciated.

geotheory

2 Answers


You put [a-z0-9][.?!] inside a non-consuming lookahead, so it is checked but never becomes part of the match. Make it consuming if you plan to use str_extract:

> str_extract(x, '.*?[a-z0-9][.?!](?= )')
[1] "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11."


Details

  • .*? - any 0+ chars other than line break chars, as few as possible
  • [a-z0-9] - an ASCII lowercase letter or a digit
  • [.?!] - a ., ? or !
  • (?= ) - a positive lookahead that requires a literal space immediately after the punctuation, without adding it to the match.
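One thing to watch with the lookahead (?= ) is that it needs a space after the punctuation, so a sentence that ends the string is not matched. Here is a minimal base R sketch of the same idea that also accepts end of string. The helper name first_sentence and the (?=\\s|$) variation are my own assumptions for illustration, not part of the pattern above:

```r
# Hypothetical helper: same consuming pattern, but the lookahead also
# accepts any whitespace or the end of the string.
first_sentence <- function(text) {
  m <- regexpr("^.*?[a-z0-9][.?!](?=\\s|$)", text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  hit <- !is.na(m) & m > -1          # elements where the pattern matched
  out[hit] <- regmatches(text, m)    # regmatches() returns only the matches
  out
}

x <- "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."

first_sentence(x)
first_sentence("Short one.")           # matched thanks to the `$` alternative
first_sentence("No terminator here")   # NA, nothing to match
```

Like str_extract, this returns NA when there is no match, so missing results are easy to spot.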

Alternatively, you may use sub:

sub("([a-z0-9][?!.])\\s.*", "\\1", x)


Details

  • ([a-z0-9][?!.]) - Group 1 (referred to with \1 from the replacement pattern): an ASCII lowercase letter or digit and then a ?, ! or .
  • \s - a whitespace character
  • .* - any 0+ chars, as many as possible (up to the end of string).
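One difference from str_extract is worth keeping in mind: sub returns its input unchanged when the pattern does not match, whereas str_extract returns NA. A small sketch with made-up example strings (two, one and the variable names are assumptions for illustration):

```r
pattern <- "([a-z0-9][?!.])\\s.*"

two <- "The toll rose to 187. More updates will follow."
one <- "The toll rose to 187."

# Everything after the first sentence is dropped.
first_two <- sub(pattern, "\\1", two)

# No whitespace follows the period, so nothing matches and the
# string comes back as it was.
first_one <- sub(pattern, "\\1", one)

# grepl() tells you whether a replacement actually happened.
matched <- grepl(pattern, c(two, one))
```

For single-sentence input that is usually the result you want anyway, but a string with no sentence terminator at all also passes through untouched, so check with grepl if you need to tell the cases apart.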
Wiktor Stribiżew
  • Doesn't work when the first sentence contains `Mr.` or `Dr.`, e.g. `Mr. Mahendra Prasad, a politician from Janata Dal (United) party, is a Member of the Parliament of India representing Bihar in the Rajya Sabha, the upper house of the Parliament . Second sentence etc.` – jaggi Apr 15 '18 at 18:23
  • I think it should handle titles so that they are not matched as the end of a sentence - https://en.wikipedia.org/wiki/Title – jaggi Apr 15 '18 at 19:07
  • @jaggi Why do you add `{2}` to my solution? It is not in line with the current OP requirements. See *The rule I want to implement (**which I know won't be universal solution**) is to extract from string start `^` up to (including) the first period/exclamation/question mark that is preceded by a lowercase letter or number.* The sentences above are not the ones OP tries to match, so yours is a different question. – Wiktor Stribiżew Apr 15 '18 at 19:23
  • Ah, I see. I don't want to open a new question just for this, so I'll just put it in the comments for reference, in case anybody else needs to handle 2-char titles to get a `valid` first sentence - `([a-z0-9]{2}[?!.])\s.*` – jaggi Apr 15 '18 at 19:32
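For reference, a quick sketch of how the `{2}` variant from the last comment behaves compared with the pattern from the answer. The sample string y is made up for illustration:

```r
y <- "Mr. Smith arrived at noon. He left soon after."

# Pattern from the answer: stops at the title, because "r." is a
# lowercase letter followed by a period and whitespace.
res_one <- sub("([a-z0-9][?!.])\\s.*", "\\1", y)
res_one
#> [1] "Mr."

# Variant from the comment: requires two lowercase letters or digits
# before the punctuation, so "Mr." no longer ends the sentence.
res_two <- sub("([a-z0-9]{2}[?!.])\\s.*", "\\1", y)
res_two
#> [1] "Mr. Smith arrived at noon."
```

This only helps with titles made of a capital plus one lowercase letter. All-lowercase abbreviations such as `etc.` or `vs.` would still be treated as sentence ends.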

The corpus package has special handling for abbreviations when determining sentence boundaries:

library(corpus)       
text_split(x, "sentences")
#>   parent index text
#> 1 1          1 Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of Oct…
#> 2 1          2 The death toll has now risen to at least 187.

There's also a useful dataset of common abbreviations for many languages, including English. See corpus::abbreviations_en, which can be used to disambiguate sentence boundaries.
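To give a feel for how an abbreviation list can disambiguate boundaries, here is a rough base R sketch. It is not how corpus works internally. The abbrev vector is a tiny hand-written stand-in for a fuller list such as corpus::abbreviations_en, and first_sentence_abbrev is a hypothetical helper name:

```r
# Hypothetical stand-in for a real abbreviation list.
abbrev <- c("Mr.", "Mrs.", "Dr.", "U.S.", "W.")

first_sentence_abbrev <- function(text, abbreviations = abbrev) {
  # Positions of every ., ? or ! followed by whitespace or end of string.
  ends <- gregexpr("[.?!](?=\\s|$)", text, perl = TRUE)[[1]]
  if (ends[1] == -1) return(NA_character_)
  for (end in ends) {
    candidate <- substr(text, 1, end)
    # Last whitespace-delimited token of the candidate, e.g. "Dr." or "11."
    last_token <- sub("^.*\\s", "", candidate)
    # Accept the first boundary that does not end in a known abbreviation.
    if (!last_token %in% abbreviations) return(candidate)
  }
  NA_character_
}

x <- "Bali bombings: U.S. President George W. Bush amongst many others has condemned the perpetrators of the Bali car bombing of October 11. The death toll has now risen to at least 187."

first_sentence_abbrev(x)
first_sentence_abbrev("Mr. Smith arrived at noon. He left soon after.")
```

The sketch handles one string at a time and is only as good as the list it is given, which is the reason to prefer a maintained list over a hand-written one.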

dmi3kno