I want to detect whether the first character after the end of a sentence is a lowercase letter.

For example:

Howdy world? lorem // match
Howdy world... lorem // match
Howdy world?   lorem // match
What is reality. howdy // match
Howdy you. Lorem // no match
Howdy you. 進撃の // no match

The end of a sentence is defined by these characters: .!?

What type of regex could I use to achieve this?

Henrik Petterson

2 Answers

To match a word and its end-of-sentence punctuation marks when they are followed by whitespace and a lowercase letter, use

'~\w+[.?!]+\s+(?=\p{Ll})~u'

See the regex demo

Explanation:

  • \w+ - one or more alphanumeric/underscore characters
  • [.?!]+ - one or more literal ., ? or ! characters
  • \s+ - one or more whitespace characters
  • (?=\p{Ll}) - a positive lookahead that requires the next character to be a lowercase letter, without consuming it (see Unicode character properties for \p{Ll} details and more Unicode category classes).

In PHP, use the /u modifier since you are working with Unicode strings.

Here is a PHP code demo:

$re = '~\w+[.?!]+\s+(?=\p{Ll})~u'; 
$arr = array("Howdy world? lorem", "Howdy world... lorem", "Howdy world?   lorem", "What is reality. howdy ",
    "Howdy you. Lorem ", "Howdy you. 進撃の "); 
print_r(preg_grep($re, $arr));
// => Array([0] => Howdy world? lorem    [1] => Howdy world... lorem
//[2] => Howdy world?   lorem    [3] => What is reality. howdy )
Wiktor Stribiżew
  • This is exactly what I am looking for. Only one remaining question. When it finds a match, is it possible to highlight the ending word of the sentence (including the ending punctuation). See this example: https://regex101.com/r/gR5hB8/2 -- the term **world?** and **world...** etc should be highlighted. – Henrik Petterson Apr 08 '16 at 12:02
  • Maybe [`'~\w+[.?!](?=\s+\p{Ll})~u'`](https://regex101.com/r/gR5hB8/3). Or add a `\s*` after `\w+` if there can be whitespace between the word and the final punctuation. – Wiktor Stribiżew Apr 08 '16 at 12:12
  • YES! Exactly, is it possible to add the white space that follows the ending of the sentence as well, please see image for clarification: http://i.imgur.com/Zm6SPCE.jpg – Henrik Petterson Apr 08 '16 at 12:30
  • Just put it into the consuming part of the pattern: [`'~\w+[.?!]+\s+(?=\p{Ll})~u'`](https://regex101.com/r/gR5hB8/5). Note the `+` after `[.?!]+`. – Wiktor Stribiżew Apr 08 '16 at 12:33
  • I'll bounty this with 50 points once eligible because it appears it's more complicated than what I estimated. Somehow, while this matches on regex101, it doesn't seem to work in my own function. I'm working on a split sentence function, please see: http://ideone.com/wz6CH7 -- you can find your regex there (see comments). It should not split the sentence if your regex is a match, but seems like it does it anyway. – Henrik Petterson Apr 08 '16 at 13:28
  • See [the updated demo](http://ideone.com/7vbKEZ) where I moved the regex to the first position in the `$after_regexes` array. I do not think you need exactly that, but I guess that is something you need to check - the position of the regex inside the array. Also, recheck the meaning of the `m` modifier: in PHP it redefines the meaning of `^` and `$`, while in Ruby that flag redefines the behavior of `.` (I remember that code was "translated" by ndn from Ruby into PHP). – Wiktor Stribiżew Apr 08 '16 at 13:53
  • Maybe I'm missing something but the updated demo is still not splitting the sentences correctly...? (incorrect link?) – Henrik Petterson Apr 08 '16 at 13:59
  • It is splitting the string into two strings. How should it work? I believe you need to update the question (although that is not a good thing, you know, to change topics like that) – Wiktor Stribiżew Apr 08 '16 at 14:02
  • Sorry about the lack of clarification, this is what's desired: http://i.imgur.com/coXMZWr.png – Henrik Petterson Apr 08 '16 at 14:04
  • That's a lot of regexes. This is actually quite a neat function! – Gary Woods Apr 08 '16 at 14:47
  • Thank you very much for posting this. I will mark this as correct as it solved the initial question that I asked! Btw, are you saying that I should strip all the "m" modifiers from the regexes? – Henrik Petterson Apr 09 '16 at 11:32
  • @Henrik I will have a look at the Ruby code and perhaps provide a comment to the original ndn's answer. Just it is not that easy to find time for that at the weekend. – Wiktor Stribiżew Apr 09 '16 at 12:28
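
The two variants suggested in the comments above differ only in whether the whitespace sits inside the lookahead or in the consuming part of the pattern. The following is a minimal sketch of that difference, not code from the thread: the helper name `sentence_end_matches` and the sample string are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the thread): returns every substring
// that the given pattern matches in $text.
function sentence_end_matches(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $m);
    return $m[0];
}

// Assumed sample text combining cases from the question.
$text = "Howdy world?   lorem. What is reality. howdy";

// Whitespace inside the lookahead: only the word and a single
// punctuation mark are consumed.
$word_only = '~\w+[.?!](?=\s+\p{Ll})~u';

// Whitespace in the consuming part: the word, the whole punctuation
// run, and the trailing whitespace become part of the match.
$with_space = '~\w+[.?!]+\s+(?=\p{Ll})~u';

var_dump(sentence_end_matches($word_only, $text));
var_dump(sentence_end_matches($with_space, $text));
```

With this sample, the first pattern yields `world?` and `reality.`, while the second also includes the whitespace that follows each of them. Because the first pattern has no `+` after `[.?!]`, it does not match a string such as `Howdy world... lorem`; the second one does.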

You could try something like this: [.!?]\s*[a-z] (example available here).

This matches a ., !, or ? character, followed by optional whitespace and then a lowercase letter of the English alphabet. Because [a-z] covers only ASCII letters, lowercase letters from other alphabets are not detected.

npinti
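
To see how the `[a-z]` pattern from the last answer behaves in PHP, here is a small sketch that runs it next to a Unicode-aware variant. The variant `[.!?]\s*\p{Ll}` with the `u` modifier and the sample strings are assumptions added for comparison; they are not part of either answer. The file is assumed to be saved as UTF-8.

```php
<?php
// ASCII-only pattern from the answer above, wrapped in PHP delimiters.
$ascii = '~[.!?]\s*[a-z]~';

// Assumed Unicode-aware variant: \p{Ll} matches any lowercase letter
// when the u modifier is set.
$unicode = '~[.!?]\s*\p{Ll}~u';

// Assumed sample strings, not taken from the thread.
$samples = [
    "Howdy world? lorem",  // ASCII lowercase after "?"
    "Howdy world. élan",   // non-ASCII lowercase after "."
    "Howdy you. Lorem",    // uppercase after "."
    "Visit example.com",   // "." inside a token, no whitespace
];

foreach ($samples as $s) {
    // preg_match returns 1 on a match and 0 otherwise.
    printf("%-22s ascii=%d unicode=%d\n", $s, preg_match($ascii, $s), preg_match($unicode, $s));
}
```

Since `\s*` allows zero whitespace characters, both patterns also match inside tokens such as `example.com`; requiring `\s+` instead would avoid that if it is not wanted.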