regex: detect if character after an end of a sentence is in lowercase

Question

I want to detect whether the first character after the end of a sentence is lowercase.

For example:

Howdy world? lorem // match
Howdy world... lorem // match
Howdy world?   lorem // match
What is reality. howdy // match
Howdy you. Lorem // no match
Howdy you. 進撃の // no match

The end of a sentence is defined by these characters: .!?

What type of regex could I use to achieve this?

Do you want to match these end of sentence punctuation marks if they are followed with whitespace and a lowercase letter? — Wiktor Stribiżew, Apr 08 '16 at 11:37
Yes, although it can be more than one white space. Like the third example I have above. — Henrik Petterson, Apr 08 '16 at 11:37

Wiktor Stribiżew · Accepted Answer · 2016-04-08T12:37:09.210

To match these end-of-sentence punctuation marks when they are followed by whitespace and a lowercase letter, use

'~\w+[.?!]+\s+(?=\p{Ll})~u'

See the regex demo

Explanation:

\w+ - 1+ word characters (letters, digits or underscore)
[.?!]+ - 1+ literal ., ? or !
\s+ - 1+ whitespace symbols
(?=\p{Ll}) - a positive lookahead that requires the next character to be a lowercase letter without consuming it (see Unicode character properties for \p{Ll} details and more Unicode category classes).

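In PHP, use the u modifier (the trailing u after the closing ~ delimiter) since you are working with Unicode strings; it makes PCRE treat both the pattern and the subject as UTF-8.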

Here is a PHP code demo:

$re = '~\w+[.?!]+\s+(?=\p{Ll})~u'; 
$arr = array("Howdy world? lorem", "Howdy world... lorem", "Howdy world?   lorem", "What is reality. howdy ",
    "Howdy you. Lorem ", "Howdy you. 進撃の "); 
print_r(preg_grep($re, $arr));
// => Array([0] => Howdy world? lorem    [1] => Howdy world... lorem
//[2] => Howdy world?   lorem    [3] => What is reality. howdy )
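
If you only need a yes/no answer for a single string, the same pattern can be wrapped in a small helper. This is just a minimal sketch: the function name nextSentenceStartsLowercase is my own invention for illustration, and it assumes PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical helper (the name is an assumption, not part of the answer):
// returns true when end-of-sentence punctuation is followed by whitespace
// and then a lowercase letter.
function nextSentenceStartsLowercase(string $text): bool
{
    // Same pattern as above; preg_match returns 1 when a match is found.
    return preg_match('~\w+[.?!]+\s+(?=\p{Ll})~u', $text) === 1;
}

var_dump(nextSentenceStartsLowercase("Howdy world?   lorem")); // bool(true)
var_dump(nextSentenceStartsLowercase("Howdy you. Lorem"));     // bool(false)
var_dump(nextSentenceStartsLowercase("Howdy you. 進撃の"));     // bool(false)
```

The last call returns false because 進 is a letter without case, so it is not in the \p{Ll} category.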

edited Apr 08 '16 at 12:37

answered Apr 08 '16 at 11:38

Wiktor Stribiżew


This is exactly what I am looking for. Only one remaining question. When it finds a match, is it possible to highlight the ending word of the sentence (including the ending punctuation). See this example: https://regex101.com/r/gR5hB8/2 -- the term **world?** and **world...** etc should be highlighted. – Henrik Petterson Apr 08 '16 at 12:02
Maybe [`'~\w+[.?!](?=\s+\p{Ll})~u'`](https://regex101.com/r/gR5hB8/3). Or add a `\s*` after `\w+` if there can be whitespace between the word and the final punctuation. – Wiktor Stribiżew Apr 08 '16 at 12:12
YES! Exactly, is it possible to add the white space that follows the ending of the sentence as well, please see image for clarification: http://i.imgur.com/Zm6SPCE.jpg – Henrik Petterson Apr 08 '16 at 12:30
Just put it into the consuming part of the pattern: [`'~\w+[.?!]+\s+(?=\p{Ll})~u'`](https://regex101.com/r/gR5hB8/5). Note the `+` after `[.?!]+`. – Wiktor Stribiżew Apr 08 '16 at 12:33
I'll bounty this with 50 points once eligible because it appears it's more complicated than what I estimated. Somehow, while this matches on regex101, it doesn't seem to work in my own function. I'm working on a split sentence function, please see: http://ideone.com/wz6CH7 -- you can find your regex there (see comments). It should not split the sentence if your regex is a match, but seems like it does it anyway. – Henrik Petterson Apr 08 '16 at 13:28
See [the updated demo](http://ideone.com/7vbKEZ) where I moved the regex to the first position in the `$after_regexes` array. I do not think you need exactly that, but I guess that is something you need to check - the position of the regex inside the array. Also, recheck the meaning of the `m` modifier: it redefines the meaning of `^` and `$` in PHP, while in Ruby that flag redefines the behavior of `.` (I remember that code was "translated" by ndn from Ruby into PHP). – Wiktor Stribiżew Apr 08 '16 at 13:53
Maybe I'm missing something but the updated demo is still not splitting the sentences correctly...? (incorrect link?) – Henrik Petterson Apr 08 '16 at 13:59
It is splitting the string into two strings. How should it work? I believe you need to update the question (although that is not a good thing, you know, to change topics like that) – Wiktor Stribiżew Apr 08 '16 at 14:02
Sorry about the lack of clarification, this is what's desired: http://i.imgur.com/coXMZWr.png – Henrik Petterson Apr 08 '16 at 14:04
That's a lot of regexes. This is actually quite a neat function! – Gary Woods Apr 08 '16 at 14:47
Thank you very much for posting this. I will mark this as correct as it solved the initial question that I asked! Btw, are you saying that I should strip all the "m" modifiers from the regexes? – Henrik Petterson Apr 09 '16 at 11:32
@Henrik I will have a look at the Ruby code and perhaps provide a comment to the original ndn's answer. Just it is not that easy to find time for that at the weekend. – Wiktor Stribiżew Apr 09 '16 at 12:28
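
To make the difference between the two patterns from the comments above easier to see, here is a rough sketch that runs both over one sample string with preg_match_all. The sample text and the variable names are assumptions made up for illustration.

```php
<?php
// Sample text is an assumption, built from the question's examples.
$text = "Howdy world? lorem. Howdy world... lorem";

// Variant from the comments: the whitespace stays inside the lookahead,
// so only the word plus a single punctuation mark is consumed.
preg_match_all('~\w+[.?!](?=\s+\p{Ll})~u', $text, $wordOnly);

// Final pattern: repeated punctuation and the following whitespace
// are part of the match; only the lowercase letter is looked ahead.
preg_match_all('~\w+[.?!]+\s+(?=\p{Ll})~u', $text, $withSpace);

print_r($wordOnly[0]);  // one match: "world?"
print_r($withSpace[0]); // two matches: "world? " and "world... "
```

The first variant skips "world..." because it allows only one punctuation character before the whitespace, which is why the final pattern uses `[.?!]+`.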

npinti · Answer 2 · 2016-04-08T11:49:27.767

You could try something like this: [.!?]\s*[a-z] (example available here).

This matches any lowercase letter of the English alphabet (a-z) that follows a ., ! or ? character, optionally separated by whitespace.
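
For comparison, here is a small sketch of how this pattern behaves in PHP. The ~ delimiters and the last two sample strings are my own assumptions, added to show two edge cases: \s* also accepts zero whitespace, and [a-z] covers ASCII letters only.

```php
<?php
// Pattern from this answer, wrapped in ~ delimiters (no u modifier).
$re = '~[.!?]\s*[a-z]~';

// The last two samples are illustrative assumptions, not from the question.
$samples = [
    "Howdy world? lorem",  // match
    "Howdy you. Lorem",    // no match
    "Howdy you. 進撃の",    // no match
    "Visit example.com",   // match: \s* also allows zero whitespace
    "Fin. écrit",          // no match: é is outside a-z
];

$results = [];
foreach ($samples as $s) {
    // Store true/false per sample string.
    $results[$s] = preg_match($re, $s) === 1;
}
var_dump($results);
```

If text such as domain names or accented lowercase letters can occur in the input, requiring at least one whitespace character (\s+) and using \p{Ll} with the u modifier avoids both edge cases.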

edited Apr 08 '16 at 11:49

answered Apr 08 '16 at 11:38

npinti

