
I'm using PHP's preg_split() to split paragraphs into sentences. Here's the regex I'm using:

(?<=[\.\?\!]|(\."))\s(?=[A-Z\s\b])

It should match a space that is preceded by punctuation and followed by either another space or a capital letter. However, it doesn't match cases like this:

A "word. ".

I'm expecting it to split this into 2 parts: A "word. and "., but it's not matching. How do I fix the regex?

Leo Jiang
  • It's easy enough to write a regex that will split by "a space preceded by a punctuation and followed by either a space or capital letter"; however, this won't always constitute the beginning of a sentence. Common exception: abbreviations. The short answer to "how can I split paragraphs into sentences" is that regex alone cannot accomplish this. – CrayonViolent Feb 23 '14 at 04:07
  • Agreed. Lexical analysis will be necessary. –  Feb 23 '14 at 04:09
  • sidenote: `".` is not valid grammar. For periods, if a sentence ends in a quote, the period should be inside the quote. So not only are you asking for regex to parse proper grammar (which it can't), you're asking for it to be lenient on improper grammar. – CrayonViolent Feb 23 '14 at 04:16
  • Note that inside the character class `[]` you don't need (should not use) a backslash. – Floris Feb 23 '14 at 04:16
  • @CrayonViolent - punctuation can be inside or outside the quote, depending on the context and the geographical location: `"Look out!" he said.`, but `They said they were "just friends".` One source: https://www.grammarbook.com/punctuation/quotes.asp . A contradicting source: http://www.grammar-monster.com/lessons/quotation_(speech)_marks_punctuation_in_or_out.htm . The latter seems more convincing. For example: `Did she really say "I love you"?` _clearly_ should have the question mark outside the quotes, since she didn't say "I love you?" but the quote is embedded in a question. – Floris Feb 23 '14 at 04:22
  • Evidently, this won't be perfect. I just want this to match as many cases as possible. The only common case that this does not match is middle initials. – Leo Jiang Feb 23 '14 at 04:26
  • @Floris yes, it depends on the punctuation vs. context. But some of your examples are wrong, and periods should *always* be on the inside if the sentence is ending in quote. – CrayonViolent Feb 23 '14 at 04:27
  • @CrayonViolent - your statement "periods should _always_ be on the inside if the sentence is ending in quote" is wrong. You are assuming US grammar. "English" is a language spoken outside of the United States, and the rules of grammar vary by region. It is absolutely OK to have the period outside of quotation marks in certain circumstances. Again, see http://www.grammar-monster.com/lessons/quotation_(speech)_marks_punctuation_in_or_out.htm for some examples. Or see http://english.stackexchange.com/a/39/63462 which references both Guardian and Economist style guides. "English" sources... – Floris Feb 23 '14 at 04:37
  • @Floris okay well I will concede to the point that locale plays a role in rules. All the more reason why regex is not the right tool for the job! – CrayonViolent Feb 23 '14 at 04:49
  • @CrayonViolent - OK, peace. – Floris Feb 23 '14 at 04:53

3 Answers


Since you have acknowledged it can't be perfect, here's a regex that should "work" for you:

$paragraph = 'This is a sentence. "More sentence." Another? "MORE". Many more. She said "how do you do?" and I said "wtf".';
$sentences = preg_split('~([a-zA-Z]([.?!]"?|"?[.?!]))\K\s+(?=[A-Z"])~',$paragraph);

print_r($sentences);

output:

Array
(
    [0] => This is a sentence.
    [1] => "More sentence."
    [2] => Another?
    [3] => "MORE".
    [4] => Many more.
    [5] => She said "how do you do?" and I said "wtf".
)
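
As a quick check against the string from the question, here is a minimal sketch that runs the same pattern on `A "word. ".`. The variable names are only illustrative. The `\K` resets the start of the match, so only the whitespace is consumed as the delimiter:

```php
<?php
// Same pattern as above, applied to the example from the question.
$pattern = '~([a-zA-Z]([.?!]"?|"?[.?!]))\K\s+(?=[A-Z"])~';
$parts = preg_split($pattern, 'A "word. ".');

// "d." satisfies the part before \K, the space is the delimiter,
// and the following double quote satisfies the look-ahead.
print_r($parts); // two elements: 'A "word.' and '".'
```
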
CrayonViolent

Your regex can't match your provided example.

You want your regex to split A "word. ". somewhere. The string contains two spaces the regex could match:

A "word. ".
 ^      ^

Your regex means:

one space, preceded by either [.?!] or ." (literally) (1) and followed by either a capital letter or another space ([A-Z\s\b]) (2). Note that inside a character class, \b means a backspace character, not a word boundary.

The first space is preceded by a capital letter rather than punctuation, so it fails condition 1.

The second space is preceded by a dot, so it is a candidate, but it is followed by a double quote rather than a capital letter or another space, so it fails condition 2. Hence there is no match.

The easiest way to fix this is to simply add " to your look-ahead:

(?<=[.?!]|(\."))\s(?=[A-Z\s\b"])
                             ^

But for splitting paragraphs into sentences I doubt this will be sufficient, as the comments already point out.
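
To make the difference concrete, here is a small sketch that runs both patterns on the string from the question. The variable names are just for illustration; the patterns are the ones quoted above, wrapped in `/` delimiters:

```php
<?php
// The example string from the question.
$text = 'A "word. ".';

// Original look-ahead: capital letter, whitespace, or backspace
// (\b inside a character class is a backspace).
$original = '/(?<=[.?!]|(\."))\s(?=[A-Z\s\b])/';
// Same pattern with a double quote added to the look-ahead.
$fixed = '/(?<=[.?!]|(\."))\s(?=[A-Z\s\b"])/';

$before = preg_split($original, $text);
$after  = preg_split($fixed, $text);

print_r($before); // one element: the string is not split
print_r($after);  // two elements: 'A "word.' and '".'
```

Only the second space is ever a candidate, so adding the quote to the look-ahead is enough for this particular input.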

KeyNone
  • It's getting late and I thought I was matching `[A-Z\s\b]` after the period (instead of the space). Thanks! – Leo Jiang Feb 23 '14 at 04:46

The following expression seems pretty good:

$arr = preg_split('#(?<=[.?!](\s|"))\s?(?=[A-Z\b"])#',$str);

I tested it on

When my friend said he likes deep dish pizza one day, I immediately set a time to come back to Little Star. Arguably, the best deep dish pizza in SF...though...I don't believe there are many places that do deep dish pizza. That being said...its not the BEST ever, just the best "for the area." They use cornmeal in the crust, or on the baking surface, so there's a bit of extra crunch to it. That being said...I'm not sure how much I like the cornmeal texture to my pizza. I kind of want just a GOOD CRUST, you know? No extra stuff to try to make it more crunchy.

Outcome:

Array
(
    [0] => When my friend said he likes deep dish pizza one day, I immediately set a time to come back to Little Star. 
    [1] => Arguably, the best deep dish pizza in SF...though...I don't believe there are many places that do deep dish pizza. 
    [2] => That being said...its not the BEST ever, just the best "for the area."
    [3] => They use cornmeal in the crust, or on the baking surface, so there's a bit of extra crunch to it. 
    [4] => That being said...I'm not sure how much I like the cornmeal texture to my pizza. 
    [5] => I kind of want just a GOOD CRUST, you know? 
    [6] => No extra stuff to try to make it more crunchy.
)

However, it fails on abbreviations such as

I met Ms. Scarlet in the library.

Here the ". S" sequence satisfies the pattern's definition of a sentence boundary, so the string is split after "Ms. " even though no new sentence starts there.
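
One possible way to soften this, sketched below, is to put a negative look-behind for each known abbreviation in front of the pattern. The abbreviation list and variable names here are assumptions for illustration only; any real list would need to be much longer, and this still won't handle middle initials or abbreviations you didn't anticipate:

```php
<?php
$text = 'I met Ms. Scarlet in the library. She was reading.';

// Pattern from this answer, unchanged.
$base = '(?<=[.?!](\s|"))\s?(?=[A-Z\b"])';
$naive = array_map('trim', preg_split('#' . $base . '#', $text));

// Hypothetical abbreviation list (an assumption, not exhaustive).
$abbreviations = ['Mr', 'Mrs', 'Ms', 'Dr'];

// Build one fixed-width negative look-behind per abbreviation.
// The split point sits after the space, so each guard ends in \s.
$guards = '';
foreach ($abbreviations as $abbr) {
    $guards .= '(?<!\b' . preg_quote($abbr, '#') . '\.\s)';
}
$guarded = array_map('trim', preg_split('#' . $guards . $base . '#', $text));

print_r($naive);   // 'I met Ms.', 'Scarlet in the library.', 'She was reading.'
print_r($guarded); // 'I met Ms. Scarlet in the library.', 'She was reading.'
```

The `trim` calls only remove the trailing space that this pattern leaves on some elements.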

Floris