12

I'm a regexp noob trying to split paragraphs into sentences. My language uses quite a few abbreviations (such as bl.a.) in the middle of sentences, so I have concluded that I need to look for a punctuation mark that is followed by a single space and then a word that starts with a capital letter, like:

[sentence1]...anymore. However...[sentence2]

So a paragraph like:

Der er en lang og bevæget forhistorie bag lov om varsling m.v. i forbindelse med afskedigelser af større omfang. Det er ikke en bureaukratisk lovtekst blandt så mange andre.

should produce this output:

[0] => Der er en lang og bevæget forhistorie bag lov om varsling m.v. i forbindelse med afskedigelser af større omfang.
[1] => Det er ikke en bureaukratisk lovtekst blandt så mange andre.

and NOT this:

[0] => Der er en lang og bevæget forhistorie bag lov om varsling m.v. 
[1] => i forbindelse med afskedigelser af større omfang.
[2] => Det er ikke en bureaukratisk lovtekst blandt så mange andre.

I have found a solution that does the first part of this with the positive lookbehind feature:

$regexp = '/(?<=[.!?] | [.!?][\'"])/';

and then

$sentences = preg_split($regexp, $paragraph, -1, PREG_SPLIT_NO_EMPTY);

which is a great starting point, but splits way too many times because of the many abbreviations.

I have tried to do this:

(?<=[.!?]\s[A-Z] | [.!?][\'"])

to target every occurrence of either

. or ! or ?

followed by a space and a capital letter, but that did not work.

Does anyone know if there is a way to accomplish what I am trying to do?

hippietrail
acrmuui
  • So you want to insert a line break wherever the text matches the 'This. Is' criterion? – Daryl Gill Apr 06 '13 at 16:19
  • Not necessarily, I'm quite satisfied with the output format of the preg_split PHP function. What I struggle with is writing the regexp that looks for the 'This. Is' criteria. – acrmuui Apr 06 '13 at 16:23
  • Hi, thanks for answering. I actually read through those exact answers before posting, but I could not find one that searches for the exact pattern of a punctuation mark followed by a space followed by a word that starts with a capital letter. Or am I missing something? – acrmuui Apr 06 '13 at 16:42
  • Hi Ka, I have updated the question with an example of the output I am looking for. – acrmuui Apr 06 '13 at 16:52
  • @ka: No, this question is not a duplicate of the linked question. – Madara's Ghost Apr 06 '13 at 17:17
  • From the RegExps you're using, I see you have (or want) support for quotes ["\']. Do you need this as well? Can you provide one example where you want to split at quotes and one where you don't? – CSᵠ Apr 06 '13 at 17:17
  • Thank you very much for the help Ka, your answer works perfectly. You are right I do need support for quotes, but I think I figured that part out after seeing your solution. The regexp now looks like this = (?<=[.?!;]|[.?!;][\'"])\s+(?=\p{Lu}). Does that look somewhat right to you? – acrmuui Apr 06 '13 at 17:21
  • @acrmuui yes, looks good, but you don't have quote usage in the example you posted – CSᵠ Apr 06 '13 at 18:19
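
The quote-aware pattern from the comment above can be tried out with a short script. This is only a sketch: the sample text is hypothetical (the question contains no quoted example), and the `u` modifier is added on the assumption that the input is UTF-8.

```php
<?php
// Sketch: allow an optional closing quote between the punctuation mark
// and the whitespace that gets split on.
$pattern = '/(?<=[.?!;]|[.?!;][\'"])\s+(?=\p{Lu})/u';

// Hypothetical sample text, not taken from the question.
$text = 'He said "Stop." Then he left. Really?';

$sentences = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($sentences);
// → ['He said "Stop."', 'Then he left.', 'Really?']
```

Note that a sentence that starts with an opening quote is not split off by this pattern, because the lookahead requires the uppercase letter to follow the whitespace directly.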

2 Answers

16

Unicode RegExp for splitting sentences: (?<=[.?!;])\s+(?=\p{Lu})

Explained demo here: http://regex101.com/r/iR7cC8
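
As a rough illustration, here is how that pattern might be plugged into the `preg_split` call from the question, using the question's own paragraph. The `u` modifier is an assumption that the text is UTF-8; it is what lets `\p{Lu}` match uppercase letters beyond ASCII.

```php
<?php
// Sketch: split on whitespace that follows ., ?, ! or ; and is itself
// followed by an uppercase letter. The look-arounds keep the punctuation
// and the capital letter in the output; only the whitespace is consumed.
$paragraph = 'Der er en lang og bevæget forhistorie bag lov om varsling m.v. '
    . 'i forbindelse med afskedigelser af større omfang. '
    . 'Det er ikke en bureaukratisk lovtekst blandt så mange andre.';

$sentences = preg_split('/(?<=[.?!;])\s+(?=\p{Lu})/u', $paragraph, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
// "m.v. i" is not split because "i" is lowercase; "omfang. Det" is split.
```

This yields the two sentences the question asks for. An abbreviation that happens to be followed by a capitalized word would still be split.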

CSᵠ
  • "Unicode" here is misleading. This regex does make use of Unicode Character Properties, but it does **not** implement the Unicode Sentence Boundary rules as defined by UAX 29. – NikiC Apr 06 '13 at 17:41
  • @NikiC it's not foolproof indeed but UAX29 also notes: *...implementations are free to override (tailor) the results to meet the requirements ...* – CSᵠ Apr 06 '13 at 18:18
  • It doesn't work on "e.g." or "2. text here", though. It shouldn't split there – tjvg1991 Apr 24 '18 at 04:05
  • @tjvg1991 indeed, this is just a generic solution; you can add those special cases on top of the regex – CSᵠ May 05 '18 at 16:23
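
One way such special cases might be layered on top is with negative lookbehinds placed in front of the existing one. The sketch below is an assumption, not part of the answer: it vetoes the split after "e.g." and after a single-digit list marker such as "2.", and the sample text is hypothetical.

```php
<?php
// Sketch: each negative lookbehind vetoes the split when the whitespace
// is preceded by a known exception. Both exceptions are illustrative;
// extend the list to suit your own text.
$pattern = '/(?<!\be\.g\.)(?<!\b\d\.)(?<=[.?!;])\s+(?=\p{Lu})/u';

// Hypothetical sample text.
$text = '2. Text here. See e.g. Denmark for details. It ended in 1990. Then came more.';

$sentences = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($sentences);
// "1990." still ends a sentence: \b does not match between "9" and "0",
// so only a standalone single digit counts as a list marker.
```

Each exception has to be a fixed-length pattern, since PCRE lookbehinds cannot contain unbounded quantifiers; for a long abbreviation list, post-processing the split result may be easier to maintain.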
3

Searching for such a pattern still seems unreliable, but since sentences may also be ended by line breaks, I would try just the following:

[.\!\?][\s\n\r\t][A-Z] 

I don't think you actually meant to use the look-arounds, did you? (!? together can carry a special meaning, so escaping each with \ tells the regex to ignore it.)
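
To see what this pattern does when handed to `preg_split`, a small comparison may help. It is a sketch on a hypothetical sample string; `\s` already covers `\n`, `\r` and `\t`, so the character class is shortened here. The second call is an assumption showing the same characters wrapped in look-arounds.

```php
<?php
// Hypothetical sample text.
$text = 'It works. Mostly fine.';

// Consuming pattern: the whole match (". M") is treated as the delimiter
// and therefore removed from the output.
$consumed = preg_split('/[.!?]\s[A-Z]/', $text);
// → ['It works', 'ostly fine.']

// Same characters inside look-arounds: only the whitespace is removed.
$kept = preg_split('/(?<=[.!?])\s(?=[A-Z])/', $text);
// → ['It works.', 'Mostly fine.']

print_r($consumed);
print_r($kept);
```

So when the goal is splitting rather than matching, the look-arounds are what keep the punctuation and the capital letter attached to their sentences.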

Nick Cardoso