0

I found this regex for sentence boundary disambiguation at http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation, but I am not able to use it in a Ruby split statement. I'm not too good with regex, so maybe I am missing something. This is the statement:

((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])

and this is what I tried in Ruby, but no go:

text.split("((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])")
Jan Goyvaerts
DavidP6

1 Answer

2

This should work in Ruby 1.9, or in Ruby 1.8 if you compiled it with the Oniguruma regex engine (which is standard in Ruby 1.9):

result = text.split(/((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/)

The difference is that your code passes a literal string to split(), while this code passes a literal regex.
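To make that difference concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration; only String#split and Regexp.new are standard Ruby. A String argument is used as a literal delimiter, so regex syntax inside it is never interpreted:

```ruby
# Made-up sample text for illustration.
text = "One. Two"

# String argument: split looks for the literal characters \ s + and finds none.
as_string = text.split("\\s+")

# Regexp argument: \s+ is interpreted as "one or more whitespace characters".
as_regex = text.split(/\s+/)

# If the pattern only exists as a String, Regexp.new turns it into a regex first.
from_string = text.split(Regexp.new("\\s+"))

p as_string    # the text comes back unsplit
p as_regex     # split at the space
p from_string  # same as as_regex
```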

It won't work using the classic Ruby regex engine (which is standard in Ruby 1.8) because it doesn't support lookbehind.

I also modified the regular expression. I replaced (\s|\r\n) with \s+. My regex also splits sentences that have multiple spaces between them (typing two spaces after a sentence is common in many cultures) and/or multiple line breaks between them (delimiting paragraphs).
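A small sketch of the \s+ behaviour, on a made-up sample text. One assumption beyond the regex above: the outer group is written as non-capturing, (?:...), because split adds the text of capturing groups to its result, and that group always captures an empty string. The constant name BOUNDARY is just a label for this example:

```ruby
# Same pattern as above, except the outer group is non-capturing (?:...),
# so split has no group text to add to its result.
BOUNDARY = /(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/

# Made-up sample: two spaces after the first sentence,
# a blank line (two line breaks) after the second.
text = "He said hi.  She left.\n\nThen it rained."

sentences = text.split(BOUNDARY)
sentences.each { |s| puts s }
```

With (\s|\r\n) in place of \s+, the double space and the blank line in this sample would not be treated as sentence boundaries, because only a single whitespace character is allowed between the punctuation and the next capital letter.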

When working with Unicode text, a further improvement would be to replace a-z with \p{Ll}\p{Lo}, A-Z with \p{Lu}\p{Lt}\p{Lo}, and 0-9 with \p{N} in the various character classes in your regex. The character class with punctuation symbols can be expanded similarly. That'll need a bit more research because there's no Unicode property for end-of-sentence punctuation.
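As a rough sketch of that substitution, assuming Ruby 1.9 or later and UTF-8 strings: the constant names and the French sample text are made up for illustration, the outer group is written as non-capturing so split returns only the sentences, and the punctuation class is left as [.?!]:

```ruby
# ASCII-only classes, as in the regex above.
ASCII_BOUNDARY = /(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/

# Same structure with the Unicode properties swapped in:
# a-z -> \p{Ll}\p{Lo}, A-Z -> \p{Lu}\p{Lt}\p{Lo}, 0-9 -> \p{N}.
UNICODE_BOUNDARY = /(?:(?<=[\p{Ll}\p{Lo}\p{N})][.?!])|(?<=[\p{Ll}\p{Lo}\p{N}][.?!]"))\s+(?="?[\p{Lu}\p{Lt}\p{Lo}])/

# Made-up sample; \u escapes keep the source file ASCII-only.
# \u00E9 = e acute, \u00C9 = E acute, \u00E0 = a grave, \u00C7 = C cedilla.
text = "Il a mang\u00E9. \u00C9lodie part \u00E0 midi. \u00C7a va?"

ascii_parts   = text.split(ASCII_BOUNDARY)
unicode_parts = text.split(UNICODE_BOUNDARY)

puts ascii_parts.length    # the accented letters fall outside a-z and A-Z
puts unicode_parts.length  # boundaries next to accented letters are found
```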

Jan Goyvaerts
  • Hi, thanks for the Oniguruma lead. I am trying to use the gem so I do not have to re-compile my ruby 1.8: http://oniguruma.rubyforge.org/. This seems to be working but I get nil if I do: reg = Oniguruma::ORegexp.new( '((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])' ) and then reg.scan(text). Should this way work? – DavidP6 May 04 '10 at 02:01
  • I tested your regex in Ruby 1.9.2 using the string "Just use google.com's search. Do you like bing? Or maybe use Yahoo instead.". I noticed that it produces elements with an empty space between each result. – Aris Bartee Feb 22 '11 at 21:04
  • @ArisBartee: I saw the same thing. Could not figure out how to fix the regex. I removed the spaces like this: `result = text.split(/((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/).reject { |s| s.empty? or s.nil? }` – squarism Oct 17 '11 at 14:21