Converting regex statement for sentence extraction to Ruby

Question

I found this regex statement at http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation for sentence boundary disambiguation, but I am not able to use it in a Ruby split statement. I'm not too good with regex, so maybe I am missing something? This is the statement:

((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])

and this is what I tried in Ruby, but no go:

text.split("((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])")

Do you have an example to test your regex? — Nicolas Guillaume, May 01 '10 at 17:56

score 2 · Accepted Answer · answered May 02 '10 at 02:58

This should work in Ruby 1.9, or in Ruby 1.8 if you compiled it with the Oniguruma regex engine (which is standard in Ruby 1.9):

result = text.split(/((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/)

The difference is that your code passes a literal string to split(), while this code passes a literal regex.
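
To make that difference concrete, here is a minimal sketch using a deliberately simple pattern (the sample string and variable names are made up for illustration). Given a String, split searches for that exact text; given a Regexp, it matches the pattern:

```ruby
text = "one1two2three"

# A String argument is matched literally: split looks for a backslash
# followed by "d", which never occurs in the text, so nothing is split.
as_string = text.split("\\d")

# A Regexp argument is matched as a pattern: \d matches any digit.
as_regex = text.split(/\d/)

p as_string  # the whole text as a single element
p as_regex   # the three words between the digits
```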

It won't work using the classic Ruby regex engine (which is standard in Ruby 1.8) because it doesn't support lookbehind.

I also modified the regular expression. I replaced (\s|\r\n) with \s+. My regex also splits sentences that have multiple spaces between them (typing two spaces after a sentence is common in many cultures) and/or multiple line breaks between them (delimiting paragraphs).
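
A small sketch of what that change buys, assuming Ruby 1.9 or later. The sample text is invented, and both patterns use a non-capturing group (?:...) around the lookbehinds so that split returns only the sentences; that grouping is a change made for this illustration, not part of the regex above:

```ruby
# Variant that allows only a single whitespace character (or \r\n)
# between sentences, as in the Wikipedia regex.
single_ws = /(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))(?:\s|\r\n)(?="?[A-Z])/

# Variant with \s+, which allows any run of whitespace between sentences.
multi_ws = /(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/

# Two spaces after the first sentence, a blank line after the second.
text = "It rained.  We stayed in.\n\nThen we left."

with_single = text.split(single_ws)
with_multi  = text.split(multi_ws)

p with_single  # not split: no single whitespace sits directly between "." and a capital
p with_multi   # split into the three sentences
```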

When working with Unicode text, a further improvement would be to replace a-z with \p{Ll}\p{Lo}, A-Z with \p{Lu}\p{Lt}\p{Lo}, and 0-9 with \p{N} in the various character classes in your regex. The character class with punctuation symbols can be expanded similarly. That'll need a bit more research because there's no Unicode property for end-of-sentence punctuation.
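
As a rough sketch of that substitution, assuming Ruby 1.9 or later and a UTF-8 source file: the French sample sentence and the variable names are invented for illustration, the punctuation class is left unchanged, and the outer group is written as non-capturing so that split returns only the sentences.

```ruby
# ASCII-only character classes, as in the regex above.
ascii = /(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/

# The same pattern with a-z, A-Z and 0-9 replaced by Unicode properties.
unicode = /(?:(?<=[\p{Ll}\p{Lo}\p{N})][.?!])|(?<=[\p{Ll}\p{Lo}\p{N}][.?!]"))\s+(?="?[\p{Lu}\p{Lt}\p{Lo}])/

# The first sentence ends in an accented lowercase letter and the
# second starts with an accented capital.
text = "Il est arrivé. Était-il en retard?"

ascii_result   = text.split(ascii)
unicode_result = text.split(unicode)

p ascii_result    # not split: neither accented letter is in a-z or A-Z
p unicode_result  # split into two sentences
```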

Hi, thanks for the Oniguruma lead. I am trying to use the gem so I do not have to re-compile my ruby 1.8: http://oniguruma.rubyforge.org/. This seems to be working but I get nil if I do: reg = Oniguruma::ORegexp.new( '((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])' ) and then reg.scan(text). Should this way work? — DavidP6, May 04 '10 at 02:01
I tested your regex in Ruby 1.9.2 using the string "Just use google.com's search. Do you like bing? Or maybe use Yahoo instead.". I noticed that it produces an empty element between each result. — Aris Bartee, Feb 22 '11 at 21:04
@ArisBartee: I saw the same thing. Could not figure out how to fix the regex. I removed the spaces like this: `result = text.split(/((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/).reject { |s| s.empty? or s.nil? }` — squarism, Oct 17 '11 at 14:21
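
The empty elements appear to come from the outer parentheses around the two lookbehinds. String#split also returns whatever each capturing group matched, and that group always matches an empty string. A sketch of an alternative to filtering, assuming Ruby 1.9 or later, is to make the group non-capturing with (?:...); the variable names here are made up for illustration:

```ruby
text = "Just use google.com's search. Do you like bing? Or maybe use Yahoo instead."

# The regex from the answer: the outer (...) is a capturing group.
capturing = /((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/

# Same regex with the outer group made non-capturing.
non_capturing = /(?:(?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]"))\s+(?="?[A-Z])/

with_empties = text.split(capturing)
sentences    = text.split(non_capturing)

p with_empties  # sentences with an empty string after each split point
p sentences     # only the three sentences
```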
