29

I would like to patch some text data extracted from web pages. Sample:

t="First sentence. Second sentence.Third sentence."

There is no space after the period at the end of the second sentence. This tells me that the third sentence was on a separate line (after a br tag) in the original document.

I want to use a regexp to insert a "\n" character in the proper places and patch my text. My regex:

t2=t.gsub(/([.\!?])([A-Z1-9])/,$1+"\n"+$2)

But unfortunately it doesn't work: "NoMethodError: undefined method `+' for nil:NilClass". How can I properly backreference the matched groups? It was so easy in Microsoft Word; I just had to use the \1 and \2 symbols.

Konstantin
    The numbered globals (`$1`, `$2`, ...) aren't set when the second argument is evaluated; they're set by `gsub` before it yields to the block. Hence sawa's advice on when to use `'\1'` and when to use `$1`. – mu is too short Aug 22 '12 at 03:35
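The timing issue described in this comment can be seen in a small sketch. The two helper methods below are hypothetical names introduced only for illustration. The first deliberately runs an unrelated match beforehand, so that `$1` and `$2` hold stale values instead of nil and the effect is visible rather than raising NoMethodError.

```ruby
# Hypothetical helper: builds the replacement from $1/$2 as a plain argument.
# Ruby evaluates the argument expression before gsub starts matching, so
# $1 and $2 still hold whatever the previous match in this method left behind.
def replace_with_stale_globals(text)
  /(z)(q)/.match("zq")                   # earlier match: $1 == "z", $2 == "q"
  text.gsub(/([.!?])([A-Z1-9])/, $1 + "\n" + $2)
end

# Hypothetical helper: the block runs once per match, after gsub has set $1/$2.
def replace_with_block(text)
  text.gsub(/([.!?])([A-Z1-9])/) { $1 + "\n" + $2 }
end

t = "First sentence. Second sentence.Third sentence."

stale = replace_with_stale_globals(t)
# => "First sentence. Second sentencez\nqhird sentence."

fresh = replace_with_block(t)
# => "First sentence. Second sentence.\nThird sentence."
```

Without the earlier match, `$1` would be nil when the argument is evaluated, which is what produces the error in the question.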

3 Answers

33

You can backreference in the substitution string with \1 (to match capture group 1).

t = "First sentence. Second sentence.Third sentence!Fourth sentence?Fifth sentence."
t.gsub(/([.!?])([A-Z1-9])/, "\\1\n\\2") # => "First sentence. Second sentence.\nThird sentence!\nFourth sentence?\nFifth sentence."
Joshua Cheek
26
  • If you are using gsub(regex, replacement), then use '\1', '\2', ... to refer to the match. Make sure not to put double quotes around the replacement, or else escape the backslash as in Joshua's answer. The conversion from '\1' to the match will be done within gsub, not by literal interpretation.
  • If you are using gsub(regex){replacement}, then use $1, $2, ...
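As a minimal sketch of the two forms side by side, using the sample text from the first answer (the variable names `pattern`, `with_string`, and `with_block` are only illustrative):

```ruby
t = "First sentence. Second sentence.Third sentence!Fourth sentence?Fifth sentence."
pattern = /([.!?])([A-Z1-9])/

# String replacement: single quotes keep \1 and \2 intact so gsub can expand
# them. The newline comes from a separate double-quoted "\n".
with_string = t.gsub(pattern, '\1' + "\n" + '\2')

# Block replacement: $1 and $2 are set for each match before the block runs.
with_block = t.gsub(pattern) { "#{$1}\n#{$2}" }

puts with_string == with_block
```

Both calls should produce the same string, with a newline inserted after each punctuation mark that is directly followed by a capital letter or a digit from 1 to 9.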

But for your case, it is easier not to use matches:

t2 = t.gsub(/(?<=[.\!?])(?=[A-Z1-9])/, "\n")
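The reason no backreferences are needed here is that the lookbehind and lookahead match a position rather than any characters, so nothing is consumed and nothing has to be put back. A short sketch on the sample text from the question:

```ruby
t = "First sentence. Second sentence.Third sentence."

# (?<=[.!?])   requires punctuation immediately before the position.
# (?=[A-Z1-9]) requires a capital letter or a digit 1-9 immediately after it.
# The match itself is empty, so gsub only inserts the newline.
t2 = t.gsub(/(?<=[.!?])(?=[A-Z1-9])/, "\n")

# Split the patched text back into the lines of the original document.
lines = t2.lines.map(&:chomp)
```

Here `lines` ends up with two entries, because only the period before "Third" is directly followed by a capital letter.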
sawa
8

If you got here because of Rubocop complaining "Avoid the use of Perl-style backrefs." about $1, $2, etc., you can do this instead:

some_id = $1
# or
some_id = Regexp.last_match[1] if Regexp.last_match

some_id = $5
# or
some_id = Regexp.last_match[5] if Regexp.last_match
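A variant worth knowing is `Regexp.last_match(n)`, which returns the nth capture group, or nil when there was no previous match, so the trailing `if` guard is not needed. The sketch below assumes a made-up path format and a hypothetical method name, `extract_id`, purely for illustration:

```ruby
# Hypothetical helper: pulls a numeric id out of a path such as "/items/42".
# The "/items/<digits>" format is an assumption made for this example.
def extract_id(path)
  return nil unless path =~ %r{/items/(\d+)}

  # Same value as $1, without the Perl-style global.
  Regexp.last_match(1)
end

found = extract_id("/items/42")
missing = extract_id("/other/path")
```

With these inputs, `found` is the string "42" and `missing` is nil.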

It'll also want you to do

%r{//}.match(some_string)

instead of

some_string[//]

Lame (Rubocop)

Ben Wiseley