How to embed regular expressions in other regular expressions in Ruby

Question

I have a string:

'A Foo'

and want to find "Foo" in it.

I have a regular expression:

/foo/

that I'm embedding into another case-insensitive regular expression, so I can build the pattern in steps:

foo_regex = /foo/
pattern = /A #{ foo_regex }/i

But it won't match correctly:

'A Foo' =~ pattern # => nil

If I embed the text directly into the pattern it works:

'A Foo' =~ /A foo/i # => 0

What's wrong?

Tin Man gives good advice. Ruby is a Perl wanna-be. As such, it compiles every regex into a scoped cluster group construct (such as `(?mix-mix:)`). So the syntax `/string1/` compiles a regex with _no options_ and defaults to `(?-mix:string1)`. Another regex, `/string2/i`, compiled with the `i` flag, moves that flag to the plus side: `(?i-mx:string2)`. Since cluster groups are _scoped_, the options _inside_ the cluster take precedence. So regex 1 inside regex 2 is `(?i-mx:string2(?-mix:string1))`, and scope dictates that `string1` is case sensitive. — , Mar 27 '17 at 23:44
I wouldn't say Ruby is a Perl wanna-be. It's more like Java, Perl and some other languages blended their parts and came up with Ruby. — the Tin Man, Mar 28 '17 at 01:08
@theTinMan: Perl, Smalltalk or Lisp, yes. But Java? I really don't see many similarities. — Eric Duminil, Mar 28 '17 at 07:40

the Tin Man · Answer 1 · 2017-03-28T01:14:11.723

On the surface, it seems that embedding a pattern inside another pattern would simply work, but that rests on a bad assumption about how patterns work in Ruby: that they're simply strings. Using:

foo_regex = /foo/

creates a Regexp object:

/foo/.class # => Regexp

As such it has knowledge of the optional flags used to create it:

( /foo/    ).options # => 0
( /foo/i   ).options # => 1
( /foo/x   ).options # => 2
( /foo/ix  ).options # => 3
( /foo/m   ).options # => 4
( /foo/im  ).options # => 5
( /foo/mx  ).options # => 6
( /foo/imx ).options # => 7

or, if you like binary:

'%04b' % ( /foo/    ).options # => "0000"
'%04b' % ( /foo/i   ).options # => "0001"
'%04b' % ( /foo/x   ).options # => "0010"
'%04b' % ( /foo/xi  ).options # => "0011"
'%04b' % ( /foo/m   ).options # => "0100"
'%04b' % ( /foo/mi  ).options # => "0101"
'%04b' % ( /foo/mx  ).options # => "0110"
'%04b' % ( /foo/mxi ).options # => "0111"

and remembers those whenever the Regexp is used, whether as a standalone pattern or if embedded in another.
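The numeric values above are bit flags, so they can be decoded with the constants Regexp::IGNORECASE, Regexp::EXTENDED and Regexp::MULTILINE. A minimal sketch, where flag_letters is a hypothetical helper name and not part of Ruby:

```ruby
# Hypothetical helper (not part of Ruby): list the flag letters set on a
# Regexp by testing each bit of Regexp#options against its constant.
def flag_letters(regex)
  flags = {
    'i' => Regexp::IGNORECASE, # 1
    'x' => Regexp::EXTENDED,   # 2
    'm' => Regexp::MULTILINE   # 4
  }
  flags.select { |_letter, bit| (regex.options & bit) != 0 }.keys.join
end

flag_letters(/foo/)    # => ""
flag_letters(/foo/i)   # => "i"
flag_letters(/foo/mx)  # => "xm"
flag_letters(/foo/imx) # => "ixm"
```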

You can see this in action if we look to see what the pattern looks like after embedding:

/#{ /foo/  }/ # => /(?-mix:foo)/
/#{ /foo/i }/ # => /(?i-mx:foo)/

?-mix: and ?i-mx: are how those options are represented in an embedded pattern: letters before the hyphen are switched on, and letters after it are switched off.

According to the Regexp documentation for Options:

i, m, and x can also be applied on the subexpression level with the (?on-off) construct, which enables options on, and disables options off for the expression enclosed by the parentheses.
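To illustrate the scoping described in that quote, here is a small sketch that writes the (?on-off) construct by hand. The variable names scoped and mixed are only for illustration:

```ruby
# The i option is switched on only inside the parenthesized group.
scoped = /A (?i:foo)/

'A Foo' =~ scoped # => 0   ("Foo" matches case-insensitively)
'a Foo' =~ scoped # => nil ("A" is still case-sensitive)

# The reverse: the outer /i applies everywhere except inside (?-i:...).
mixed = /A (?-i:foo)/i

'a foo' =~ mixed # => 0
'A Foo' =~ mixed # => nil ("foo" must be lowercase inside the group)
```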

So, Regexp is remembering those options, even inside the outer pattern, causing the overall pattern to fail the match:

pattern = /A #{ foo_regex }/i # => /A (?-mix:foo)/i
'A Foo' =~ pattern # => nil

It's possible to make sure that all sub-expressions use the same options as their surrounding pattern, but that quickly becomes convoluted and messy:

foo_regex = /foo/i
pattern = /A #{ foo_regex }/i # => /A (?i-mx:foo)/i
'A Foo' =~ pattern # => 0

Instead, we have the source method, which returns the text of a pattern without its options:

/#{ /foo/.source  }/ # => /foo/
/#{ /foo/i.source }/ # => /foo/

The problem with the embedded pattern remembering the options also appears when using other Regexp methods, such as union:

/#{ Regexp.union(%w[a b]) }/ # => /(?-mix:a|b)/

and again, source can help:

/#{ Regexp.union(%w[a b]).source }/ # => /a|b/

Knowing all that:

foo_regex = /foo/
pattern = /#{ foo_regex.source }/i # => /foo/i
'A Foo' =~ pattern # => 2
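One caveat when building larger patterns this way: source drops the (?-mix:...) wrapper along with the options, so an alternation in a sub-pattern is no longer grouped. A sketch of composing two sub-patterns, assuming illustrative names article and noun, with an explicit non-capturing group to compensate:

```ruby
# Sub-patterns kept as Regexp objects so each can be tested on its own.
article = /an?/
noun    = /foo|bar/

# Without a group, the | splits the whole pattern: /A foo|bar/i
'bar' =~ /A #{ noun.source }/i # => 0 (matches "bar" alone)

# Interpolate only the source text, and wrap the alternation in (?:...)
# so it stays confined to its own part of the pattern.
pattern = /\b#{ article.source } (?:#{ noun.source })\b/i
# => /\ban? (?:foo|bar)\b/i

'A Foo'  =~ pattern # => 0
'an BAR' =~ pattern # => 0
'A baz'  =~ pattern # => nil
```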

See http://stackoverflow.com/a/42729953/128421, http://stackoverflow.com/a/16705515/128421, http://stackoverflow.com/a/23701327/128421 and http://stackoverflow.com/a/38154742/128421 for additional information about this. — the Tin Man, Mar 28 '17 at 01:41

Stefan · Answer 2 · 2017-09-08T09:45:00.157

"what's wrong?"

Your assumption about how a Regexp is interpolated is wrong.

Interpolation via #{...} is done by calling to_s on the interpolated object:

require 'date'

d = Date.new(2017, 9, 8)
#=> #<Date: 2017-09-08 ((2458005j,0s,0n),+0s,2299161j)>

d.to_s
#=> "2017-09-08"

"today is #{d}!"
#=> "today is 2017-09-08!"

and not just for string literals, but also for regular expression literals:

/today is #{d}!/
#=> /today is 2017-09-08!/

In your example, the object-to-be-interpolated is a Regexp:

foo_regex = /foo/

And Regexp#to_s returns:

[...] the regular expression and its options using the (?opts:source) notation.

foo_regex.to_s
#=> "(?-mix:foo)"

Therefore:

/A #{foo_regex}/i
#=> /A (?-mix:foo)/i

Just like:

"A #{foo_regex}"
#=> "A (?-mix:foo)"

In other words: because of the way Regexp#to_s is implemented, you can interpolate patterns without losing their flags. It's a feature, not a bug.
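A small sketch of where that feature is useful: a case-insensitive sub-pattern embedded in an otherwise case-sensitive pattern. The names unit and pattern, and the "size:" input format, are made up for illustration:

```ruby
# The unit may be written in any case; the rest of the pattern may not.
unit    = /kb|mb|gb/i
pattern = /\Asize: \d+ ?#{ unit }\z/
# => /\Asize: \d+ ?(?i-mx:kb|mb|gb)\z/

'size: 512 MB' =~ pattern # => 0
'size: 512mb'  =~ pattern # => 0
'SIZE: 512 MB' =~ pattern # => nil (the outer pattern is case-sensitive)
```

Because to_s wraps the sub-pattern in a group, its alternation also stays confined to the unit part of the pattern.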

If Regexp#to_s returned just the source (without options), it would work the way you expect:

def foo_regex.to_s
  source
end

/A #{foo_regex}/i
#=> /A foo/i

The above code is just for demonstration purposes; don't do that.
