How to use Regexp.union to match a character at the beginning of my string

Question

I'm using Ruby 2.4. I want to match an optional "a" or "b" character, followed by an arbitrary amount of white space, and then one or more digits, but my regexes are failing to match any of these:

2.4.0 :017 > MY_TOKENS = ["a", "b"]
 => ["a", "b"]
2.4.0 :018 > str = "40"
 => "40"
2.4.0 :019 > str =~ Regexp.new("^[#{Regexp.union(MY_TOKENS)}]?[[:space:]]*\d+[^a-z^0-9]*$")
 => nil
2.4.0 :020 > str =~ Regexp.new("^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+[^a-z^0-9]*$")
 => nil
2.4.0 :021 > str =~ Regexp.new("^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+$")
 => nil

I'm stumped as to what I'm doing wrong.

Try `Regexp.new("\\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\\d+\\z")`. Are the `a` and `b` single character strings or can they contain more than 1 character? – Wiktor Stribiżew Mar 27 '17 at 19:11
I defined them as strings (first line of the example) -- MY_TOKENS = ["a", "b"] . Does it matter if they are strings or chars? – Dave Mar 27 '17 at 19:13
The `[...]` will form a character class matching a single char from the set. Your first example contains the character class. – Wiktor Stribiżew Mar 27 '17 at 19:14
When you say an "optional" a or b character, do you mean that there _may_ be either an a or b at the beginning of the string, or that there _must_ be either an a or b at the beginning of the string? The way you have it written, there _must_ be either an a or b at the beginning, so it's true that that regex does not match "40". – Glyoko Mar 27 '17 at 19:15
Look at the string representation of your Regexp. The union is not a character union but an expression union (`|`). – Raffael Mar 27 '17 at 19:16
@Glyoko, I mean there can be an "a", a "b", or neither at the beginning of my string (definitely not both). I thought putting a "?" after the expression would say "only get zero or one instances." – Dave Mar 27 '17 at 19:17
Please strip out the Irb prompts. They add visual noise, making the code harder to read. – the Tin Man Mar 27 '17 at 20:35

score 3 · Answer 1 · edited May 23 '17 at 12:02

If they are single characters, just use MY_TOKENS.join inside the character class:

MY_TOKENS = ["a", "b"]
str = "40"
first_regex = /^[#{MY_TOKENS.join}]?[[:space:]]*\d+[^a-z0-9]*$/
# /^[ab]?[[:space:]]*\d+[^a-z0-9]*$/ 
puts str =~ first_regex
# 0

You can also interpolate the Regexp.union result directly. It might lead to some unexpected bugs, though, because the flags of the outer regexp won't apply to the inner one:

second_regex = /^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?-mix:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ second_regex
# 0

The above regex looks a lot like what you did, but using // instead of Regexp.new prevents you from having to escape the backslashes.

You could use Regexp#source to keep the inner regexp's flags out of the pattern:

third_regex = /^(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ third_regex
# 0

or simply build the alternation yourself:

fourth_regex = /^(?:#{MY_TOKENS.join('|')})?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ fourth_regex
# 0

The last three examples should also work if MY_TOKENS contains words instead of single characters.
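One difference between those variants may matter once the tokens are more than plain letters: Regexp.union escapes regex metacharacters in string tokens, while join('|') pastes them in verbatim. A minimal sketch, assuming a hypothetical token list that contains a dot (the variable names are illustrative and not from the question):

```ruby
# Hypothetical tokens; "a." contains a regex metacharacter.
tokens = ["a.", "b"]

# Regexp.union escapes the dot, join('|') leaves it as "any character".
escaped = /^(?:#{Regexp.union(tokens).source})?[[:space:]]*\d+$/
raw     = /^(?:#{tokens.join('|')})?[[:space:]]*\d+$/

p Regexp.union(tokens).source  # "a\\.|b"
p "a. 40" =~ escaped           # 0
p "ax 40" =~ escaped           # nil
p "ax 40" =~ raw               # 0, because the unescaped dot matches "x"
```

If the tokens can come from user input, the escaped form is probably the safer default.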

first_regex, third_regex and fourth_regex should all work fine with the /i flag.

As an example:

first_regex = /^[#{MY_TOKENS.join}]?[[:space:]]*\d+[^a-z0-9]*$/i
"A 40" =~ first_regex
# 0
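To see why second_regex is left out of that list, here is a small sketch comparing the two interpolation styles under /i. The names embedded and sourced are illustrative, and the trailing [^a-z0-9]* is dropped to keep the patterns short:

```ruby
tokens = ["a", "b"]

# Interpolating the Regexp object embeds its own flags as (?-mix:...).
embedded = /^#{Regexp.union(tokens)}?[[:space:]]*\d+$/i
# Interpolating only the source lets the outer /i apply to the tokens.
sourced  = /^(?:#{Regexp.union(tokens).source})?[[:space:]]*\d+$/i

p embedded.source     # "^(?-mix:a|b)?[[:space:]]*\\d+$"
p "A 40" =~ embedded  # nil
p "a 40" =~ embedded  # 0
p "A 40" =~ sourced   # 0
```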


answered Mar 27 '17 at 19:15

Eric Duminil

I went away from the // syntax in favor of Regexp.new syntax because I was having some problems with //, and then this foolio gave me reason to think I should be using Regexp.new -- http://stackoverflow.com/questions/43024015/why-is-my-string-matching-something-in-an-array-when-that-string-doesnt-contain . Perhaps I misread his post. – Dave Mar 27 '17 at 20:17
Ah. Looks like someone's been reading my notes about why we should use `source`. :-) – the Tin Man Mar 27 '17 at 20:37
`[^a-z^0-9]` is very awkward. `[^a-z\d]` seems like it'd be clearer. At least use `[^a-z0-9]` because the second `^` introduces the caret into the character-class to be ignored, creating a potential hole in the pattern. – the Tin Man Mar 27 '17 at 20:39
I harp on it all the time. No need to link to it unless one of the answers seems to 'splain it in a way you want to share. – the Tin Man Mar 27 '17 at 20:41
@theTinMan Fun fact, the linked answer was for the same OP. – Eric Duminil Mar 27 '17 at 20:49

score 1 · Accepted Answer · edited May 23 '17 at 12:02

I believe you want to match a string that may contain any of the alternatives you defined in the MY_TOKENS, then 0+ whitespaces and then 1 or more digits up to the end of the string.

Then you need to use

Regexp.new("\\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\\d+\\z").match?(s)

or

/\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+\z/.match?(s)

When you use Regexp.new with a double-quoted string, remember to double the backslashes to define a literal backslash (e.g. "\\d" is the digit-matching pattern, whereas a bare "\d" in a double-quoted string is just the letter d). In regex literal notation, a single backslash is enough (/\d/).
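This is most likely what broke the attempts in the question. A quick sketch of how the string literal changes the pattern before Regexp.new ever sees it:

```ruby
# In a double-quoted string, "\d" is not a known escape and collapses to "d".
p "\d" == "d"                  # true
p Regexp.new("\d+").source     # "d+"
p Regexp.new("\\d+").source    # "\\d+"
# Single-quoted strings keep the backslash as written.
p Regexp.new('\d+').source     # "\\d+"

p "40" =~ Regexp.new("\d+")    # nil
p "40" =~ Regexp.new("\\d+")   # 0
```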

Do not forget to match the start of a string with \A and end of string with \z anchors.
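The reason is that in Ruby ^ and $ anchor at line boundaries, not string boundaries. A short sketch with a hypothetical two-line input:

```ruby
multi_line = "x\n40"

# ^ and $ anchor at line boundaries, so the second line alone satisfies them.
p multi_line =~ /^\d+$/    # 2
# \A and \z anchor at the very start and end of the whole string.
p multi_line =~ /\A\d+\z/  # nil
p "40" =~ /\A\d+\z/        # 0
```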

Note that [...] creates a character class that matches any char that is defined inside it: [ab] matches an a or b, [program] will match one char, either p, r, o, g, r, a or m. If you have multicharacter sequences in the MY_TOKENS, you need to remove [...] from the pattern.
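A small sketch of that difference, assuming hypothetical two-character tokens (not the ones from the question):

```ruby
# Hypothetical two-character tokens.
tokens = ["ab", "cd"]

# Character class: exactly one of the characters a, b, c or d.
char_class  = /\A[#{tokens.join}]?[[:space:]]*\d+\z/
# Alternation: the whole sequence "ab" or "cd".
alternation = /\A#{Regexp.union(tokens)}?[[:space:]]*\d+\z/

p char_class.match?("ab 40")   # false
p char_class.match?("a 40")    # true
p alternation.match?("ab 40")  # true
p alternation.match?("a 40")   # false
```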

To make the regex case insensitive, pass a case-insensitive modifier to the pattern and make sure you use the .source method of the regex created by Regexp.union to drop its flags (thanks, Eric). Wrap the interpolated source in a non-capturing group so that the ? applies to the whole alternation rather than only to its last branch:

Regexp.new("(?i)\\A(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\\d+\\z")

or

/\A(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\d+\z/i

The regex created is /(?i-mx:\A(?:a|b)?[[:space:]]*\d+\z)/ where (?i-mx) means the case-insensitive mode is on and the multiline (dot matches line breaks) and verbose modes are off.
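A caution when interpolating .source: it yields a bare alternation such as a|b, and | has the lowest precedence in a regex, so without a group around it the pattern splits into two unrelated branches. A sketch of the difference (bare and grouped are illustrative names):

```ruby
tokens = ["a", "b"]

# Bare source: parsed as  \Aa  OR  b?[[:space:]]*\d+\z
bare    = /\A#{Regexp.union(tokens).source}?[[:space:]]*\d+\z/i
# Grouped source: the ? applies to the whole alternation.
grouped = /\A(?:#{Regexp.union(tokens).source})?[[:space:]]*\d+\z/i

p bare.match?("apple")      # true
p bare.match?("xyz 40")     # true
p grouped.match?("apple")   # false
p grouped.match?("xyz 40")  # false
p grouped.match?("A40")     # true
p grouped.match?("40")      # true
```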

answered Mar 27 '17 at 19:18

Wiktor Stribiżew

Thanks, this is working. How would I make this match case insensitive? I tried Regexp.new("\\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\\d+[^a-z^0-9]*\\z", Regexp::IGNORECASE).match?(data) but it isn't working. Note that since this is somewhat of a separate question, if you prefer me to open a new SO post, I can do that. – Dave Mar 27 '17 at 19:28
Hey Regexp.new("(?i)\\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\\d+[^a-z^0-9]*\\z").match?(data) is returning false if data is "A40" . – Dave Mar 27 '17 at 19:34
Ok, that is because of the `Regexp.union` - it puts the group into `(?-imx:...)` modifier group. Let me see to it. – Wiktor Stribiżew Mar 27 '17 at 19:36
@WiktorStribiżew: That's a known bug/feature. You need `Regexp#source` to remove the inner flags. – Eric Duminil Mar 27 '17 at 19:43
@Dave: No need for a new question. You already [asked it](http://stackoverflow.com/questions/43024296/how-do-i-use-regexp-union-within-another-regular-expression) and I already [answered it](http://stackoverflow.com/questions/43024296/how-do-i-use-regexp-union-within-another-regular-expression) with the warning about case-sensitivity. ;) – Eric Duminil Mar 27 '17 at 19:47
Hi @Eric, Thanks for that reminder. But when I tried (the answer from your post), "Regexp.new("\\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\\d+[^a-z^0-9]*\\z", Regexp::IGNORECASE).match?(data)", it didn't work (it failed to match a capitalized version of my data). What's that all about? – Dave Mar 27 '17 at 19:52
@Dave: It doesn't look like anything I've posted. `first_regex`, `third_regex` and `fourth_regex` should all work fine with `/i` flag. – Eric Duminil Mar 27 '17 at 19:54
Having to use `source` in an embedded regex isn't a bug, it's just how the language should behave. If we know what we're doing and want a particular behavior, such as having the embedded sub expression, then `source` isn't needed. Unfortunately, too many people don't know why it can be a bad thing, at least until they run into it and have to spend hours debugging to figure out why their pattern has a hole in it. – the Tin Man Mar 27 '17 at 20:44
@WiktorStribiżew: `Eric` links to your current answer ;) – Eric Duminil Mar 27 '17 at 20:51
