Split a large regular expression across multiple lines

Question

I have this regular expression:

INVALID_NAMES = /\b(bib$|costumes$|httpanties?|necklace|cuff link|cufflink|scarf|pendant|apron|buckle|beanie|hat|ring|blanket|polo|earrings?|plush|pacifier|tie$|panties|boxers?|slippers?|pants?|leggings|ibattz|dress|bodysuits?|charm|battstation|tea|pocket ref|pajamas?|boyshorts?|mimopowertube|coat|bathrobe)\b/i

and it works as written, but I want to write something like this:

INVALID_NAMES = /\b(bib$|costumes$|httpanties?|necklace|cuff link|
                    cufflink|scarf|pendant|apron|buckle|beanie|hat|ring|blanket|
                    polo|earrings?|plush|pacifier|tie$|panties|boxers?|
                    slippers?|pants?|leggings|ibattz|dress|bodysuits?|charm|
                    battstation|tea|pocket ref|pajamas?|boyshorts?|
                    mimopowertube|coat|bathrobe)\b/i

but with the second version the words cufflink, polo, slippers?, battstation and mimopowertube are no longer matched, because of the whitespace in front of each of them. For example:

(this space before the word)cufflink

I'd be very grateful for any help.

score 3 · Answer 1 · answered Dec 07 '14 at 23:15


You could build the regex from an array of fragments, something like this:

INVALID_NAMES = [
  "bib$",
  "costumes$",
  "httpanties?",
  "necklace"
]
INVALID_NAMES_REGEX = /\b(#{INVALID_NAMES.join '|'})\b/i
p INVALID_NAMES_REGEX
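
As a rough sketch of how this scales to a longer list: the array can be wrapped and indented freely, because only the joined string reaches the regex. The names `NAME_FRAGMENTS` and `NAMES_REGEX` and the shortened word list are assumptions made for illustration. `match?` assumes Ruby 2.4 or later.

```ruby
# Each entry is a regex fragment, not a literal word, so `$` and `?`
# keep their regex meaning once interpolated. (Hypothetical short list.)
NAME_FRAGMENTS = [
  'bib$', 'costumes$', 'httpanties?', 'necklace',
  'cuff link', 'cufflink', 'scarf', 'tie$',
  'earrings?', 'pocket ref'
].freeze

# Indentation inside the array never becomes part of the pattern.
NAMES_REGEX = /\b(#{NAME_FRAGMENTS.join('|')})\b/i

p NAMES_REGEX.match?('Silver Cufflink set') # => true
p NAMES_REGEX.match?('baby bib')            # => true
p NAMES_REGEX.match?('bib overalls')        # => false, "bib" is not at end of line
```

Because the entries are interpolated unescaped, any entry meant as literal text would have to go through `Regexp.escape` first.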

Fuelen


I really like your approach :D but I did: INVALID_NAMES = /\b(#{["word1", "word2", "etc"].join('|')})\b/i. Thanks, man! – nisevi Dec 07 '14 at 23:52
@CarySwoveland The thing is that "bib$" is a string, and "bib$"[INVALID_NAMES_REGEX] is not going to work because the literal text "bib$" is not what INVALID_NAMES_REGEX looks for (it looks for "bib"). The "$" symbol only denotes the end of the line in the regular expression, as you can see here: http://rubular.com/ – nisevi Dec 08 '14 at 01:07
My apologies. I was mixed-up. See my comment to gnome's answer. – Cary Swoveland Dec 08 '14 at 06:44

score 2 · Accepted Answer · answered Dec 08 '14 at 00:36

Construct Your Regex with the Space-Insensitive Flag

You can use the space-insensitive flag (/x, Ruby's extended mode) to ignore whitespace and comments in your regular expression. Note that once you enable this flag you will need \s or another explicit form, such as an escaped space, wherever the pattern must match whitespace, since /x causes plain spaces in the pattern to be ignored.

Consider the following example:

INVALID_NAMES =
    /\b(bib$          |
        costumes$     |
        httpanties?   |
        necklace      |
        cuff\slink    |
        cufflink      |
        scarf         |
        pendant       |
        apron         |
        buckle        |
        beanie        |
        hat           |
        ring          |
        blanket       |
        polo          |
        earrings?     |
        plush         |
        pacifier      |
        tie$          |
        panties       |
        boxers?       |
        slippers?     |
        pants?        |
        leggings      |
        ibattz        |
        dress         |
        bodysuits?    |
        charm         |
        battstation   |
        tea           |
        pocket\sref   |
        pajamas?      |
        boyshorts?    |
        mimopowertube |
        coat          |
        bathrobe
    )\b/ix

Note that you can format it in many other ways, but having one expression per line makes it easier to sort and edit your sub-expressions. If you want it to have multiple alternatives per line, you could certainly do that.

Making Sure It Works

You can see that the expression above works as intended with the following examples:

'cufflink'.match INVALID_NAMES
#=> #<MatchData "cufflink" 1:"cufflink">

'cuff link'.match INVALID_NAMES
#=> #<MatchData "cuff link" 1:"cuff link">
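
The flag also allows inline comments and several alternatives per line. Here is a minimal sketch of that, with `SHORT_NAMES` and its shortened word list being hypothetical names used only for illustration. It assumes Ruby 2.4+ for `match?`. Comments inside the literal must avoid `/` and `#{`, since the Ruby lexer still sees them.

```ruby
# With the x flag, unescaped whitespace in the pattern is ignored and
# everything from # to the end of the line is a comment.
SHORT_NAMES = /\b(
    cuff\ link | cufflink    # an escaped space matches one literal space
  | pocket\sref              # \s matches any whitespace character
  | earrings? | tie$         # several alternatives can share a line
)\b/ix

p 'cuff link'.match?(SHORT_NAMES)    # => true
p 'cufflink'.match?(SHORT_NAMES)     # => true
p "pocket\tref".match?(SHORT_NAMES)  # => true, \s also accepts a tab
p 'cufflinks'.match?(SHORT_NAMES)    # => false, no word boundary after "cufflink"
```

An escaped space is stricter than `\s`: it accepts only a single space character, which may be closer to the original `cuff link` alternative.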

Thanks for the solution you gave me. I marked it as correct because it is the most "regexp" approach among the solutions presented here. – nisevi, Dec 08 '14 at 04:31
There's a problem with `bib$`, because of `$`: `'bib$'.match INVALID_NAMES #=> nil`. — Cary Swoveland, Dec 08 '14 at 06:08
@CarySwoveland It works fine for me: `'bib'.match INVALID_NAMES #=> #<MatchData "bib" 1:"bib">`. YMMV. – Todd A. Jacobs, Dec 08 '14 at 06:20
Am I missing something? Are `?` and `$` intended to have some special significance, or are they just characters? My point is that if `'bib$'` (not `'bib'`) is among the 'invalid words' and a string contains `'bib$'` (not `'bib'`), it won't be caught. — Cary Swoveland, Dec 08 '14 at 06:27
@CarySwoveland `$` is end-of-line. `?` is zero-or-one. Unless escaped, they are atoms, not characters. Basic regular expression stuff. The question is not about whether the OP's regular expressions will match a given corpus, but about spacing and line wraps, so you're *way* overthinking this. — Todd A. Jacobs, Dec 08 '14 at 06:39
Yes, of course! I was confused between words and the regexes to find them. Not the OP's fault--it's very clear. Sorry for the trouble. — Cary Swoveland, Dec 08 '14 at 06:43

score 1 · Answer 3 · answered Dec 07 '14 at 23:14

When you add a newline in the middle of a regex literal, the newline becomes a part of the regular expression. Look at this example:

"ab" =~ /ab/ # => 0

"ab" =~ /a
b/ # => nil

"a\nb" =~ /a
b/ # => 0

You can suppress the newline by appending a backslash at the end of the line:

"ab" =~ /a\
b/ # => 0

Applied to your regex (leading spaces also removed):

INVALID_NAMES = /\b(bib$|costumes$|httpanties?|necklace|cuff link|\
cufflink|scarf|pendant|apron|buckle|beanie|hat|ring|blanket|\
polo|earrings?|plush|pacifier|tie$|panties|boxers?|\
slippers?|pants?|leggings|ibattz|dress|bodysuits?|charm|\
battstation|tea|pocket ref|pajamas?|boyshorts?|\
mimopowertube|coat|bathrobe)\b/i
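
The leading spaces had to go because, without the x flag, indentation is as much a part of the pattern as the newline is. A small check of that, using a hypothetical two-word pattern (`WRAPPED` is a name made up for this sketch, and `match?` assumes Ruby 2.4+):

```ruby
# Hypothetical two-word pattern, wrapped the way the question wraps it.
WRAPPED = /\b(cuff link|
              cufflink)\b/i

# The line break and the indentation are stored in the pattern itself.
p WRAPPED.source.include?("\n")          # => true
p WRAPPED.source.include?('  cufflink')  # => true

# The second alternative now requires that newline and those spaces.
p WRAPPED.match?('cufflink')   # => false
p WRAPPED.match?('cuff link')  # => true
```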

Yeah, it's a good approach, but I really want to keep the indentation, so I can't apply your solution. Thanks anyway... – nisevi, Dec 07 '14 at 23:56

Cary Swoveland · Answer 4 · 2014-12-08T06:51:49.690

0

You might do it like this:

INVALID_NAMES = ['necklace', 'cuff link', 'cufflink', 'scarf', 'tie?', 'bib$']
r = Regexp.union(INVALID_NAMES.map { |n| /\b#{n}\b/i })

str = 'cat \n  cufflink bib cuff link. tie Scarf\n cow necklace? \n  ti. bib'
str.scan(r)
  #=> ["cufflink", "cuff link", "tie", "Scarf", "necklace", "ti", "bib"]
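
One detail worth checking when using `Regexp.union`, since it touches the `bib$` confusion in the comments: it escapes String arguments but leaves Regexp arguments untouched. A short sketch with hypothetical variable names:

```ruby
# Strings passed to Regexp.union are escaped and match literally.
literal = Regexp.union('bib$', 'tie?')
# Regexp arguments keep their syntax: `$` is end of line, `?` is optional.
pattern = Regexp.union(/bib$/, /tie?/)

p literal.source     # => "bib\\$|tie\\?"
p 'bib$' =~ literal  # => 0
p 'bib'  =~ literal  # => nil
p 'bib'  =~ pattern  # => 0
p 'bib$' =~ pattern  # => nil
```

The answer above interpolates each entry into a regex before the union, so its entries behave like `pattern` here, not like `literal`.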

edited Dec 08 '14 at 06:51

answered Dec 07 '14 at 23:20


score 0 · Answer 5 · edited May 23 '17 at 12:13


Your patterns are inefficient and will cause the Regexp engine to thrash badly.

I'd recommend you investigate what Perl's Regexp::Assemble can do to help your Ruby code:


answered Dec 08 '14 at 05:26

the Tin Man

