How to split a string without getting an empty string inserted in the array

Question

I'm having trouble splitting a character from a string using a regular expression, assuming there is a match.

I want to split off either an "m" or an "f" character from the start of a string, provided that character is followed by one or more digits, then optional whitespace, then one of the strings from an array I have.

I tried:

2.4.0 :006 > MY_SEPARATOR_TOKENS = ["-", " to "]
 => ["-", " to "] 
2.4.0 :008 > str = "M14-19"
 => "M14-19" 
2.4.0 :011 > str.split(/^(m|f)\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)}/i)
 => ["", "M", "19"]

Notice the extraneous "" element at the beginning of my array, and also notice that the last element is just "19", whereas I want the rest of the string ("14-19").

How do I adjust my regular expression so that only the parts of the expression that get split end up in the array?

I don't know what you mean by "split off" or "first part of a string". If the original string is `str` do you wish to return `str[1..-1]` when `str[0] =~ /[mf]/i` and the two other conditions are satisfied? What is to be returned if there is no match, `str`? — Cary Swoveland, Mar 11 '17 at 00:14

score 4 · Answer 1 · edited Mar 11 '17 at 00:18

I find match to be a bit more elegant when extracting parts of a string with a regular expression in Ruby:

string = "M14-19"
string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)[1, 2]
=> ["M", "14-19"]
# the named captures can also be read by name from the MatchData
extract_string = string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)
[extract_string[:m], extract_string[:digits]]
=> ["M", "14-19"]
string = 'M14 to 14'
extract_string = string.match(/\A(?<m>[MF])(?<digits>\d{2}(-| to )\d{2})/)[1, 2]
=> ["M", "14 to 14"]
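One caveat with this approach: match returns nil when the string does not fit the pattern, so indexing the result directly raises NoMethodError. The sketch below is one possible way to guard against that. The names `split_prefix`, `PATTERN` and `SEPARATORS` are hypothetical, and building the alternation from the question's token array with `Regexp.escape` is an assumption about what the asker wants, not part of the answer above.

```ruby
MY_SEPARATOR_TOKENS = ["-", " to "]

# Escape each token so it is matched literally, then join into an alternation.
SEPARATORS = MY_SEPARATOR_TOKENS.map { |t| Regexp.escape(t) }.join("|")

# prefix: the leading m/f; rest: digits, optional whitespace, a token, and
# whatever follows up to the end of the string.
PATTERN = /\A(?<prefix>[mf])(?<rest>\d+[[:space:]]*(?:#{SEPARATORS}).*)\z/i

# Returns [prefix, rest] on a match, or the untouched string in an array.
def split_prefix(str)
  m = PATTERN.match(str)
  m ? [m[:prefix], m[:rest]] : [str]
end

split_prefix("M14-19")   # => ["M", "14-19"]
split_prefix("X14-19")   # => ["X14-19"]
```

Returning `[str]` for the no-match case mirrors what split does when the pattern is not found. Returning nil instead would be an equally reasonable choice.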

score 3 · Answer 2 · edited May 23 '17 at 12:26

You have a bug brewing in your code. Don't get in the habit of doing this:

#{Regexp.union(MY_SEPARATOR_TOKENS)}

You're setting yourself up for a very hard-to-debug problem.

Here's what's happening:

regex = Regexp.union(%w(a b)) # => /a|b/
/#{regex}/ # => /(?-mix:a|b)/
/#{regex.source}/ # => /a|b/

/(?-mix:a|b)/ is an embedded sub-pattern with its own set of the regex flags m, i and x, which are independent of the surrounding pattern's settings.

Consider this situation:

'CAT'[/#{regex}/i] # => nil

We'd expect the i flag to make the match succeed because it ignores case, but the sub-expression still allows only lowercase characters, so the match fails.

Using the bare (a|b) or adding source succeeds because the inner expression gets the main expression's i:

'CAT'[/(a|b)/i] # => "A"
'CAT'[/#{regex.source}/i] # => "A"

See "How to embed regular expressions in other regular expressions in Ruby" for additional discussion of this.
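Applied to the pattern from the question, the difference can be sketched as follows. The variable names `union`, `embedded` and `with_source` are made up for illustration. Note the extra `(?:...)` around the interpolated source: without it, the `|` in the union would become a top-level alternative of the enclosing group.

```ruby
MY_SEPARATOR_TOKENS = ["-", " to "]
union = Regexp.union(MY_SEPARATOR_TOKENS)

# Interpolating the Regexp itself embeds (?-mix:...), so /i does not reach " to ".
embedded = /^(m|f)(?=\d+[[:space:]]*#{union})/i

# Interpolating only the source lets the outer /i apply to the tokens as well.
with_source = /^(m|f)(?=\d+[[:space:]]*(?:#{union.source}))/i

no_split = "M14 To 19".split(embedded)             # => ["M14 To 19"]
split_ok = "M14 To 19".split(with_source).drop(1)  # => ["M", "14 To 19"]
```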

Cary Swoveland · Answer 3 · 2017-03-11T06:44:38.240

 TOKENS = ["-", " to "]

 r = /
     (?<=\A[mMfF])             # match the beginning of the string and then one
                               # of the 4 characters in a positive lookbehind
     (?=                       # begin positive lookahead
       \d+                     # match one or more digits
       [[:space:]]*            # match zero or more spaces
       (?:#{TOKENS.join('|')}) # match one of the tokens
     )                         # close the positive lookahead
     /x                        # free-spacing regex definition mode

(?:#{TOKENS.join('|')}) is replaced by (?:-| to ).
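One thing to watch for in free-spacing mode: literal whitespace in the pattern is ignored, and that includes the spaces interpolated from " to ". The sketch below illustrates the effect and one possible remedy, escaping each token with `Regexp.escape`. The names `raw`, `escaped`, `loose` and `strict` are hypothetical and not part of the answer.

```ruby
TOKENS = ["-", " to "]

raw     = TOKENS.join('|')                               # spaces left as-is
escaped = TOKENS.map { |t| Regexp.escape(t) }.join('|')  # spaces become "\ "

# Under /x the unescaped spaces around "to" are dropped from the pattern.
loose  = /(?<=\A[mMfF]) (?=\d+ [[:space:]]* (?:#{raw}))/x
strict = /(?<=\A[mMfF]) (?=\d+ [[:space:]]* (?:#{escaped}))/x

loose_result  = "M14to19".split(loose)     # => ["M", "14to19"]
strict_result = "M14to19".split(strict)    # => ["M14to19"]
spaced_result = "M14 to 19".split(strict)  # => ["M", "14 to 19"]
```

If "to" without surrounding spaces should never count as a separator, the escaped form is the safer choice under /x.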

This can of course be written in the usual way.

r = /(?<=\A[mMfF])(?=\d+[[:space:]]*(?:#{TOKENS.join('|')}))/

When splitting on r you are splitting between two characters (between a positive lookbehind and a positive lookahead) so no characters are consumed.

"M14-19".split r
  #=> ["M", "14-19"]
"M14     to 19".split r
  #=> ["M", "14     to 19"]
"M14     To 19".split r
  #=> ["M14     To 19"]

If it is desired that ["M", "14 To 19"] be returned in the last example, change [mMfF] to [mf] and /x to /xi.
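A minimal sketch of that case-insensitive variant, written in the single-line form (so only the i flag is needed); `r_ci` is a hypothetical name:

```ruby
TOKENS = ["-", " to "]

# [mf] plus the i flag replaces [mMfF]; the i flag also applies to " to ".
r_ci = /(?<=\A[mf])(?=\d+[[:space:]]*(?:#{TOKENS.join('|')}))/i

mixed_case = "M14     To 19".split(r_ci)  # => ["M", "14     To 19"]
lower_case = "f3-7".split(r_ci)           # => ["f", "3-7"]
```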

The line `(?:#{TOKENS.join('|')})` was formerly `#{Regexp.union(TOKENS)}`. I changed it after reading @thetinman's answer, which clarified my understanding of [Regexp::union](http://ruby-doc.org/core-2.3.0/Regexp.html#method-c-union) and its potential pitfalls. — Cary Swoveland, Mar 11 '17 at 06:41

score 2 · Accepted Answer · edited May 23 '17 at 10:30


The empty element will always be there if you get a match, because the captured part appears at the beginning of the string and the string between the start of the string and the match is added to the resulting array, be it an empty or non-empty string. Either shift/drop it once you get a match, or just remove all empty array elements with .reject { |c| c.empty? } (see How do I remove blank elements from an array?).
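The two clean-up options behave differently when the pattern does not match, which the following sketch illustrates. It uses a simplified pattern and hypothetical variable names, purely for demonstration.

```ruby
pattern = /^(m|f)/i

matched   = "M14-19".split(pattern)  # => ["", "M", "14-19"]
unmatched = "X14-19".split(pattern)  # => ["X14-19"]

# drop(1) removes the first element whether or not it is empty.
dropped_matched   = matched.drop(1)    # => ["M", "14-19"]
dropped_unmatched = unmatched.drop(1)  # => []

# reject only removes elements that are actually empty.
rejected_matched   = matched.reject(&:empty?)    # => ["M", "14-19"]
rejected_unmatched = unmatched.reject(&:empty?)  # => ["X14-19"]
```

So drop(1) is only safe once a match is known to have occurred, while reject keeps the original string intact when there is no match.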

Then, 14- is eaten up (consumed) by the \d+[[:space:]]... pattern part - put it into a (?=...) lookahead that will just check for the pattern match, but won't consume the characters.

Use something like

MY_SEPARATOR_TOKENS = ["-", " to "]
s = "M14-19"
p s.split(/^(m|f)(?=\d+[[:space:]]*#{Regexp.union(MY_SEPARATOR_TOKENS)})/i).drop(1)
#=> ["M", "14-19"]

See Ruby demo

Wiktor Stribiżew · answered Mar 10 '17 at 23:06

Can you replace `reject` with `drop 1`, which would be more descriptive? – Cary Swoveland Mar 11 '17 at 01:21
@CarySwoveland: I modified the code example. `.drop(1)` is just another way to get rid of the first array element, but in most cases removing all empty elements is the expected behavior. – Wiktor Stribiżew Mar 11 '17 at 08:52
