
I have a string:

"Ayy ***lol* m8\nlol"

I would like to not include the empty capture and produce:

["Ayy ", "**", "*", "lol", "*", " m8", "\n", "lol"]

I am splitting the string by this regex:

/(?x)(\*\*|\*|\n|[.])/

This produces:

["Ayy ", "**", "", "*", "lol", "*", " m8", "\n", "lol"]
sawa
0x777C
  • Why not just remove all empty array items? Consecutive matches when splitting always produce empty array items. Rather than switch to *matching* rather than splitting, use `rarr = arr.reject { |c| c.empty? }` – Wiktor Stribiżew Jul 09 '18 at 16:51
  • I'd like to consider all possibilities before opting for the obvious, because that doesn't seem like the cleanest solution possible – 0x777C Jul 09 '18 at 16:55
  • And do you think re-vamping totally the pattern is a "clean" approach? Then please share what you have done so far. – Wiktor Stribiżew Jul 09 '18 at 16:56
  • I just came up with `@unformated.scan(/(?x)(\*\*|\*|\n|[^\*\*|\*|\n]+)/)` but I really dislike the redundancy – 0x777C Jul 09 '18 at 17:01
  • Ok, it is "close", but what you call redundancy is unavoidable. However, correct pattern is [`.scan(/(?x)\*{2}|[*\n.]|(?:(?!\*{2})[^*\n.])+/)`](https://regex101.com/r/g2oFXC/2). As you see, removing empty array items is much cleaner. – Wiktor Stribiżew Jul 09 '18 at 17:03
  • I see what you mean, that said answer with that regex so I can mark it as the answer anyway (since it is the appropriate answer to what I asked) – 0x777C Jul 09 '18 at 17:06
  • It's an interesting question but clarification is required (preferably with an edit). Your question is framed around a specific string (example). At issue is how the string might differ from the one given and for each variant what is the desired result? For example, can there be more than three asterisks in a row and if so, are they to be grouped in pairs, possibly with a single asterisk in the last group? Can there be more than one newlines (`"\n"`) in a row, in which case are they to be grouped together or split up? And so on. – Cary Swoveland Jul 10 '18 at 17:12
  • They are to be split up, I know of no way to make it clearer. – 0x777C Jul 10 '18 at 18:39

3 Answers


Here is a simplified version of your regex, chained with a method to remove empty strings. The removal is unavoidable when using String#split here, because the two consecutive matches in '***' leave an empty string between them:

string = "Ayy ***lol* m8\nlol"


string.split(/(\*{1,2}|\n|\.)/).reject(&:empty?)
  #=> ["Ayy ", "**", "*", "lol", "*", " m8", "\n", "lol"] 

A few differences from your pattern:

  • I have removed the (?x); this served no purpose. Extended patterns are useful for ignoring spaces and comments within the regex - neither of which you are doing here.
  • \*\*|\* can be simplified to \*{1,2} (or \*\*? if you prefer).
  • [.] is technically fine, but \. is one character shorter and in my opinion shows clearer intent.
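To see where the empty string comes from, here is a small sketch that runs the same split on the question's string with and without the `reject` step. The variable names (`pattern`, `raw`, `cleaned`) are only illustrative.

```ruby
string  = "Ayy ***lol* m8\nlol"
pattern = /(\*{1,2}|\n|\.)/

# Without reject, the zero-length gap between the adjacent matches
# "**" and "*" is kept as an empty string.
raw = string.split(pattern)

# Dropping the empty strings leaves only the text and the delimiters.
cleaned = raw.reject(&:empty?)

p raw
p cleaned
```

The only empty element sits between `"**"` and `"*"`, which is why a single `reject` is enough for this input.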
Tom Lord

When splitting with a regex containing capturing groups, consecutive matches always produce empty array items.

Rather than switch to a matching approach, use

arr = arr.reject { |c| c.empty? }

Or use any other method; see "How do I remove blank elements from an array?"

Otherwise, you will have to match the substrings with a regex that matches the delimiters first and then any text that does not start a delimiter (that is, you will need to build a tempered greedy token):

arr = s.scan(/(?x)\*{2}|[*\n.]|(?:(?!\*{2})[^*\n.])+/)

See the regex demo.

Here,

  • (?x) - a freespacing/comment modifier
  • \*{2} - ** substring
  • | - or
  • [*\n.] - a char that is either *, newline LF or a .
  • | - or
  • (?:(?!\*{2})[^*\n.])+ - 1 or more (+) chars that are not *, LF or . ([^*\n.]) that do not start a ** substring.
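As a rough check, the two approaches can be run side by side on the question's string. This is only a sketch: `by_scan` and `by_split` are illustrative names, and the agreement is shown for this one input, not proven in general.

```ruby
s = "Ayy ***lol* m8\nlol"

# Matching approach: delimiters first, then runs of non-delimiter text.
by_scan = s.scan(/(?x)\*{2}|[*\n.]|(?:(?!\*{2})[^*\n.])+/)

# Splitting approach: capture the delimiters, then drop empty strings.
by_split = s.split(/(\*{2}|[*\n.])/).reject(&:empty?)

p by_scan
p by_split
```

Both calls return the same array here, but the `split` version needs no lookahead and lists each delimiter only once.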
Wiktor Stribiżew
r = /
    [ ]+    # match one or more spaces
    |       # or
    (\*)    # match one asterisk in capture group 1
    [ ]*    # match zero or more spaces
    (?!\*)  # not to be followed by an asterisk (negative lookahead)
    |       # or
    (\n)    # match "\n" in capture group 2
    /x      # free-spacing regex definition mode

str = "Ayy ***lol* m8\nlol"

str.split r
  #=> ["Ayy", "**", "*", "lol", "*", "m8", "\n", "lol"]
Cary Swoveland
  • This solution does not guarantee there won't be empty items in the resulting array. See [this Ruby demo](https://ideone.com/gcKh1q). Also, I understand `(\*)[ ]*(?!\*)` is used to match any `*` that is not followed with `*` that may have any amount of spaces in between, right? If so, you need `(\*)[ ]*+(?!\*)` to avoid backtracking into `[ ]+`. – Wiktor Stribiżew Jul 10 '18 at 06:47
  • What happens if there are more than 3 consecutive `*`s? For example, if `str = "Ayy ******lol* m8\nlol"` then: `str.split r #=> ["Ayy", "*****", "*", "lol", "*", "m8", "\n", "lol"]`. Is that the desired behaviour? I'm not sure, but it's different to how OP's version would behave. – Tom Lord Jul 10 '18 at 09:38
  • @Wiktor, I've asked for clarification of the question. Regardless, can you elaborate on the need to change `[ ]*` to `[ ]*+`? (`x*+` is new to me.) Also, omitting the plus sign seems to work here. Can you give an example where `[ ]*+` works but `[ ]*` doesn't? – Cary Swoveland Jul 10 '18 at 17:23
  • @Tom, I've asked for clarification of the question. – Cary Swoveland Jul 10 '18 at 17:23
  • @CarySwoveland `\* *(?!\*)` will backtrack sooner or later, thus you should use possessive quantifiers in these cases. Or include the space into the lookahead, `\* *(?![* ])`. – Wiktor Stribiżew Jul 11 '18 at 06:29
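The backtracking difference discussed in these comments can be seen on a tiny input. The sketch below uses a made-up test string, `"* *"`, and illustrative names; it compares the plain quantifier, the possessive quantifier, and the variant with the space inside the lookahead.

```ruby
text = "* *"

plain      = /\*[ ]*(?!\*)/     # [ ]* may give the space back
possessive = /\*[ ]*+(?!\*)/    # [ ]*+ never gives the space back
lookahead  = /\*[ ]*(?![* ])/   # space is rejected by the lookahead itself

# Index of the first match for each pattern.
plain_at      = text.index(plain)
possessive_at = text.index(possessive)
lookahead_at  = text.index(lookahead)

p [plain_at, possessive_at, lookahead_at]
```

With the plain quantifier, the first asterisk still matches after `[ ]*` backtracks to zero spaces, even though another asterisk follows the space. The other two patterns skip it and match only the last asterisk.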