9

Is there any regex engine which would allow me to match multiple heredoc strings on an expression? E.g., as one would write in Ruby:

f <<FOO, 10, <<BAR, 20
some text
FOO
some more text
BAR

I tought of using backrefs and recursive calling in Perl's flavor, but couldn't manage to make the cross-serial dependencies work (i.e., couldn't reverse the captured backrefs, as FOO should match before BAR). I also thought of balancing groups on .Net, where I can reverse the stack by using lookaheads (I know, this is a terrible hack), like this:

(?:(?<x>foo|bar|baz)|\s)+(?(x)|(?!))\s*(?(x)(?=(.*?)(?<-x>(?<y>\k<x>)))){3}(?(x)(?!))(?:(?(y)(?<-y>\k<y>))|\s)+(?(x)(?!))(?(y)(?!))

(Click here to test it.)

That matches foo bar baz foo bar baz, but then I have to add a manual counter (the {3}), since the lookahead won't repeat with + since it doesn't consume any input I assume. Thus this won't work on arbitrary cases (but it was close!). I could, of course, replace that by {1000} or any other big number, and that would answer my question, but I wonder if there are other ways.

Acknowledgment: I do understand it is not a good idea to match such kind of construct with regexes. I am doing research work on such, and I want to find out if it is possible. If it is, please do not use it in production code.

Community
  • 1
  • 1
paulotorrens
  • 2,286
  • 20
  • 30
  • Are you looking to get the contents of the heredoc? – Laurel Jul 31 '16 at 20:43
  • Not particularly; I'm working with a few parsing algorithms, and I'm trying to reduce regexes to linear indexed grammars so that I can always match in cubic time, and I'm trying to figure out which constraints I should add to make that possible. This would be an extreme scenario of abusing regex' features. – paulotorrens Jul 31 '16 at 20:48
  • Something like this: https://regex101.com/r/pC5gZ3/1 ? – Casimir et Hippolyte Jul 31 '16 at 20:50
  • Not really; it does not enforce both tags to appear in order. I might remove the `FOO` from the middle of the string and it would still match; in a worst case scenario, I could have `BAR\nFOO\nBAR`, the first `BAR` being ignored as it was part of the `FOO` heredoc string. – paulotorrens Jul 31 '16 at 20:53
  • 1
    It does. Did you tried to remove the FOO in the middle? – Casimir et Hippolyte Jul 31 '16 at 20:57
  • Sorry, my bad. When removing the `FOO` in the middle it would match a smaller expression (by ignoring the `< – paulotorrens Jul 31 '16 at 21:05
  • 1
    In this case, more like this: https://regex101.com/r/cR5bD5/1 – Casimir et Hippolyte Jul 31 '16 at 21:23
  • Really interesting approach. I'm still checking it, but could it be augmented to recognize any number of heredoc tags? E.g., `< – paulotorrens Jul 31 '16 at 21:28
  • 1
    Yes, a token was at the wrong place: https://regex101.com/r/fZ3kI5/1 – Casimir et Hippolyte Jul 31 '16 at 21:42
  • Impressive! Thanks a lot! One little problem, though: it doesn't seem to work when I'm repeating the tag name, which is accepted in Ruby; e.g., `< – paulotorrens Jul 31 '16 at 21:53
  • 2
    You couldn't call it a little problem. – revo Jul 31 '16 at 22:03
  • Indeed, I shouldn't. I meant it was a corner case. Trying to fix it myself. =P – paulotorrens Jul 31 '16 at 22:13
  • 1
    Thank you very much, @CasimiretHippolyte! Once I've understood your thinking, I've been able to write the regex: [https://regex101.com/r/dM5qJ8/1](https://regex101.com/r/dM5qJ8/1). This seems to work on all cases. If you post it as an answer, I'll gladly accept it. :D – paulotorrens Jul 31 '16 at 23:05
  • https://regex101.com/r/dM5qJ8/3 @PauloTorrens – revo Aug 01 '16 at 09:07
  • This was expected, @revo. I would need to anchor the expression (in a CFG-like `(?define)` structure) in order to actually fully-"parse" the whole expression. – paulotorrens Aug 01 '16 at 12:45
  • 1
    In future it would help ease people understanding your problem if you showed output you get, vs output you expect, rather than a wall of text. Helps people get to grips with your problem faster, and you'll likely get a swifter response addressing your desired outcome. – JGFMK Jul 02 '18 at 17:40

1 Answers1

3

Since your primary question is "is it possible with regex", I would like to start by sharing my favorite regex information site. Specifically, How does the regex engine work? Learning that will give you a better idea of how regex works, and why trying to step out of it's very well defined box quickly devolves into broken souls and CPUs.

The key takeaway though is that at any point, the regex engine only has 2 pieces of information.

  1. What token am I trying to match?
  2. What is the next token in the string?

This is easy to forget because unlike steam parsing, the regex engine can backtrack when it fails a match (which, it usually is doing a lot of). And CPUs are fast enough that it can fail millions of times per second! While Regex may appear to have more memory because it can match "The first cat after dog", it only knows it has seen the word dog, because it is currently looking for the c in cat. Or in other words, the current state is only possible because a specific precondition(s) was met.

With a finite number of permutations of a pattern, a long enough regex can match anything. (The length of that regex may be soul/CPU crushing, but technically possible)

Where the pattern is not finite, like "match a number of a followed by an equal number of bs" (ex "ab" "aabb" "aaabbb" ect.) Regex has no mechanism to remember how many a's it has seen, so it doesn't know how many b's to match. You can work around this by trying to match all the variations (ab|aabb|aaabbb|aaaabbbb|...), but this would be insanely expensive to parse, and you would fail to capture every valid input because I can always add one more ab pair.

So really, you need to ask yourself 2 questions

  1. Is there a finite number of permutations I care to match?
  2. Will my soul/CPU survive such regex?

Related, you should probably read up on Nondeterministic finite automaton. Since you are asking for research reasons, and any pure regex engine is a NDFA, it helps to know they suffer from the same known limitations.

TL:DR;

Using a pure regex engine...

Practically, Yes, but at the cost of a soul.
Theoretically, No, not at all. There will always be a valid case your regex will fail.

Tezra
  • 8,463
  • 3
  • 31
  • 68