I have a rather silly problem that has been stumping me for a while...
I want to parse some text, formatted this way:

CUT-FROM-A ...
CUT-FROM-B ...
CUT-TO ...
CUT-TO
apple
CUT-FROM-C ...
CUT-TO
orange

In this example, I would like to extract the 'fruits', ignoring everything from a CUT-FROM-X to the corresponding CUT-TO. By 'corresponding' I mean "from inside to outside", or if it's clearer, try mentally substituting any CUT-FROM-X with an open bracket, and any CUT-TO with a closed bracket: then, I want to ignore the content inside the brackets, including the brackets.
I hope this is clear, but I'm afraid it's not... :-(
I suppose the main difficulty here is that the 'closing brackets' all have the same signature, so they can't easily be associated with their corresponding opener...

I have tried something like this (not greedy):

$output_text = preg_replace("/CUT-FROM-.*?TO/s", "", $input_text);

but this leaves the second CUT-TO in the output...

And something like this (greedy):

$output_text = preg_replace("/CUT-FROM-.*TO/s", "", $input_text);

but this eats the first 'fruit'... :-(

This is my testing on regex101.

Can anybody shed some light on this?

MarcoS

3 Answers

3

Since you're asking for a regex solution, a readable recursive regex would be:

(?(DEFINE)
  (?<cut>
    ^CUT-FROM-
    (?&content)*?
    ^CUT-TO
  )

  (?<content>
    (?: (?!CUT-(?:FROM-|TO)) . )++
    | (?&cut)
  )
)

(?&cut)

Demo

Use with the smx options. This matches everything you want to ignore, so you can replace it with an empty string. The syntax (?&something) means recurse into something, it's the same as \g<something>.
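As a minimal sketch of how this could be wired into PHP, the pattern above is wrapped in `~` delimiters with the `smx` flags and passed to preg_replace(). The variable names and the final preg_split() step are assumptions added for illustration; they are not part of the answer.

```php
<?php
// Pattern copied from the answer; only the ~ delimiters and smx flags are added.
$pattern = <<<'REGEX'
~
(?(DEFINE)
  (?<cut>
    ^CUT-FROM-
    (?&content)*?
    ^CUT-TO
  )

  (?<content>
    (?: (?!CUT-(?:FROM-|TO)) . )++
    | (?&cut)
  )
)

(?&cut)
~smx
REGEX;

// Sample input from the question.
$input = implode("\n", [
    'CUT-FROM-A ...',
    'CUT-FROM-B ...',
    'CUT-TO ...',
    'CUT-TO',
    'apple',
    'CUT-FROM-C ...',
    'CUT-TO',
    'orange',
]);

// Remove every balanced CUT-FROM-... / CUT-TO block.
$stripped = preg_replace($pattern, '', $input);

// Illustrative extra step: drop the empty lines left behind by the removal.
$fruits = preg_split('/\R/', $stripped, -1, PREG_SPLIT_NO_EMPTY);

print_r($fruits);
```

Note that the match ends right after the closing `CUT-TO`, so any text following it on the same line would stay in the output.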

And here's a more compact version that does essentially the same thing:

^CUT-FROM-
(?:(?:(?!CUT-(?:FROM-|TO)) . )++ | (?R))*?
^CUT-TO

Demo

In this version, (?R) means recurse the whole pattern. It still uses the smx options. The one-liner version (without x) would be:

(?sm)^CUT-FROM-(?:(?:(?!CUT-(?:FROM-|TO)).)++|(?R))*?^CUT-TO

But I advise against doing such things. Prefer the version with the (?(DEFINE) ... ) for readability.
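For comparison, here is a sketch of the one-liner used from PHP, run on input that also has ordinary lines inside the nested blocks. The sample lines and variable names are made up for illustration.

```php
<?php
// One-line (?R) variant from the answer, wrapped in ~ delimiters.
$oneLiner = '~(?sm)^CUT-FROM-(?:(?:(?!CUT-(?:FROM-|TO)).)++|(?R))*?^CUT-TO~';

// Hypothetical input: two nesting levels with filler lines inside them.
$nested = implode("\n", [
    'CUT-FROM-A ...',
    'ignore this,',
    'CUT-FROM-B ...',
    'this,',
    'CUT-TO ...',
    'and this',
    'CUT-TO',
    'apple',
    'CUT-FROM-C ...',
    'CUT-TO',
    'orange',
]);

// Strip the balanced blocks, then keep only the non-empty lines.
$remaining = preg_replace($oneLiner, '', $nested);
$keptLines = preg_split('/\R/', $remaining, -1, PREG_SPLIT_NO_EMPTY);

print_r($keptLines);
```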

Lucas Trzesniewski
  • Yup!! I'm very impressed... I thought **I** was a regex guru... :-( – MarcoS Jan 13 '15 at 10:50
  • This is, like you said, just like a standard brace balancing pattern, except I replaced the braces with your delimiters ;) [Here's](http://stackoverflow.com/a/27828040/3764814) a related answer of mine with some more explanations about this. Once you get it, it's pretty easy. – Lucas Trzesniewski Jan 13 '15 at 11:01
1

Just a thought: instead of replacing, you could match each line you want to keep.

preg_match_all('~^(?!.*CUT-(?:FROM|TO)).+$~mi', $text, $matches);
var_dump($matches[0]);

Output

array(2) {
  [0]=> string(5) "apple"
  [1]=> string(6) "orange"
}
hwnd
  • Great idea! Thanks, this solves my problem... Though it does not perfectly answer the question, so I'll wait some time before accepting it... – MarcoS Jan 13 '15 at 10:27
0

You can do this with a single regex, but you can do it better with a simple script that uses small regexes for smaller tasks.

The idea: parse the text line by line and use a regex to identify the line type. On every 'CUT-FROM' line, add information (the line itself or something else) to a stack (using array_push()). On every 'CUT-TO' line, remove the top element from the stack (using array_pop()).

Process other rows as you need. For example, if you need to ignore the lines between a 'CUT-FROM' and the corresponding 'CUT-TO' line you need to check that the stack is not empty to know that you are inside a pair. If the stack is empty then all the 'CUT-FROM' were paired with 'CUT-TO' lines and you are parsing lines outside of any enclosure.

This approach also provides you a nice way to detect and handle (ignore/fix/report/whatever) the errors in the input text.

Sample program:

$text = <<<END_TEXT
CUT-FROM-A ...
ignore this,
CUT-FROM-B ...
this,
CUT-TO ...
and this
CUT-TO
apple
CUT-FROM-C ...
CUT-TO
orange
END_TEXT;

$lines = explode("\n", $text);


$stack = array();
foreach ($lines as $i => $line) {
    // Check if it's a 'CUT-FROM-' line
    if (preg_match('/^CUT-FROM-/', $line)) {
        array_push($stack, $line);
        continue;
    }

    // Check if it's a 'CUT-TO' line
    if (preg_match('/^CUT-TO/', $line)) {
        if (array_pop($stack) === NULL) {
            // an unpaired 'CUT-TO' was found
            echo("An unpaired 'CUT-TO' was found on line ".($i + 1).". Will ignore it.\n");
        }
        continue;
    }


    // A regular line
    if (count($stack) > 0) {
        // inside a (CUT-FROM, CUT-TO) pair
        // count($stack) tells how many pairs are around this item

        // ignore it

    } else {
        // outside any pair
        echo ($line."\n");
    }
}

// Check if all the 'CUT-FROM' lines were closed
if (count($stack) > 0) {
    echo('Found that '.count($stack)." 'CUT-TO' lines are missing at the end of processing.\n");
}
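
The same idea can also be packaged as a function that returns its findings instead of echoing them. This is only a sketch: the function name extractOutsideCuts() and the shape of the returned array are hypothetical, and the stack stores line numbers (the "something else" mentioned above) so that unclosed 'CUT-FROM' lines can be reported precisely.

```php
<?php
// Hypothetical helper, not part of the original program.
function extractOutsideCuts(string $text): array
{
    $kept = [];    // lines outside every (CUT-FROM, CUT-TO) pair
    $errors = [];  // human-readable problems found in the input
    $stack = [];   // line numbers of the CUT-FROM lines still open

    foreach (preg_split('/\R/', $text) as $i => $line) {
        $lineNo = $i + 1;
        if (preg_match('/^CUT-FROM-/', $line)) {
            $stack[] = $lineNo;               // open a new pair
        } elseif (preg_match('/^CUT-TO/', $line)) {
            if (array_pop($stack) === null) { // nothing left to close
                $errors[] = "Unpaired CUT-TO on line $lineNo";
            }
        } elseif (count($stack) === 0) {
            $kept[] = $line;                  // regular line outside any pair
        }
    }

    // Whatever is still on the stack was never closed.
    foreach ($stack as $openedAt) {
        $errors[] = "CUT-FROM on line $openedAt is never closed";
    }

    return ['lines' => $kept, 'errors' => $errors];
}

$good = extractOutsideCuts("CUT-FROM-A ...\nx\nCUT-FROM-B ...\nCUT-TO ...\nCUT-TO\napple\nCUT-FROM-C ...\nCUT-TO\norange");
$bad = extractOutsideCuts("CUT-TO\npear\nCUT-FROM-Z");

print_r($good);
print_r($bad);
```

Returning the errors alongside the kept lines leaves it to the caller to decide whether malformed input should be ignored, fixed or reported.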
axiac
  • This is the obvious procedural solution, but the question was about *regex*'s... :-) (See answer by Lucas Trzesniewski...) – MarcoS Jan 13 '15 at 10:49
  • Classic `regex` cannot parse this input. Using modern additions like negative lookaheads it can be done, indeed. Depending on the context, you can use a `regex` or a more explicit procedural solution. I think the procedural solution is better if the format will change in the future or when you need to detect errors in the input and recover from them. Otherwise, go for `regex`; it is more compact and runs faster. – axiac Jan 13 '15 at 10:55
  • You are completely right. I really can't decide whether to choose the dirty but fast regex solution or a more robust but more verbose procedural solution like yours... However, sorry, I will accept the best *regex* solution... – MarcoS Jan 13 '15 at 11:00