I have a (strange) string like:

EREF+012345678901234MREF+ABCDEF01234567890123CRED+DE12ABC01234567890SVWZ+ABCEDFG HIJ 01234567890 123,45ABWA+ABCDEFGHIJKLMNOPQR

The pattern I need to look for can only be defined by keywords: EREF+, MREF+, CRED+ and others. I know there are 19 keywords, but a given string may contain any subset of them. I don't know whether the order stays the same. From what I can tell, EREF+ will most likely be the first keyword, but the order may well differ. I also don't know which of the 19 keywords will be the last one in the string, as that may change from case to case.

My first approach was to just use explode() twice, with keyword 1 and keyword 2 – but if the keywords change order (and I cannot guarantee they don't) I would have to go through all possible combinations.

Anyway, here's the first (working) code I used:

<?php 

$string = "EREF+012345678901234MREF+ABCDEF01234567890123CRED+DE12ABC01234567890SVWZ+ABCEDFG HIJ 01234567890 123,45ABWA+ABCDEFGHIJKLMNOPQR";

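// Returns $start plus the text that follows it, up to (not including) $end.
// Returns an empty string if $start does not occur in $content.
function getBetween($content, $start, $end) {
    // Everything after the first occurrence of $start ends up in $r[1].
    $r = explode($start, $content);
    if (isset($r[1])) {
        // Cut that remainder at the first occurrence of $end.
        $r = explode($end, $r[1]);
        return $start . $r[0];
    }
    return '';
}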

$start = "EREF+";
$end = "MREF+";
$output = getBetween($string,$start,$end);
echo $output;

?>

So now I am looking into regex to come up with a solution that extracts a substring between two keywords, where any of the keywords can be the start delimiter while any other keyword may be the end delimiter.

Since there are literally thousands of regex questions around, I took some time and tried to adapt other solutions, but without success so far. I must confess regex is voodoo to me and I cannot seem to remember the patterns for more than a minute. I found this thread, which is pretty close to what I am trying to achieve, and tried a few tweaks, but I cannot get it to work properly.

Here's my code so far:

<?php 

$string = "EREF+012345678901234MREF+ABCDEF01234567890123CRED+DE12ABC01234567890SVWZ+ABCEDFG HIJ 01234567890 123,45ABWA+ABCDEFGHIJKLMNOPQR";

$matches = array();
$keywords = ['EREF+', 'MREF+', 'CRED+', 'SVWZ+', 'ABWA+'];
$pattern = sprintf('/(?:%s):(.*?)/', join('|', array_map(function($keyword) {
    return preg_quote($keyword, '/');
}, $keywords)));

preg_match_all($pattern, $string, $matches);

print_r($matches);

?>

The constructed pattern looks like this:

/(?:EREF\+|MREF\+|CRED\+|SVWZ\+|ABWA\+):(.*?)/

Can anyone advise please? Any help appreciated!

Thanks

larsgrau
  • Do you need to know which keyword caused the split? Maybe `preg_split`? https://eval.in/656629 – chris85 Oct 06 '16 at 20:52
  • You are right, I actually do need to know which keyword caused the split. Didn't think about that yet. – larsgrau Oct 06 '16 at 21:08
  • +1 for the `preg_split` approach. With the help of [this comment here](http://stackoverflow.com/a/2938159/6934045) and [that comment there](http://stackoverflow.com/a/11758732/6934045) I've managed to fork your code to include the keywords that caused the split as keys in an associative array: [eval.in/656679](https://eval.in/656679) – larsgrau Oct 06 '16 at 21:46
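
The `preg_split` idea from the comments above can be sketched roughly as follows. This is not necessarily the code behind the eval.in links; the function name `splitByKeywords` is made up for illustration, and the sketch assumes that each keyword occurs at most once per string (a repeated keyword would overwrite the earlier value).

```php
<?php
// Hypothetical helper: split $input on any of the keywords and return
// an associative array of keyword => text that follows it.
function splitByKeywords(string $input, array $keywords): array
{
    // Escape each keyword so that "+" is matched literally.
    $alternation = implode('|', array_map(function ($keyword) {
        return preg_quote($keyword, '/');
    }, $keywords));

    // The capturing group plus PREG_SPLIT_DELIM_CAPTURE keeps the keywords
    // in the result; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $parts = preg_split(
        '/(' . $alternation . ')/',
        $input,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $result = [];
    $current = null;
    foreach ($parts as $part) {
        if (in_array($part, $keywords, true)) {
            // A keyword starts a new entry.
            $current = $part;
            $result[$current] = '';
        } elseif ($current !== null) {
            // Anything else belongs to the most recent keyword.
            $result[$current] .= $part;
        }
    }
    return $result;
}

$string = "EREF+012345678901234MREF+ABCDEF01234567890123CRED+DE12ABC01234567890SVWZ+ABCEDFG HIJ 01234567890 123,45ABWA+ABCDEFGHIJKLMNOPQR";
$keywords = ['EREF+', 'MREF+', 'CRED+', 'SVWZ+', 'ABWA+'];

$fields = splitByKeywords($string, $keywords);
print_r($fields); // e.g. $fields['EREF+'] is '012345678901234'
```

Because the keywords are found wherever they occur, their order in the string does not matter, and the last keyword needs no special handling.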

1 Answer

You can use this regex:

/(?<=EREF\+|MREF\+|CRED\+|SVWZ\+|ABWA\+)(.+?)(?=EREF\+|MREF\+|CRED\+|SVWZ\+|ABWA\+|$)/

It will match the strings between defined keywords.

(?<=EREF\+|MREF\+|CRED\+|SVWZ\+|ABWA\+) # lookbehind: a keyword must come directly before the match
(.+?) # capture one or more characters, non-greedy
(?=EREF\+|MREF\+|CRED\+|SVWZ\+|ABWA\+|$) # lookahead: stop before the next keyword or at the end of the string
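
As a minimal sketch of how this pattern could be used from PHP, the snippet below builds the alternation from the keyword list in the question (so all 19 keywords could be added in one place) and runs `preg_match_all`. The variable names are only illustrative. Note that `.+?` needs at least one character, so this assumes no keyword is immediately followed by another keyword.

```php
<?php
$string = "EREF+012345678901234MREF+ABCDEF01234567890123CRED+DE12ABC01234567890SVWZ+ABCEDFG HIJ 01234567890 123,45ABWA+ABCDEFGHIJKLMNOPQR";
$keywords = ['EREF+', 'MREF+', 'CRED+', 'SVWZ+', 'ABWA+'];

// Escape the keywords and join them into one alternation: EREF\+|MREF\+|...
$alternation = implode('|', array_map(function ($keyword) {
    return preg_quote($keyword, '/');
}, $keywords));

// Lookbehind for a keyword, lazy capture, lookahead for the next keyword or the end.
$pattern = '/(?<=' . $alternation . ')(.+?)(?=' . $alternation . '|$)/';

preg_match_all($pattern, $string, $matches);

// $matches[1] holds the values only, in the order they appear in the string.
print_r($matches[1]);
```

For the sample string this yields five values, the first being `012345678901234` and the last `ABCDEFGHIJKLMNOPQR`. The keywords themselves are not part of the result, since lookarounds do not consume text.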

Regex101

Edit: If you want to know what keyword caused the split you can use this regex:

/((?:EREF\+|MREF\+|CRED\+|SVWZ\+|ABWA\+))(.+?)(?=EREF\+|MREF\+|CRED\+|SVWZ\+|ABWA\+|$)/

It will capture the first keyword and the text between keywords.
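
A possible way to turn those two capture groups into a keyword => value array is sketched below. The pattern is built from the keyword list in the question (without the redundant inner group); `$fields` and the other variable names are illustrative, and a keyword that appears twice would overwrite its earlier value.

```php
<?php
$string = "EREF+012345678901234MREF+ABCDEF01234567890123CRED+DE12ABC01234567890SVWZ+ABCEDFG HIJ 01234567890 123,45ABWA+ABCDEFGHIJKLMNOPQR";
$keywords = ['EREF+', 'MREF+', 'CRED+', 'SVWZ+', 'ABWA+'];

// Escape the keywords and join them into one alternation.
$alternation = implode('|', array_map(function ($keyword) {
    return preg_quote($keyword, '/');
}, $keywords));

// Group 1: the keyword. Group 2: the text up to the next keyword or the end.
$pattern = '/(' . $alternation . ')(.+?)(?=' . $alternation . '|$)/';

// PREG_SET_ORDER gives one array per match: [full match, keyword, value].
preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);

$fields = [];
foreach ($matches as $match) {
    $fields[$match[1]] = $match[2];
}

print_r($fields); // e.g. $fields['MREF+'] is 'ABCDEF01234567890123'
```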

Live sample

Leonardo Xavier
  • Thanks for the quick response! Unfortunately, that will not match the last occurrence of a keyword, in my example `ABWA+`. Any idea how to deal with that? – larsgrau Oct 06 '16 at 21:06
  • The last one isn't between keywords, but you can add `$` as an alternative in the lookahead. I'll update the answer. – Leonardo Xavier Oct 06 '16 at 21:08
  • Wow, works like a charm! Thanks! @chris85 brought up a point that I had missed. How do I know which of the keywords actually caused the split? As far as I can see, this is not possible with the regex, right? – larsgrau Oct 06 '16 at 21:14
  • I included the keyword that caused the split in the answer – Leonardo Xavier Oct 06 '16 at 22:09