How to split look-ahead regex into 2 plain regexes?

Question

I have a look-ahead regex [^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*]). In my test it extracts 4 substrings from @@||imasdk.googleapis.com/js/core/bridge*.html:

|imasdk
.googleapis
.com
/core

I need to rewrite it as 2 plain regexes because I can't use look-aheads (the regex engine doesn't support them). I've split it into [^a-z0-9%*][a-z0-9%]{3,} and [^a-z0-9%*]; for each match of the first regex, the second one is checked against the substring that follows the match.

For some reason it extracts /bridge too, because . is not in the set a-z0-9%* and occurs somewhere after /bridge. So how does the look-ahead work: does it have to be a full match, a substring match (a find result), or something else? Does it mean every character of the ending is expected to be outside the set a-z0-9%* in this case?

In Rust the code looks as follows:

    lazy_static! {
        // WARNING: the original regex is `"[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])"` but Rust's regex
        // does not support look-around, so we have to check it programmatically for the last match
        static ref REGEX: Regex = Regex::new(r###"[^a-z0-9%*][a-z0-9%]{3,}"###).unwrap();
        static ref LOOKAHEAD_REGEX: Regex = Regex::new(r###"[^a-z0-9%*]"###).unwrap();
    }

    let pattern_lowercase = pattern.to_lowercase();
    
    let results = REGEX.find_iter(&pattern_lowercase);
    for (is_last, each_candidate) in results.identify_last() {
        let mut candidate = each_candidate.as_str();
        if !is_last {
            // have to simulate the positive look-ahead check programmatically
            let ending = &pattern_lowercase[each_candidate.end()..]; // substr after the match
            println!("searching in {:?}", ending);
            let lookahead_match = LOOKAHEAD_REGEX.find(ending);
            if lookahead_match.is_none() {
                // did not find anything => look-ahead is NOT positive
                println!("NO look-ahead match!");
                break;
            } else {
                println!("found look-ahead match: {:?}", lookahead_match.unwrap().as_str());
            }
        }
         ...

test output:

"|imasdk":
searching in ".googleapis.com/js/core/bridge*.html"
found look-ahead match: "."
".googleapis":
searching in ".com/js/core/bridge*.html"
found look-ahead match: "."
".com":
searching in "/js/core/bridge*.html"
found look-ahead match: "/"
"/core":
searching in "/bridge*.html"
found look-ahead match: "/"
"/bridge":
searching in "*.html"
found look-ahead match: "."

^ Here you can see that /bridge is accepted because of the . that appears later in the ending, which is incorrect.

How about using `[^a-z0-9%*][a-z0-9%]{3,}[^a-z0-9%*]` and stripping off the last character of the match? — Sven Marnach, Mar 30 '21 at 09:47
it seems to be not equal to positive lookahead meaning. I might be wrong but i understand lookahead as "1 character not from the range ... anywhere in the ending after the match" and in your case it is expected to follow right after the match — 4ntoine, Mar 30 '21 at 10:59
No, lookahead is "1 character not from the range right after the match" (otherwise your regex101 test would find `/bridge` because of the `.` in the "ending after the match"). — Jmb, Mar 30 '21 at 11:39
If you want to keep a two-regexp approach, your second expression should be `^[^a-z0-9%*]`. — Jmb, Mar 30 '21 at 11:41
Is there any advantage to using two regexes in this case? It's way more complex than simply stripping off the last character. — Sven Marnach, Mar 30 '21 at 14:27
It's not because of some benefit, it's rather a workaround - because look-arounds are not supported by the most popular regex engine in Rust — 4ntoine, Mar 30 '21 at 20:18
@4ntoine I'm aware. Using the regex I suggested in my first comment and then stripping off the last character is also a workaround for the same problem. It's just a lot simpler to code. — Sven Marnach, Apr 01 '21 at 22:22

score 1 · Accepted Answer · answered Mar 30 '21 at 14:28

Your LOOKAHEAD_REGEX looks for a character not in the range in any position after the match, but the original regex with lookahead only looks at the single character immediately after the match. This is why your code finds /bridge and regex101 doesn't: your code sees the . somewhere after the match whereas regex101 only looks at the *.
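
To make these semantics concrete, here is a dependency-free sketch that scans the string by hand instead of using a regex engine. It is only an illustration of what the original pattern means, not code from the question; the names is_token_char, is_separator and extract_by_hand are made up for this example.

```rust
/// True for characters in the token class `[a-z0-9%]`.
fn is_token_char(c: char) -> bool {
    c.is_ascii_lowercase() || c.is_ascii_digit() || c == '%'
}

/// True for characters in the separator class `[^a-z0-9%*]`.
fn is_separator(c: char) -> bool {
    !is_token_char(c) && c != '*'
}

/// Hand-written equivalent of `[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])`.
fn extract_by_hand(input: &str) -> Vec<&str> {
    let chars: Vec<(usize, char)> = input.char_indices().collect();
    let mut found = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        if !is_separator(chars[i].1) {
            i += 1;
            continue;
        }
        // Greedily consume the run of token characters after the separator.
        let mut j = i + 1;
        while j < chars.len() && is_token_char(chars[j].1) {
            j += 1;
        }
        let run_len = j - (i + 1);
        // The look-ahead inspects exactly one character, chars[j]:
        // it must exist and be a separator, and it is not consumed.
        if run_len >= 3 && j < chars.len() && is_separator(chars[j].1) {
            found.push(&input[chars[i].0..chars[j].0]);
            i = j; // the same separator may start the next match
        } else {
            i += 1;
        }
    }
    found
}

fn main() {
    let tokens = extract_by_hand("@@||imasdk.googleapis.com/js/core/bridge*.html");
    // `/bridge` is rejected because the very next character is `*`.
    println!("{:?}", tokens); // ["|imasdk", ".googleapis", ".com", "/core"]
}
```

Note that .html is rejected as well: the look-ahead needs one more character after the match, and there is none at the end of the string.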

You can fix your code by anchoring LOOKAHEAD_REGEX so that it will only look at the first character: ^[^a-z0-9%*].

Alternatively, as suggested by @Sven Marnach, you can use a single regex matching the full expression: [^a-z0-9%*][a-z0-9%]{3,}[^a-z0-9%*], and strip the last character of the match.

@Sven Marnach suggestion as-is is incorrect as separator char (`^a-z0-9%*`) can be either ending of the first match or beginning of the second match. Eg. in [`/asdf/1234^`](https://regex101.com/r/Kzff3K/1) `1234` will not be extracted — 4ntoine, Apr 01 '21 at 10:47
