I have a look-ahead regex `[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])`. In my test it extracts 4 substrings from `@@||imasdk.googleapis.com/js/core/bridge*.html`:
|imasdk
.googleapis
.com
/core
I need to rewrite it as two plain regexes because I can't use look-aheads (they are not supported by the regex engine). I've split it into `[^a-z0-9%*][a-z0-9%]{3,}` and `[^a-z0-9%*]`, and for each match of the first regex the second one is searched for in the substring after the match.
For some reason it also extracts `/bridge`, since `.` is not among the characters excluded by `[^a-z0-9%*]` and occurs after `/bridge`. So how does the look-ahead work: does it have to be a full match, a substring match (a `find` result), or something else? Does it mean that every trailing character is expected to be outside the set `a-z0-9%*` in this case?
In Rust the code looks as follows:
lazy_static! {
    // WARNING: the original regex is `"[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])"` but Rust's regex
    // does not support look-around, so we have to check it programmatically for the last match
    static ref REGEX: Regex = Regex::new(r###"[^a-z0-9%*][a-z0-9%]{3,}"###).unwrap();
    static ref LOOKAHEAD_REGEX: Regex = Regex::new(r###"[^a-z0-9%*]"###).unwrap();
}
let pattern_lowercase = pattern.to_lowercase();
let results = REGEX.find_iter(&pattern_lowercase);
for (is_last, each_candidate) in results.identify_last() {
    let mut candidate = each_candidate.as_str();
    if !is_last {
        // have to simulate the positive look-ahead check programmatically
        let ending = &pattern_lowercase[each_candidate.end()..]; // substr after the match
        println!("searching in {:?}", ending);
        // note: `find` searches anywhere in `ending`, not only at its first character
        let lookahead_match = LOOKAHEAD_REGEX.find(ending);
        if lookahead_match.is_none() {
            // did not find anything => look-ahead is NOT positive
            println!("NO look-ahead match!");
            break;
        } else {
            println!("found look-ahead match: {:?}", lookahead_match.unwrap().as_str());
        }
    }
    ...
Test output:
"|imasdk":
searching in ".googleapis.com/js/core/bridge*.html"
found look-ahead match: "."
".googleapis":
searching in ".com/js/core/bridge*.html"
found look-ahead match: "."
".com":
searching in "/js/core/bridge*.html"
found look-ahead match: "/"
"/core":
searching in "/bridge*.html"
found look-ahead match: "/"
"/bridge":
searching in "*.html"
found look-ahead match: "."
^ Here you can see that `/bridge` is accepted because of the `.` that follows later in the string, which is incorrect.
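To pin down the behaviour I expect: as far as I understand, a look-ahead is a zero-width assertion evaluated at the position immediately after the match, so only the single next character is inspected and nothing is consumed. Below is a minimal sketch of that reading. It uses only the standard library instead of the `regex` crate so that the semantics are spelled out, and the helper names `is_body`, `is_delimiter` and `extract_tokens` are hypothetical, not part of my real code.

```rust
/// Characters allowed inside a token: the class `[a-z0-9%]`.
fn is_body(c: char) -> bool {
    c.is_ascii_lowercase() || c.is_ascii_digit() || c == '%'
}

/// Characters matched by `[^a-z0-9%*]`: anything that is neither a body
/// character nor `*`.
fn is_delimiter(c: char) -> bool {
    !is_body(c) && c != '*'
}

/// Hand-written emulation of `[^a-z0-9%*][a-z0-9%]{3,}(?=[^a-z0-9%*])`.
/// Assumption: the look-ahead inspects exactly one character, the one
/// right after the candidate, and does not consume it.
fn extract_tokens(text: &str) -> Vec<&str> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        // A candidate has to start with a delimiter character.
        if !is_delimiter(chars[i].1) {
            i += 1;
            continue;
        }
        // Greedily take the run of body characters after the delimiter.
        let mut j = i + 1;
        while j < chars.len() && is_body(chars[j].1) {
            j += 1;
        }
        if j - (i + 1) >= 3 {
            // Look-ahead: the character at `j` must exist and be a delimiter.
            if j < chars.len() && is_delimiter(chars[j].1) {
                tokens.push(&text[chars[i].0..chars[j].0]);
            }
            // The character at `j` is not consumed, so it may start the next token.
            i = j;
        } else {
            i += 1;
        }
    }
    tokens
}
```

Under this reading, `extract_tokens("@@||imasdk.googleapis.com/js/core/bridge*.html")` returns the four expected substrings: `/bridge` is rejected because the very next character is `*`, and `.html` is rejected because nothing follows it.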