I have a number of links like:

  • https%3A%2F%2Fwww.facebook.com%2F
  • https%3A%2F%2Fwww.facebook.com%2F%3Futm_source

I need to capture the text before the %3F sequence, or the whole line if this sequence does not appear in it. I want to do this without applying an if-else condition to the whole line.

What I am looking for is a way to apply the ? quantifier to a whole character sequence, like this: ^(.*)[\%3F]?

P.S. I know there is a workaround: first transform the percent-encoded sequences into single characters (%2F -> "/" and %3F -> "?") and then apply the ? quantifier to a single character. However, this is not how I would like to solve the issue.

1 Answer

You may use

^(?:(?!%3F).)*

that will yield the same results as the following expression:

^.*?(?=%3F|$)

but the most efficient among these is their unrolled counterpart

^[^%]*(?:%(?!3F)[^%]*)*

See the regex demo

Details

  • ^ - start of string
  • (?:(?!%3F).)* - (a tempered greedy token) any char other than a line break char (.), zero or more consecutive occurrences, as many as possible (*), where no matched char starts a %3F char sequence
  • .*?(?=%3F|$) - any zero or more chars other than line break chars, as few as possible (.*?), up to but excluding the first %3F substring, or up to the end of string ($) if there is none.
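
As a quick check of the claim that the first two patterns yield the same result, here is a minimal sketch in Python's re module (the choice of Python and the helper name `prefix` are assumptions for illustration; the question does not name a regex flavor). The sample strings are the two links from the question.

```python
import re

# The two sample links from the question.
links = [
    "https%3A%2F%2Fwww.facebook.com%2F",
    "https%3A%2F%2Fwww.facebook.com%2F%3Futm_source",
]

# Tempered greedy token: consume a char only if %3F does not start there.
TEMPERED = re.compile(r"^(?:(?!%3F).)*")
# Lazy dot with a lookahead: stop right before %3F or at the end of string.
LAZY = re.compile(r"^.*?(?=%3F|$)")


def prefix(pattern, text):
    """Return the part of text that pattern matches at the start (hypothetical helper)."""
    # Both patterns can match an empty string, so match() never returns None here.
    return pattern.match(text).group(0)


for link in links:
    # Both patterns should agree on every input.
    assert prefix(TEMPERED, link) == prefix(LAZY, link)
    print(prefix(TEMPERED, link))
```

Both links should print `https%3A%2F%2Fwww.facebook.com%2F`: the first has no %3F, so the whole line is matched, and the second is cut right before %3F. No if-else on the line is needed because both patterns also match when %3F is absent.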

The ^[^%]*(?:%(?!3F)[^%]*)* pattern follows the unroll-the-loop principle: [^%]* matches zero or more chars other than %, and (?:%(?!3F)[^%]*)* matches zero or more sequences of a % that is not followed by 3F and then zero or more chars other than %. Since the lookahead is only checked at a %, the performance is much better as long as the string is not overpopulated with % symbols (which should not be the case in the real world).
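
The unrolled pattern can be exercised the same way. The sketch below again assumes Python's re module, and `before_query` is a hypothetical helper name. One difference worth keeping in mind: [^%] also matches line break chars, while . does not by default, so the patterns only agree on single-line input.

```python
import re

# Unrolled form: runs of non-% chars, plus any % that does not begin %3F.
UNROLLED = re.compile(r"^[^%]*(?:%(?!3F)[^%]*)*")
# Tempered greedy token from the answer, used here only as a reference.
TEMPERED = re.compile(r"^(?:(?!%3F).)*")


def before_query(text):
    """Return the text before the first %3F, or all of text if there is none."""
    # The pattern can match an empty string, so match() never returns None.
    return UNROLLED.match(text).group(0)


samples = [
    "https%3A%2F%2Fwww.facebook.com%2F",
    "https%3A%2F%2Fwww.facebook.com%2F%3Futm_source",
    "no-percent-signs",
    "a%3",       # % followed by an incomplete sequence is kept
    "%3Fstart",  # %3F at the very start leaves an empty match
]

for s in samples:
    # On single-line input the unrolled pattern agrees with the tempered one.
    assert before_query(s) == TEMPERED.match(s).group(0)
    print(repr(before_query(s)))
```

The match is case-sensitive, so a lowercase %3f would not stop it; passing re.IGNORECASE to re.compile is one way to accept both spellings if the input may contain them.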

Wiktor Stribiżew