I am trying to match KEYWORD1 with the .NET regex engine, but only when KeyWord2 is also present in the string. So far I am using this positive lookahead:

(?=.*KeyWord2)KEYWORD1 (flags: m, i)

RegEx Test Link

It only matches KEYWORD1 when KeyWord2 appears somewhere after it in the string. How can I change the regex so that it matches all instances of KEYWORD1, regardless of whether KeyWord2 appears before it, after it, or both?
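The limitation can be reproduced with a small TypeScript sketch. JavaScript's regex engine handles this lookahead the same way; the two sample strings and the variable names are made up for illustration and are not taken from the test link. The lookahead only scans to the right of each candidate position, so a KEYWORD1 that follows KeyWord2 is missed.

```typescript
// Lookahead-only pattern from the question, with global and case-insensitive flags.
const lookaheadOnly = new RegExp("(?=.*KeyWord2)KEYWORD1", "gi");

// Hypothetical sample strings.
const keywordFirst = "KEYWORD1 comes first, KeyWord2 later";
const keywordLast = "KeyWord2 comes first, KEYWORD1 later";

// With the g flag, String.prototype.match returns all matches, or null if there are none.
const hitsKeywordFirst = keywordFirst.match(lookaheadOnly) || [];
const hitsKeywordLast = keywordLast.match(lookaheadOnly) || [];

console.log(hitsKeywordFirst.length); // 1: KeyWord2 lies to the right of KEYWORD1
console.log(hitsKeywordLast.length);  // 0: nothing to the right satisfies the lookahead
```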

I'd really appreciate some insight.

Thank You

BWEL
  • Dear Mandy8055, thank you for your interest. If you click on the RegEx Test Link provided via rubular.com, you will notice the failed cases. Thank you – BWEL May 16 '20 at 05:07
  • Dear Mandy8055 thank you for your contribution. Unfortunately the .NET RegEx proposed by you, would also select KEYWORD1 in a string without KeyWord2. The idea is to select all instances of KEYWORD1 only when the string has KeyWord2 present at any position before or behind KEYWORD1. Thank you – BWEL May 16 '20 at 05:21
  • 1
    Dear Mandy8055 thank you so much for your contribution. Your solution solved the puzzle and it works flawlessly in the .NET regex engine. – BWEL May 16 '20 at 05:43
  • 2
    You can use: (?=.*KeyWord2).*(KEYWORD1) then 'KEYWORD1' will always be in group1. – Poul Bak May 16 '20 at 05:44
  • 2
    @Mandy8055, yes, you're right about my answer being misposted on the wrong question. I deleted it. I was surprised that someone gave it a downvote (update: 2 downvotes) when it obviously wasn't intended for this question. – Cary Swoveland May 16 '20 at 05:52
  • 2
    @CarySwoveland I feel bad why people downvote so early for any mistakes. This should be avoided somehow –  May 16 '20 at 05:53

2 Answers

3

You can use the regex below for your requirement:

\bKEYWORD1\b(?:(?<=\bKeyWord2\b.*?)|(?=.*?\bKeyWord2\b))

Explanation of the above Regular Expression:

gi - The flags used: g (global, return every match) and i (case-insensitive, so differences in letter case are ignored).

\b - Represents a word boundary.

(?:) - Represents a non-capturing group.

(?=.*?\bKeyWord2\b) - A positive lookahead; it succeeds for every KEYWORD1 that has KeyWord2 somewhere after it, reading from left to right.

| - Alternation; the match succeeds if either the lookbehind or the lookahead succeeds. Both alternatives sit inside the non-capturing group.

(?<=\bKeyWord2\b.*?) - A positive lookbehind of unbounded width (the lazy quantifier .*? has no fixed length); it succeeds for every KEYWORD1 that has KeyWord2 somewhere before it.

You can find the above regex demo here.
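As a quick check of the pattern, here is a minimal TypeScript sketch. It assumes a JavaScript engine with lookbehind support (ES2018 or later), which accepts the same variable-width lookbehind as .NET; in .NET itself the equivalent would be Regex.Matches with RegexOptions.IgnoreCase. The helper name findKeyword1 is hypothetical, and the first sample string is the one used in the other answer on this page.

```typescript
// The pattern from this answer: match KEYWORD1 if KeyWord2 occurs before it
// (lookbehind) or after it (lookahead). Built from a string, so backslashes are doubled.
const bothDirections = new RegExp(
  "\\bKEYWORD1\\b(?:(?<=\\bKeyWord2\\b.*?)|(?=.*?\\bKeyWord2\\b))",
  "gi"
);

// Hypothetical helper: returns every KEYWORD1 match found in the input.
function findKeyword1(input: string): string[] {
  return input.match(bothDirections) || [];
}

const sampleWithBoth = "The KEYWORD1 before KeyWord2 then KEYWORD1 and KEYWORD1 again";
console.log(findKeyword1(sampleWithBoth).length);       // 3
console.log(findKeyword1("only KEYWORD1 here").length); // 0, KeyWord2 is absent
```

The first KEYWORD1 is accepted by the lookahead and the other two by the lookbehind, which is why all three are returned.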

NOTE - For the record, these engines support infinite lookbehind:

As far as I know, they are the only ones.

Community
  • 1
    Also it could be shortened to `KEYWORD1(?:(?<=Keyword2.*?)|(?=.*?Keyword2))`. Maybe a better choice. – revo May 16 '20 at 06:23
  • 1
    @CarySwoveland it matches; Please add `-i` flag and check once. [**Here**](https://regex101.com/r/OHSN3T/4) is the sample run. Please let me know if some other problem persists –  May 16 '20 at 07:10
  • 1
    Excellent answer! Very instructive. (Yes, I forgot the case indifferent flag, or put another way, I had written `"Keyword2"` rather than `"KeyWord2"`.) – Cary Swoveland May 16 '20 at 07:14
  • 2
    A small point: you really don't need "Edit:" in answers. If you circulated a draft of a paper, obtained useful suggestions and incorporated some in a revision, you wouldn't include "Edit:" because you would be trying to write a text that would be most useful to readers, rather than telling a story of how you got to your answer. I think the same applies to SO answers. Put another way, why emphasize an edit? I personally use comments for thank you's and credits, and sometimes for reporting what I had written previously. Again, that's to keep the answer focused on just answering the question. – Cary Swoveland May 16 '20 at 07:37
  • 1
    I edited my answer @CarySwoveland. I added the **word breaks** too. Initially I used to think that all the people who help in a great wiki answer should be given the credit. That is the only reason I mentioned the same. –  May 16 '20 at 07:43
  • I agree, but just think a comment is the best place to do that. Others disagree, however. – Cary Swoveland May 16 '20 at 07:56
  • 2
    @revo, nice one! – Cary Swoveland May 16 '20 at 18:20
0

If one uses a regex engine that supports \G and \K, the following regular expression could be used.

^(?=.*\bKeyWord2\b)|\G.*?\K\bKEYWORD1\b

with the case-insensitive flag set and, depending on requirements, the multiline flag as well.

PCRE demo

With PCRE (PHP) and some other regex engines, the anchor \G matches at the end of the previous match. For the first match attempt, \G is equivalent to \A, matching the start of the string. See this discussion for details.

\K resets the starting point of the reported match to the current position of the engine's internal string pointer. Any previously consumed characters are not included in the final match. In effect, \K causes the engine to "forget" everything matched up to that point. Details can be found here.

As shown at the link, there are four matches in the string

The KEYWORD1 before KeyWord2 then KEYWORD1 and KEYWORD1 again

The matches are an empty string at the beginning of the string, followed by each of the three instances of KEYWORD1. In fact, whenever a string contains KeyWord2, one of the matches is an empty string at the beginning of the string. Empty matches must therefore be disregarded when making substitutions.
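For engines without \G and \K, a similar result can be had in two steps: test once for KeyWord2, then collect every KEYWORD1. The TypeScript sketch below is not the \G/\K technique itself, only a substitute that gives the same matches for a single-line string. The function name keyword1IfKeyword2 is hypothetical.

```typescript
// Hypothetical helper: returns all KEYWORD1 matches, but only when KeyWord2
// occurs somewhere in the input. JavaScript regexes support neither \G nor \K,
// so the presence check is done as a separate step.
function keyword1IfKeyword2(input: string): string[] {
  const hasKeyword2 = /\bKeyWord2\b/i.test(input);
  if (!hasKeyword2) {
    return [];
  }
  return input.match(/\bKEYWORD1\b/gi) || [];
}

const fromAnswer = "The KEYWORD1 before KeyWord2 then KEYWORD1 and KEYWORD1 again";
console.log(keyword1IfKeyword2(fromAnswer).length);                        // 3
console.log(keyword1IfKeyword2("KEYWORD1 without the other word").length); // 0
```

Because the presence check is separate, this version produces no empty match that would need to be discarded. It treats the whole input as one unit, so line-by-line behaviour would require splitting the input first.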

Cary Swoveland