
I am trying to match a specific string, but only when it's not part of a couple of specific literal strings. I wish to exclude results falling within the literal strings `<span class='highlight'>` and `</span>`. So if I search for "light", "high", "pan", "an", etc., I want to match only the occurrences that are not part of those two literals.

I'm not trying to parse full HTML, only those two strings listed, which will never change. The class value will never change from 'highlight'.

I have tried every kind of lookaround, capturing group, non-capturing group, etc. that I can think of and have come up with nothing. Lookarounds don't seem to work, I'm betting because the position of the search string relative to the literals to be excluded is not guaranteed to follow any particular order.

Is this possible with only regex?

HotN
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) –  Feb 09 '17 at 21:02
  • @Jack Maney I think this falls under the second answer to the linked question: http://stackoverflow.com/a/1733489/764371. Its point that "...it's sometimes appropriate to parse a limited, known set of HTML" applies here. I'm interested in excluding matches that happen within the literal strings `<span class='highlight'>` and `</span>`. I don't care about any other strings or HTML tags. The class name will never change either. – HotN Feb 09 '17 at 21:47
  • Updated question to try to clarify what I'm trying to match without it looking like full HTML parsing. – HotN Feb 09 '17 at 21:55
  • Other than within those tags, are you looking for the string to search anywhere in particular? Only between those tags? Only outside those tags? Or just not within? Are there other tags, and do you want to match those or not? – jcaron Feb 09 '17 at 22:05

2 Answers


Would this method work for you?

  1. Search-and-replace those two tags with the empty string:

    s/(<span class='highlight'>|<\/span>)//g
    
  2. Search for your string

Of course, you might end up with your search string spanning one of those removed bits, e.g. searching for abcd and matching ab</span>cd. You could get around that by replacing the tags with some character sequence that you are sure can never be searched for, instead of the empty string.

You'll also lose track of where the string you're looking for sits relative to those tags, but without knowing exactly what you're trying to achieve, it's difficult to say whether that matters to you.
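The two steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `find_outside_tags` and the placeholder character `\x00` are hypothetical choices, and the placeholder is assumed never to occur in the text or in a search term.

```python
import re

# The two literal strings to exclude, exactly as given in the question.
TAG_PATTERN = re.compile(r"<span class='highlight'>|</span>")

# Assumption: this character never appears in the text or in a search term.
PLACEHOLDER = "\x00"


def find_outside_tags(text, term):
    """Return start offsets of `term` after the two tags are replaced.

    Offsets refer to the tag-stripped string, not to the original text,
    so the position relative to the tags is lost.
    """
    # Step 1: replace both tags with the placeholder (not the empty string),
    # so a match cannot span the spot where a tag used to be.
    stripped = TAG_PATTERN.sub(PLACEHOLDER, text)
    # Step 2: search for the term literally in what is left.
    return [m.start() for m in re.finditer(re.escape(term), stripped)]


sample = "Look <span class='highlight'>into</span> the light"
print(find_outside_tags(sample, "light"))  # only the word outside the tags
print(find_outside_tags(sample, "span"))   # nothing left to match
```

Because the offsets point into the stripped string, this sketch is only useful when you need to know whether (or how often) the term occurs, not where it sits in the original markup.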

jcaron
  • I thought I was providing all relevant details in my question, but I was wrong. Replacing a repeated regex replace with a single one turned out to be my solution. Thank you for trying to help though! – HotN Feb 10 '17 at 15:42

Oops, I thought I was properly simplifying my question, but it turns out I was wrong. I inherited code that took a string and did a regex replace for each search term in a list, looping through the terms one at a time and wrapping every match in <span class="highlight"></span>. As a result, a phrase like "Look into the light" came out wrong if you searched for "the light": "the" was matched and wrapped first, then "light" was matched, including inside the tag that had just been inserted for "the". The trick wasn't to fix the regex that was run for each individual word, but to replace the loop with a single regex that processes all of the terms together. Rather than running a regex replace with the and then another with light, the regex just needed to be the|light.
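A minimal Python sketch of that single-pass idea is below. It is not the inherited code (which isn't shown here); the function name `highlight` and the choice to sort terms longest-first are assumptions made for the illustration.

```python
import re


def highlight(text, terms):
    """Wrap every occurrence of any term in a highlight span, in one pass."""
    # Ignore empty terms; an empty alternative would match at every position.
    terms = [t for t in terms if t]
    if not terms:
        return text
    # Assumption: try longer terms first so a short term does not pre-empt
    # a longer one that starts at the same position.
    ordered = sorted(terms, key=len, reverse=True)
    # Escape each term so it is matched literally, then join with "|"
    # to get a pattern equivalent to the|light.
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    # A single substitution pass: text inserted by the replacement is never
    # scanned again, so "light" cannot match inside class="highlight".
    return pattern.sub(
        lambda m: '<span class="highlight">' + m.group(0) + "</span>", text
    )


print(highlight("Look into the light", ["the", "light"]))
```

The key design point is that `re.sub` walks the input once and never rescans its own output, which is what the one-term-at-a-time loop got wrong.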

HotN