4

I'm trying to find certain code portions in a Visual Studio 2013 project. I'm using the RegEx search function for that (I check "Use Regular Expressions" under Search Options).

More specifically, I'm trying to find the string "findthis" (without quotes) that lies between an opening and a closing script tag. The RegEx should be able to match the string across multiple lines.

Example:

<html>
    <head>
        <script>
            var x = 1;

            if (x < 1) {
                x = 100;
            }

            var y = 'findthis'; // Should be matched
        </script>
    </head>
    <body>
        <script>
            var a = 2;
        </script>

        <h1>Welcome!</h1>
        <p>This findthis here should not be matched.</p>

        <script>
            var b = 'findthis too'; // Should be matched, too.
        </script>

        <div>
            <p>This findthis should not be matched neither.</p>
        </div>
    </body>
</html>

What I've tried so far is the following (the (?s) makes the dot match line breaks as well, so the pattern can span multiple lines):

(?s)\<script\>.*?(findthis).*?\</script\>

The problem here is that it does not stop searching for "findthis" when a script end tag occurs. That's why, in Visual Studio 2013, it also shows the script element right after the body opening tag in the search results.
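The over-matching can be reproduced outside Visual Studio. A minimal sketch, assuming Python's re module behaves like the Visual Studio engine for this pattern (the sample string and variable names are made up for illustration):

```python
import re

# Shortened version of the example above: a script without "findthis",
# a paragraph containing it, and a script containing it.
sample = (
    "<script>var a = 2;</script>\n"
    "<p>This findthis here should not be matched.</p>\n"
    "<script>var b = 'findthis too';</script>"
)

# Same pattern as above, without the optional escapes around < and >.
pattern = re.compile(r"(?s)<script>.*?(findthis).*?</script>")

match = pattern.search(sample)

# The lazy .*? is still allowed to walk across </script>, so the match
# starts at the first script block and swallows the paragraph as well.
print(match.start())            # 0
print("<p>" in match.group(0))  # True
```

So the unwanted hit begins at a script element that does not contain "findthis" at all.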

Can anyone help me out of this RegEx hell?

thomaskonrad
  • Regex isn't suitable for parsing HTML unfortunately. – Evan Knowles Apr 10 '15 at 09:50
  • It does not need to be a strictly correct parsing of HTML. I just want to match a string that occurs somewhere between string x and string y. And the strings x and y should be able to occur various times in the text. So it should stop searching at string y, I guess that's the hard part here. – thomaskonrad Apr 10 '15 at 09:55
  • Do you always know what tag the search word is? – Wiktor Stribiżew Apr 10 '15 at 10:02

3 Answers

5

You can use this regex to avoid matching <script> tags:

<script>((?!</?script>).)*(findthis)((?!</?script>).)*</script>

Or, a more efficient one with atomic groups:

<script>(?>(?!</?script>).)*(findthis)(?>(?!</?script>).)*</script>

I am assuming we do not want to match either opening or closing <script> tags in between, so I am using /? inside (?>(?!</?script>).)* to guard against malformed code. I repeat it after (findthis), so that we only match characters at which neither <script> nor </script> starts.

Tested in Expresso with a slightly modified input (I added < and > everywhere to simulate corruptions).
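To check the first pattern outside Visual Studio, here is a minimal sketch in Python. Python's re module is an assumption (Visual Studio uses the .NET engine), re.DOTALL stands in for (?s), and the sample is the question's example, slightly shortened:

```python
import re

# Sample adapted from the question; names are for illustration only.
HTML = """<html>
    <head>
        <script>
            var x = 1;

            if (x < 1) {
                x = 100;
            }

            var y = 'findthis'; // Should be matched
        </script>
    </head>
    <body>
        <script>
            var a = 2;
        </script>

        <p>This findthis here should not be matched.</p>

        <script>
            var b = 'findthis too'; // Should be matched, too.
        </script>

        <p>This findthis should not be matched neither.</p>
    </body>
</html>"""

# A character is consumed only if neither <script> nor </script> starts
# at that position, so a match cannot leave the current script block.
PATTERN = re.compile(
    r"<script>(?:(?!</?script>).)*(findthis)(?:(?!</?script>).)*</script>",
    re.DOTALL,
)

matches = [m.group(0) for m in PATTERN.finditer(HTML)]
print(len(matches))  # 2
```

Only the two script blocks that contain "findthis" are reported; the block with `var a = 2;` and both paragraphs are skipped.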

Wiktor Stribiżew
  • This works, thanks a lot! Would you be so kind as to explain what the second question mark in ```((?!</?script>).)``` means and why I need this directive a second time after the ```(findthis)``` portion? I'll be happy to mark your answer as the accepted one then. – thomaskonrad Apr 10 '15 at 10:21
  • I added my explanation. I can hardly think of any malformed example now, which is why I am playing it safe and matching every character in between one pair of `<script>` tags. I am open to any improvements. – Wiktor Stribiżew Apr 10 '15 at 10:26
  • 1
    See this [explanation](http://stackoverflow.com/a/6259570/1686094) and [this explanation](http://stackoverflow.com/a/406408/1686094) – Aaron Apr 10 '15 at 10:29
2

Built off of @Aaron's answer:

\<script\>(?:[^<]|<(?!\/script>))*?(findthis).*?\<\/script\>


So you can see I do (?:[^<]|<(?!\/script>)) to say "match anything that isn't a <, or a < that isn't followed by /script>".
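A quick way to try this, sketched in Python under a few assumptions: Python's re module instead of Debuggex, re.DOTALL so that the trailing .*? can cross line breaks, and a made-up sample string.

```python
import re

# Three script blocks and one paragraph; only the first and the last
# script block contain "findthis".
sample = (
    "<script>\nif (x < 1) { x = 100; }\nvar y = 'findthis';\n</script>\n"
    "<script>\nvar a = 2;\n</script>\n"
    "<p>This findthis here should not be matched.</p>\n"
    "<script>\nvar b = 'findthis too';\n</script>"
)

# Before "findthis": any non-< character, or a < that does not begin </script>.
pattern = re.compile(
    r"<script>(?:[^<]|<(?!/script>))*?(findthis).*?</script>",
    re.DOTALL,
)

found = [m.group(0) for m in pattern.finditer(sample)]
print(len(found))           # 2
print("x < 1" in found[0])  # True: a bare < inside the script is allowed
```

The comparison `x < 1` no longer blocks the match, while the closing tag still does.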

asontu
1

Maybe this works

(?s)\<script\>[^<]*?(findthis).*?\</script\>

The [^<]*? part cannot run past a < character, so it avoids crossing into another tag before it matches findthis.

See https://www.regex101.com/r/pV7iY6/1
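A small Python sketch of what this pattern does and does not match (Python's re module and the three sample strings are assumptions for illustration, not taken from the regex101 link):

```python
import re

# Same pattern, without the optional escapes around < and >.
PATTERN = re.compile(r"(?s)<script>[^<]*?(findthis).*?</script>")

plain = "<script>\nvar b = 'findthis too';\n</script>"
with_comparison = "<script>\nif (x < 1) { x = 100; }\nvar y = 'findthis';\n</script>"
skipped = "<script>\nvar a = 2;\n</script>\n<p>findthis</p>"

print(bool(PATTERN.search(plain)))            # True
print(bool(PATTERN.search(with_comparison)))  # False: the < in "x < 1" stops [^<]*?
print(bool(PATTERN.search(skipped)))          # False: </script> is reached first
```

The second case is the limitation: any < inside the script before findthis prevents a match.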

Aaron
  • You should escape the slash in the closing script tag too: `...\<\/script\>` – Carnivorus Apr 10 '15 at 10:01
  • 1
    That should work unless you have any comparison like `x < 5` in your code. – SGD Apr 10 '15 at 10:02
  • This indeed works in some cases. But can I somehow extend the ```[^<]``` part to not match a string instead of a single character? (I'm asking this because opening angle brackets also regularly occur inside JavaScript.) – thomaskonrad Apr 10 '15 at 10:03
  • I have edited my question to include that special case. – thomaskonrad Apr 10 '15 at 10:17