
I want to remove from my pages every script block that contains the word "site":

<scritp>
o.com
bla bla bla
</script><p>this is line></p>

<script>
google.com/jquery.js !
</script>

<scritp>
site.com
bla bla bla
</script><p>aa</p>

CONTENT
STYLE
SIDEBAR
...


<scritp>
site.com
aaa bla bla bla
</script><p>a</p>

I am using the following regular expression:

<scritp>.*?site.*?<\/script>

But the match also includes lines that do not belong to the script block I am targeting.

debug link : https://regex101.com/r/rC0vF8/2

How can I make the match stop at the first </script> it finds?

I want to match every <script> block that contains site.com in one pass.

BrokenBinary
user3325376

2 Answers


It is confusing that your sample and demo contain both scritp and script. Is that intentional? In any case, you can use a negative lookahead, if that suits your input:

<script>((?!</script).)*?site(?1)*</script>
  • ((?!</script).)*? lazily matches any characters, one at a time, as long as </script is not ahead
  • until it reaches site; (?1)* then reuses the pattern of the first group, greedily, up to </script>.
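
The same idea can be tried outside regex101. The following is only a sketch in Python, under a few assumptions: Python's `re` module has no `(?1)` subroutine call, so the group is written out twice; the opening tag accepts both the `script` and `scritp` spellings from the sample; and the `html` string is a shortened copy of the sample.

```python
import re

# Shortened copy of the sample from the question (tag spelling kept as posted).
html = (
    "<scritp>\no.com\nbla bla bla\n</script><p>this is line></p>\n"
    "<script>\ngoogle.com/jquery.js !\n</script>\n"
    "<scritp>\nsite.com\nbla bla bla\n</script><p>aa</p>\n"
)

# One character, but only if "</script" does not start at this position.
TOKEN = r"(?:(?!</script).)"

pattern = re.compile(
    r"<scri(?:pt|tp)>"       # assumption: accept both spellings from the sample
    + TOKEN + r"*?site"      # lazily up to "site", never crossing "</script"
    + TOKEN + r"*</script>", # greedily up to the closing tag of the same block
    re.S | re.I,             # "." matches newlines; case-insensitive
)

cleaned = pattern.sub("", html)
print(cleaned)
```

Only the block that contains site.com is removed; the o.com and google.com blocks stay, because the lookahead prevents a match from running past a closing tag.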

More explanation and demo at regex101

For this kind of problem a parser-based solution is usually preferable, depending on the input.
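
As a rough illustration of the parser route, here is a sketch using Python's standard `html.parser`. The names `ScriptFilter` and `remove_scripts` are made up for this example, and it assumes the tags are spelled `<script>` correctly (a parser treats `<scritp>` as an ordinary unknown tag). Tags are re-serialized rather than copied byte for byte, so unusual spacing inside end tags may come out normalized.

```python
from html.parser import HTMLParser


class ScriptFilter(HTMLParser):
    """Re-emit HTML, dropping <script> elements that contain `needle`."""

    def __init__(self, needle):
        super().__init__(convert_charrefs=False)
        self.needle = needle
        self.out = []        # pieces of the document that are kept
        self.script = None   # pieces of the <script> element being buffered

    def _emit(self, text):
        target = self.out if self.script is None else self.script
        target.append(text)

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        if tag == "script" and self.script is None:
            self.script = [raw]          # start buffering this element
        else:
            self._emit(raw)

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script" and self.script is not None:
            block = "".join(self.script) + "</script>"
            self.script = None
            if self.needle not in block:  # keep scripts without the needle
                self.out.append(block)
        else:
            self._emit(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")

    def result(self):
        self.close()
        # An unclosed <script> at end of input is kept as-is.
        return "".join(self.out + (self.script or []))


def remove_scripts(html, needle="site"):
    parser = ScriptFilter(needle)
    parser.feed(html)
    return parser.result()


page = (
    "<p>a</p><script>\nsite.com\nbla\n</script>"
    '<script src="google.com/jquery.js"></script><p>b &amp; c</p>'
)
print(remove_scripts(page))
```

Because the start tag is part of the buffered block, a script that only mentions the needle in its `src` attribute is dropped as well.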

bobble bubble
  • Your code worked perfectly on regex101.com, but unfortunately it does not work in PHP: the page never finishes loading. You can test it in the code generator: [link](http://regex101.com/r/eN8oW9/1) – user3325376 Jun 01 '16 at 22:00
  • @user3325376 Hmm, it works for me. Try [`$str = preg_replace('~~si', "__REMOVED__", $str);`](https://eval.in/581644). Maybe you had `/` as pattern delimiter or some limits set too low in php.ini – bobble bubble Jun 01 '16 at 22:07
  • This regex is more readable, but less efficient than what I suggested in the comment. A tempered greedy token should be greedy, shouldn't it? :-) Anyway, I suggest unrolling these constructs. – Wiktor Stribiżew Jun 02 '16 at 07:02
  • @WiktorStribiżew Why not just unroll your answer? Imho throwing in a huge regex comment without any explanation does not really help. I posted the answer because I thought it's a compromise between performance and readability, and not so hard to understand. I know that your regex performs better. If you think you learnt some nice terminology from rexegg or elsewhere for a common practice, that's good for you (: I just wrote that from feeling. If it's wrong or can be improved, let me know. – bobble bubble Jun 02 '16 at 11:58
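
For readers wondering what "unrolling" means in the comments above, here is one possible sketch in Python. It is not the regex from Wiktor Stribiżew's comment (which is not reproduced in this thread), just an illustration of the technique: the per-character lookahead is replaced by negated character classes, and a lookahead only runs at a `<` or an `s`. The sample string is made up and uses the correct `<script>` spelling.

```python
import re

# Made-up sample: one block without "site", one block with it.
html = (
    "<script>\no.com\nsome <b>x</b> stuff\n</script><p>keep</p>\n"
    "<script>\nsite.com\nbla bla bla\n</script><p>aa</p>\n"
)

# Tempered token from the answer, with the group written out
# because Python's re module has no (?1) subroutine call.
tempered = re.compile(
    r"<script>(?:(?!</script).)*?site(?:(?!</script).)*</script>", re.S
)

# Unrolled variant (case-sensitive here, for simplicity).
unrolled = re.compile(
    r"<script>"
    r"[^<s]*(?:(?:<(?!/script)|s(?!ite))[^<s]*)*"  # up to the first "site", never past "</script"
    r"site"
    r"[^<]*(?:<(?!/script)[^<]*)*"                 # rest of the block
    r"</script>"
)

print(unrolled.sub("", html) == tempered.sub("", html))
```

On this sample both patterns remove the same block; the unrolled one is harder to read, which is the trade-off discussed in the comments.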

Use this regex instead: /<scritp>\nsite.*?<\/script>/gsi

Your regex starts at the first <scritp>, runs across any blocks in between until the next occurrence of site, and only then stops at the following </script>.
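
The difference can be checked with a small Python sketch. It assumes the `s` and `i` flags from the regex above (`re.S | re.I` in Python) and a shortened copy of the question's sample; `findall` stands in for the `g` flag.

```python
import re

# Shortened copy of the sample from the question (tag spelling kept as posted).
html = (
    "<scritp>\no.com\nbla bla bla\n</script><p>this is line></p>\n"
    "<script>\ngoogle.com/jquery.js !\n</script>\n"
    "<scritp>\nsite.com\nbla bla bla\n</script><p>aa</p>\n"
    "<scritp>\nsite.com\naaa bla bla bla\n</script><p>a</p>\n"
)

original = re.compile(r"<scritp>.*?site.*?</script>", re.S | re.I)
anchored = re.compile(r"<scritp>\nsite.*?</script>", re.S | re.I)

# The original pattern starts at the very first <scritp> and swallows
# the o.com and google.com blocks on its way to "site".
first = original.search(html).group(0)
print(repr(first))

# The anchored pattern only starts where "site" directly follows the tag.
blocks = anchored.findall(html)
print(blocks)
```

Note that the anchored pattern relies on site being the first thing on the line right after the opening tag; a block that mentions site further down, or a file with `\r\n` line endings, would not match.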

Felippe Duarte