4

This was already asked here, but the asker was satisfied with an answer that only handles a two-character sequence. I'll repeat his basic question:

In general, is there a way to say "does not contain this string" in the same way that I can say "does not contain this character" with [^a]?

I want to create a regexp that matches two delimiting strings and everything between them, but only if no other occurrence of a given string is found inside. I would be most satisfied with a general answer to the quoted question.

Example:

The strings are "<script>" and "</script>"

It should match

"<script> something something </script>"

but not

"<script> something <script> something something </script>"
naugtur
  • 1
    Are you trying to parse HTML? If so, you should better use an HTML parser. – Gumbo Feb 25 '10 at 09:25
  • No, I'm trying to filter out some stuff. This is just an example – naugtur Feb 25 '10 at 09:29
  • If you are trying to filter or sanitize html, you should still use a parser – Otto Allmendinger Feb 25 '10 at 11:56
  • It's just for removing stuff for viewer comfort, yet still - any suggestions for parsers written in JavaScript? :P [I too recommend parsers when I see a server-side regex on HTML] – naugtur Feb 25 '10 at 12:16
  • 1
    @naugtur: when the stuff you remove is HTML, you are better off with a parser. There are JS HTML parsers out there http://www.google.com/search?q=javascript+html+parser – Otto Allmendinger Feb 25 '10 at 13:17
  • +1 I didn't expect that. [no one expects the Spanish Inquisition...] – naugtur Feb 25 '10 at 13:52
  • Yet still what I needed is easily (and at this moment already, thx) done with a nice regexp. I just show the content without scripts for a moment. The regexp output is never saved anywhere. – naugtur Feb 25 '10 at 13:55
  • Anybody fancy comparing Alan's and Otto's expressions? I sense a little difference in behaviour, but I'm not sure. I've done what I needed, so it'd be for the future generations ;) – naugtur Feb 25 '10 at 13:59
  • naugtur: Alan's expression doesn't match when `` is in the middle, mine is only filters ` – Otto Allmendinger Feb 25 '10 at 22:30
  • Yeah, this I get. I was curious about the (?:(?! versus just (?!. But they're probably the same. I think it's just me lacking some reading on regex. BTW. Your willingness to answer it to the last bit is what I like here in stackoverflow ;) – naugtur Feb 25 '10 at 22:44
  • Please take a look at [this](http://stackoverflow.com/questions/436850/matching-a-line-that-doesnt-contain-specific-text-with-regular-expressions) question – YOU Feb 25 '10 at 09:09
  • Yeah, I didn't find that. it started with matching a line, and i must have skipped reading the rest of it ;) – naugtur Feb 25 '10 at 09:32

3 Answers

4

Did you read my answer to that question? It gives a more general solution. In your case it would look like this:

(?s)<script>(?:(?!</?script>).)*</script>

In other words: match the opening sequence; then match one character at a time, after ensuring that it's not the beginning of another opening sequence or of the closing sequence; then match the closing sequence.
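A minimal sketch of the same pattern in JavaScript, which is an assumption about the target engine based on the comments above. JavaScript does not accept the inline `(?s)` prefix, so the sketch uses the `s` (dotAll) regex flag from ES2018 instead; the variable names are illustrative only.

```javascript
// Consume one character at a time, checking before each one that
// neither "<script>" nor "</script>" starts at the current position.
// The `s` flag lets `.` match line breaks (stand-in for the inline (?s)).
const innermostScript = /<script>(?:(?!<\/?script>).)*<\/script>/s;

const ok = "<script> something something </script>";
const nested = "<script> something <script> something something </script>";

const m1 = ok.match(innermostScript);
const m2 = nested.match(innermostScript);

console.log(m1[0]);    // the whole input string
console.log(m2[0]);    // "<script> something something </script>"
console.log(m2.index); // 19, the position of the second "<script>"
```

Because the pattern is not anchored, the second input still produces a match, but only for the innermost pair of tags; anchoring with `^` and `$` would reject that input entirely.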

Alan Moore
  • I still don't understand what is going on in the parentheses and why they don't match, but I'll figure it out. thanx – naugtur Feb 25 '10 at 09:31
  • 1
    This regex has unbalanced parentheses. When I fix the expression, it doesn't match either of the strings. – Otto Allmendinger Feb 25 '10 at 09:32
  • @naugtur, I fixed the missing parenthesis. It might still not work, in which case your start and end tags are probably on separate lines. Try prepending `(?s)` to the proposed regex, which will let the DOT metacharacter also match line breaks: `(?s)<script>(?:(?!</script>).)*</script>` – Bart Kiers Feb 25 '10 at 10:30
  • Mea culpa! I should have tested it, even if I *have* posted it a dozen times before. Thanks, Bart. – Alan Moore Feb 25 '10 at 10:39
  • No problem Alan, it's comforting to see guys like you also make these (little) mistakes! ;) – Bart Kiers Feb 25 '10 at 10:49
  • The negative lookahead should be for `<script>` – Otto Allmendinger Feb 25 '10 at 11:55
  • @Otto: actually, it should be for both: `(?!</?script>)`; that matches the innermost set of possibly nested tags. Of course, ` – Alan Moore Feb 25 '10 at 12:38
  • In practice it's for if You assume the tags have any sense. It's my example that is rather silly ;) First thing I've changed when using it was looking for ! – naugtur Feb 25 '10 at 22:37
1

The correct expression for your problem is

"^<script>((?!<script>).)*</script>$"

This shouldn't be used for html manipulation. This doesn't address cases like

<script> foo <script type="javascript"> bar </script>

and many others. A parser is the correct solution here.

The more general expression for matching strings beginning with START, ending with END without the specific character sequence foobar in-between is:

"^START((?!foobar).)*END$"
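The general form above can be sketched as a small JavaScript helper. The function names `escapeRegExp` and `betweenWithout` are hypothetical, introduced only for illustration, and the sketch assumes all three inputs are literal strings rather than patterns. It uses `[\s\S]` instead of `.` so that the body may span line breaks.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal string.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build ^START((?!forbidden).)*END$ from literals.
function betweenWithout(start, end, forbidden) {
  // Before consuming each character, assert that `forbidden`
  // does not begin at the current position.
  const body = "(?:(?!" + escapeRegExp(forbidden) + ")[\\s\\S])*";
  return new RegExp("^" + escapeRegExp(start) + body + escapeRegExp(end) + "$");
}

const re = betweenWithout("START", "END", "foobar");
console.log(re.test("START some text END"));        // true
console.log(re.test("START some foobar text END")); // false
```

Calling `betweenWithout("<script>", "</script>", "<script>")` gives the script-tag expression from this answer, with the same limitations regarding attributes and real HTML.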
Otto Allmendinger
  • I tuned it up and the input is a bit different, so there is no need to worry about html content. – naugtur Feb 25 '10 at 13:54
1

Use negative lookahead. Lookarounds give zero-width matches, meaning that they don't consume any characters in the source string.

var s1 = "some long string with the CENSORED word";
var s2 = "some long string without that word";
console.log(s1.match(/^(?!.*CENSORED).*$/));//no match
console.log(s2.match(/^(?!.*CENSORED).*$/));//matches the whole string

The syntax for negative lookahead is (?!REGEX). It checks whether REGEX matches at the current position and fails if it does. Positive lookahead, (?=REGEX), is the opposite: it succeeds only if REGEX matches at the current position.
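A short sketch of the difference between the two forms, and of what "zero width" means in practice. The sample strings are made up for illustration.

```javascript
// Positive lookahead: "foo" must be followed by "bar",
// but "bar" is not part of the match (zero width).
const positive = "foobar".match(/foo(?=bar)/);
console.log(positive[0]); // "foo"

// Negative lookahead: "foo" must NOT be followed by "bar".
const rejects = /foo(?!bar)/.test("foobar");
const accepts = /foo(?!bar)/.test("foobaz");
console.log(rejects, accepts); // false true
```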

Amarghosh