JavaScript Regular Expressions - how to NOT match a substring between < and >

Question

I'm using this regular expression:

var regex = /\<.*?.\>/g

to match with this string:

var str = 'This <is> a string to <use> to test the <regular> expression'

using a simple match:

str.match(regex)

and, as expected, I get:

["<is>", "<use>", "<regular>"]

(But without the backslashes, sorry for any potential confusion)

How can I get the reverse result? i.e. what regular expression do I need that does not return those items contained between < and >?

I tried /(^\<.*?\>)/g and various other similar combos including square brackets and stuff. I've got loads of cool results, just nothing that is quite what I want.

Where I'm going with this: Basically I want to search and replace occurrences of substrings, but I want to exclude some of the search space, probably using < and >. I don't really want a destructive method, as I don't want to break apart strings, change them, and worry about reconstructing them.

Of course I could do this 'manually' by searching through the string but I figured regular expressions should be able to handle this rather well. Alas, my knowledge is not where it needs to be!!

You mean it returns `["<is>", "<use>", "<regular>"]`? By the way, no need to escape < and > — Ruan Mendes, Nov 07 '12 at 21:40
Ta, oh yeah, doh! And yup, that is what I mean! Couldn't get the formatting right — Matt Styles, Nov 07 '12 at 21:48
@MattStyles, I know you already picked an answer but check out my solution. Hope it helps! Cheers — cbayram, Nov 07 '12 at 22:39
@cbayram Thanks for the input, that looks like a great solution as well — Matt Styles, Nov 08 '12 at 11:20

Ruan Mendes · Accepted Answer · 2012-11-07T22:34:22.120

Here's a way to do custom replacement of everything outside of the tags, and to strip the tags from the tagged parts: http://jsfiddle.net/tcATT/

var str = 'This <is> a string to <use> to test the <regular> expression';
// The regular expression matches everything, but each val is either a
// tagged value (<is> <regular>), or the text you actually want to replace
// you need to decide that in the replacer function
console.log(str.replace( /[^<>]+|<.*?>/g, function(val){
    if(val.charAt(0) == '<' && val.charAt(val.length - 1) == '>') {
      // Just strip the < and > from the ends
      return val.slice(1,-1);
    } else {
      // Do whatever you want with val here, I'm upcasing for simplicity
      return val.toUpperCase(); 
    }
} ));
// outputs: "THIS is A STRING TO use TO TEST THE regular EXPRESSION"

To generalize it, you could use

function replaceOutsideTags(str, replacer) {
    return str.replace( /[^<>]+|<.*?>/g, function(val){
        if(val.charAt(0) == '<' && val.charAt(val.length - 1) == '>') {
          // Just strip the < and > from the ends
          return val.slice(1,-1);
        } else {
          // Let the caller decide how to replace the parts that need replacing
          return replacer(val); 
        }
    })
}
// And call it like
console.log(
    replaceOutsideTags( str, function(val){
        return val.toUpperCase();
    })
);
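If the tagged parts should stay exactly as they are (brackets included), which is what the follow-up comments on the question suggest, the same alternation can be reused with a replacer that returns tagged matches untouched. A minimal sketch, assuming a hypothetical helper name `replaceOutsideTagsKeepingTags` that is not part of the answer above:

```javascript
// Hypothetical variant of replaceOutsideTags: tagged parts are returned
// unchanged (brackets included) instead of being stripped.
function replaceOutsideTagsKeepingTags(str, replacer) {
    return str.replace(/[^<>]+|<.*?>/g, function (val) {
        if (val.charAt(0) === '<' && val.charAt(val.length - 1) === '>') {
            return val; // protected part: leave it alone
        }
        return replacer(val); // unprotected part: let the caller decide
    });
}

// Made-up sample with "test" both inside and outside a tag
var sample = 'This <is a test> string to test the <regular> expression';
var replaced = replaceOutsideTagsKeepingTags(sample, function (val) {
    return val.replace(/test/g, 'TADA');
});
console.log(replaced);
// "This <is a test> string to TADA the <regular> expression"
```

Only the `test` outside the brackets is replaced, and the rest of the string is left byte-for-byte as it was.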

That looks brilliant! I'll take some proper time to understand how it works tomorrow! I'm almost entirely certain that will solve the real problem I'm working on. Thanks for taking the time! — Matt Styles, Nov 07 '12 at 22:33

score 3 · Answer 2 · answered Nov 07 '12 at 22:28

If I understand correctly, you want to apply some custom processing to a string, except for the parts that are protected (enclosed in < and >)? If this is the case, you could do it like this:

// The function that processes unprotected parts
function process(s) {
    // an example could be transforming whole part to uppercase:
    return s.toUpperCase();
}

// The function that splits string into chunks and applies processing
// to unprotected parts
function applyProcessing (s) {
    var a = s.split(/<|>/),
        out = '';

    for (var i=0; i<a.length; i++)
        out += i%2
                ? a[i]
                : process(a[i]);

    return out;
}

// now we just call the applyProcessing()
var str1 = 'This <is> a string to <use> to test the <regular> expression';
console.log(applyProcessing(str1));
// This outputs:
// "THIS is A STRING TO use TO TEST THE regular EXPRESSION"

// and another string:
var str2 = '<do not process this part!> The <rest> of the a <string>.';
console.log(applyProcessing(str2));
// This outputs:
// "do not process this part! THE rest OF THE A string."

This is basically it. It returns the whole string with the unprotected parts processed.

Please note that the splitting will not work correctly if the angle brackets (< and >) are not balanced.
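To see why balance matters: the split relies on the odd-indexed chunks being the protected ones, so a single stray bracket shifts every chunk after it. The sketch below shows the misalignment and adds a guard. `hasBalancedBrackets` is a hypothetical helper, not part of the answer above, and it also rejects nested brackets.

```javascript
// A stray ">" shifts the chunk indices, so the wrong parts look "protected".
var chunks = 'a > b <keep> c'.split(/<|>/);
console.log(chunks); // ["a ", " b ", "keep", " c"]
// Index 1 (" b ") is now treated as protected, and "keep" would be processed.

// Hypothetical guard: true only if the string consists of plain text and
// well-formed <...> groups, with no stray or nested brackets.
function hasBalancedBrackets(s) {
    return /^(?:[^<>]|<[^<>]*>)*$/.test(s);
}

console.log(hasBalancedBrackets('This <is> fine')); // true
console.log(hasBalancedBrackets('a > b <keep> c')); // false
```

Calling such a check before `applyProcessing` is one way to fail early instead of silently processing the wrong chunks.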

There are various places that could be improved but I'll leave that as an exercise to the reader. ;p

I'm all over exercises! Thanks dude! — Matt Styles, Nov 07 '12 at 22:38

score 3 · Answer 3 · answered Nov 08 '12 at 00:09

This is a perfect application for passing a regex argument to the core String.split() method:

var results = str.split(/<[^<>]*>/);

Simple!
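One thing to be aware of: a plain split discards the tagged parts, so the original string cannot be rebuilt from `results` alone. If the pieces need to be reassembled later, wrapping the pattern in a capturing group makes `split` keep the delimiters in the result array. A sketch under that assumption (the variable names `parts` and `rebuilt` are illustrative):

```javascript
var str = 'This <is> a string to <use> to test the <regular> expression';

// Without a capturing group the tags are dropped.
var results = str.split(/<[^<>]*>/);
console.log(results);
// ["This ", " a string to ", " to test the ", " expression"]

// With a capturing group the tags are kept, at the odd indices.
var parts = str.split(/(<[^<>]*>)/);
// ["This ", "<is>", " a string to ", "<use>", " to test the ", "<regular>", " expression"]

// Process only the even-indexed (untagged) parts, then rejoin.
var rebuilt = parts.map(function (part, i) {
    return i % 2 ? part : part.replace(/test/g, 'TADA');
}).join('');
console.log(rebuilt);
// "This <is> a string to <use> to TADA the <regular> expression"
```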

ridgerunner

score 1 · Answer 4 · answered Nov 07 '12 at 21:40

Using the variables you've already created, try using replace. It's non-destructive, too.

str.replace(regex, '');
--> "This  a string to  to test the  expression"

Brian Ustas

How do you get the matched stuff back after the OP replaces what they need (outside of the tags) though? The OP wants to keep it – Ruan Mendes Nov 07 '12 at 21:42
Thanks, I got to that stage but what I really want to do is replace 'test' with 'TADA' in that string but still have the stuff in <> included. I could just search for 'test' of course, but it might be contained in < and > so would still get replaced. Basically I want to replace 'test' everywhere that IS NOT within < > – Matt Styles Nov 07 '12 at 21:49
@MattStyles Gotcha. To solve the problem you just described, try `str.replace(/([^<])test([^<])/g, '$1TADA$2');` There's probably a better way, however... – Brian Ustas Nov 07 '12 at 22:00
I was thinking along the lines of using $1 etc in some way. That method works nice but I tested it with this string 'str = "This a test that is this regex is the right one" 'and it, unfortunately, changes the 'test' within < > so I end up with ""This a TADA that is this regex is the right one" – Matt Styles Nov 07 '12 at 22:05

score 1 · Answer 5 · answered Nov 07 '12 at 22:37

/\b[^<\W]\w*(?!>)\b/g

This works, test it out:

var str = 'This <is> a string to <use> to test the <regular> expression.';
var regex = /\<.*?.>/g;
console.dir(str.match(regex));
var regex2 = /\b[^<\W]\w*(?!>)\b/g;
console.dir(str.match(regex2));
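For the sample string, the second pattern returns only the words outside the brackets. Note that the lookahead only inspects the character right after each word, so it cannot protect every word of a multi-word tag. A short illustration (the second test string is made up for this example):

```javascript
var regex2 = /\b[^<\W]\w*(?!>)\b/g;

var single = 'This <is> a string to <use> to test the <regular> expression.';
var singleWords = single.match(regex2);
console.log(singleWords);
// ["This", "a", "string", "to", "to", "test", "the", "expression"]

// Made-up input with two words inside one tag: only the word directly
// before ">" is skipped, so "not" leaks into the result.
var multi = 'keep <not this> word';
var multiWords = multi.match(regex2);
console.log(multiWords);
// ["keep", "not", "word"]
```

So this approach fits single-word tags; for tags that may contain spaces, the split- or replace-based answers are the safer choice.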

cbayram

AHM · Answer 6 · 2012-11-07T22:33:49.353

-1

Ah, okay, sorry - I misunderstood your question. This is a difficult problem to solve with pure regular expressions in JavaScript, because JavaScript doesn't support lookbehinds (as of this writing in 2012), and usually I think I would use lookaheads and lookbehinds to solve this. A (sort of contrived) way of doing it would be something like this:

str.replace(/((?:<[^>]+>)?)([^<]*)/g, function (m, sep, s) { return sep + s.replace('test', 'FOO'); })

// --> "This <is> a string to <use> to FOO the <regular> expression"

This also works on strings like "This test <is> a string to <use> to test the <regular> expression", and if you use /test/g instead of 'test' in the replacer function, it will also turn

"This test <is> a string to <use> to test the test <regular> expression"

into

"This FOO <is> a string to <use> to FOO the FOO <regular> expression"
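To see how the pattern carves up the string, it can help to log the two capture groups on each match: `sep` is an optional leading tag and `s` is the untagged text that follows it. A small sketch (the `pairs` array exists only for illustration):

```javascript
var str = 'This test <is> a string to <use> to test the test <regular> expression';
var pairs = [];

var out = str.replace(/((?:<[^>]+>)?)([^<]*)/g, function (m, sep, s) {
    // Skip the empty match the pattern produces at the end of the string.
    if (m !== '') {
        pairs.push([sep, s]);
    }
    return sep + s.replace(/test/g, 'FOO');
});

console.log(pairs);
// [["", "This test "], ["<is>", " a string to "],
//  ["<use>", " to test the test "], ["<regular>", " expression"]]
console.log(out);
// "This FOO <is> a string to <use> to FOO the FOO <regular> expression"
```

Because each tag is captured in `sep` and returned unchanged, only the text in `s` is ever passed to the inner replace.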

UPDATE

And something like this would also strip the <> characters:

str.replace(/((?:<[^>]+>)?)([^<]*)/g, function (m, sep, s) { return sep.replace(/[<>]/g, '') + s.replace(/test/g, 'FOO'); })

"This test <is> a string to <use> to test the test <regular> expression"
--> "This FOO is a string to use to FOO the FOO regular expression"

edited Nov 07 '12 at 22:33

answered Nov 07 '12 at 21:47

How is this any different than what depot posted? How's the OP supposed to get the `["<is>", "<use>", "<regular>"]` back into the string? This gets a -1 because it duplicated the problem of an existing answer; prove me wrong and you get the upvote – Ruan Mendes Nov 07 '12 at 21:49
Okay, sorry, I deserved that -1. I guess I didn't understand the question before I read the comments left at depot's answer. I have updated my answer to something better :-) – AHM Nov 07 '12 at 22:22
Ah, okay, and it also needs to strip the separator characters :-) – AHM Nov 07 '12 at 22:25
I think my answer is doing what the OP asked for, what do you think? – Ruan Mendes Nov 07 '12 at 22:25

Agnislav · Answer 7 · 2012-11-07T23:24:22.833

-1

Try this regex:

\b\w+\b(?!>)

UPDATE

To support spaces inside brackets, try this one. It's not a pure regex match, but it works and it's much simpler than the answer above:

alert('This <is> a string to <use use> to test the <regular> expression'.split(/\s*<.+?>\s*/).join(' '));

edited Nov 07 '12 at 23:24

answered Nov 07 '12 at 21:53

What? Please explain, I can't make sense of it, tempted to vote down – Ruan Mendes Nov 07 '12 at 21:57
I just get back, which makes sense to me. Can't get it to help me though, what am I missing? – Matt Styles Nov 07 '12 at 21:59
It's just a look-ahead assertion. We catch all content in < > except strings listed after ?! and separated by pipe | . See [this link](http://stackoverflow.com/questions/611883/regex-how-to-match-everything-except-a-particular-pattern) for additional reference – Agnislav Nov 07 '12 at 21:59
Sorry, it seems I misunderstood the question. I thought you'd like to get all entries except few pre-defined. Changed. – Agnislav Nov 07 '12 at 22:19
That's brilliant! However, when I test with a complex str such as ""This a test that is this regex is the test right one" then it struggles with the space between the middle < > . Do I need to add a \s somewhere? – Matt Styles Nov 07 '12 at 22:27
Oh. Using spaces makes the regexp much more complicated. I'm not even sure that I can do this using js regexp. With perl - np =) – Agnislav Nov 07 '12 at 23:03
Added one more solution. Try it! – Agnislav Nov 07 '12 at 23:24
That works great for the simple example but the splitting and joining deforms the original string and I eventually need to run another replace, but only on the bits not in the <> so I can't have any other deformation of the string. A pain! Looks like the solutions above will work though but thanks very much for taking the time to answer, it's all good to look at to learn more about regex – Matt Styles Nov 08 '12 at 11:40
