
I'm trying to return a count of all words NOT between square brackets. So given:

[don't match these words] but do match these

I should get a count of 4, for the last four words.

This works in .NET:

\b(?<!\[)[\w']+(?!\])\b

but it won't work in JavaScript because it doesn't support lookbehind.

Any ideas for a pure JavaScript regex solution?

Chris Webb
  • It is surprising that it worked for you in .NET in the first place. I can't see anything in the regex that would stop it from matching 'match', 'these', or even 't', unless you only ever had 2 words in the brackets – Joanna Derks May 04 '12 at 12:23

3 Answers


Ok, I think this should work:

\[[^\]]+\](?:^|\s)([\w']+)(?!\])\b|(?:^|\s)([\w']+)(?!\])\b

You can test it here:
http://regexpal.com/

If you need an alternative with text in square brackets coming after the main text, it could be added as a second alternative and the current second one would become third.
It's a bit complicated but I can't think of a better solution right now.

If you need to do something with the actual matches you will find them in the capturing groups.

UPDATE:

Explanation: So, we've got two options here:

  1. \[[^\]]+\](?:^|\s)([\w']+)(?!\])\b

This is saying:

  • \[[^\]]+\] - match everything in square brackets (don't capture)
  • (?:^|\s) - followed by line start or a space. Looking at it now, the caret doesn't make sense here, so this can become just \s
  • ([\w']+) - match all following word characters, as long as (?!\]) the next character is not the closing bracket. This lookahead is probably also unnecessary now, so let's try removing it
  • \b - and match word boundary

  2. (?:^|\s)([\w']+)(?!\])\b

If option 1 does not match, just do the word matching without looking for square brackets, since the first part already ensured that they are not here.

OK, so I removed all the things that we don't need (they were left over because I tried quite a few options before it worked :-) ), and the revised regex is the one below:

\[[^\]]+\]\s([\w']+)(?!\])\b|(?:^|\s)([\w']+)\b
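To show how the capturing groups could be read in practice, here is a minimal sketch that runs the revised regex with the global flag and collects whichever group captured. The function name `collectCapturedWords` is made up for illustration, and the sketch has only been checked against the sample string from the question, not against every bracket layout.

```javascript
// Revised pattern from this answer, with the global flag added so that
// matchAll can walk through every match in the text.
const twoAlternatives = /\[[^\]]+\]\s([\w']+)(?!\])\b|(?:^|\s)([\w']+)\b/g;

// Hypothetical helper: returns the words captured by group 1 or group 2.
function collectCapturedWords(text) {
  const words = [];
  for (const m of text.matchAll(twoAlternatives)) {
    // Group 1 is set when a bracketed block precedes the word,
    // group 2 when the word follows the line start or whitespace.
    const word = m[1] !== undefined ? m[1] : m[2];
    if (word !== undefined) words.push(word);
  }
  return words;
}

const sample = "[don't match these words] but do match these";
console.log(collectCapturedWords(sample).length);
```

The count is simply the length of the returned array.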
Joanna Derks

I would use something like \[[^\]]*\] to remove the words between square brackets, and then split the resulting string on whitespace to count the remaining words.
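A minimal sketch of that two-step idea in JavaScript, assuming words are separated by whitespace. The function name `countWordsTwoStep` is invented for this example.

```javascript
// Hypothetical helper: strip bracketed groups, then count what is left.
function countWordsTwoStep(text) {
  // Step 1: replace every [bracketed group] with a space so that the
  // neighbouring words do not get glued together.
  const withoutBrackets = text.replace(/\[[^\]]*\]/g, " ");
  // Step 2: split on runs of whitespace and drop the empty strings that
  // appear when the text starts or ends with whitespace.
  return withoutBrackets.split(/\s+/).filter(Boolean).length;
}

console.log(countWordsTwoStep("[don't match these words] but do match these"));
```

Note that this counts whitespace-separated tokens, so punctuation standing on its own would also count as a word.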

sp00m
  • I'm ideally looking for a regex that does it in one hit. I've already got a 2-step solution, but I'm looking for efficiency because it runs every time someone types in a text box. Thanks for responding – Chris Webb May 04 '12 at 11:51
  • Wow, using only one regex in JS is quite difficult in fact, I can't find a solution yet! – sp00m May 04 '12 at 12:23

Chris, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a general question about how to exclude patterns in regex.)

Here's our simple regex (see it at work on regex101, looking at the Group captures in the bottom right panel):

\[[^\]]*\]|(\b\w+\b)

The left side of the alternation matches complete [bracketed groups]. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right words because they were not matched by the expression on the left.

This program shows how to use the regex (see the count result in the online demo):

<script>
var subject = '[match ye not these words] but do match these';
var regex = /\[[^\]]*\]|(\b\w+\b)/g;
var group1Caps = [];
var match = regex.exec(subject);

// put Group 1 captures in an array
while (match != null) {
    if( match[1] != null ) group1Caps.push(match[1]);
    match = regex.exec(subject);
}


document.write("<br>*** Number of Matches ***<br>");
document.write(group1Caps.length);

</script>
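The same idea can be wrapped in a reusable function. The sketch below is a variation, not the program above: it assumes an engine that supports String.prototype.matchAll, and it swaps \b\w+\b for the question's [\w']+ so that "don't" counts as one word. The name `countWordsOutsideBrackets` is made up for illustration.

```javascript
// Hypothetical helper: counts words that are not inside [square brackets].
function countWordsOutsideBrackets(text) {
  // Left side consumes a whole bracketed group without capturing it;
  // right side captures a word (letters, digits, underscore, apostrophe).
  const regex = /\[[^\]]*\]|([\w']+)/g;
  let count = 0;
  for (const m of text.matchAll(regex)) {
    // Group 1 is undefined when the match was a bracketed group.
    if (m[1] !== undefined) count += 1;
  }
  return count;
}

console.log(countWordsOutsideBrackets("[don't match these words] but do match these"));
```

One design consequence: an opening bracket that is never closed does not match the left side, so the words after it are counted.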

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

zx81