
I have just started using mark.js (https://markjs.io), and right now I am trying to find the right RegEx to capture as little data as possible (non-greedy type), and no more than a certain number of characters.

I tried multiple options, and so far I have these three regular expressions, but each has its own faults:

  1. w(.{1,30})?3 - captures 'word1 word2 word3 wo rd3', instead of 'word3' and 'wo rd3';

  2. w(\w{1,30})?3 - captures 'word3' as it should, but fails for 'wo rd3';

  3. w((\w| ){1,30})?3 - this behaves exactly like the 1st option above.

For a better understanding, please run the code below.

What am I missing here?

Alex

var regex = /w(.{1,30})?3/i;
// var regex = /w(\w{1,30})?3/i;
// var regex = /w((\w| ){1,30})?3/i;

var instance = new Mark(document.querySelector("body"));
instance.markRegExp(regex, {
  className: "mark"
});

.mark {
  color: white;
  background: red;
}

<script src="https://cdn.jsdelivr.net/mark.js/8.8.3/mark.min.js"></script>

<div>word1 word2 word3 wo rd3 word4</div>
Alex

1 Answer


w(.{1,30})?3 - captures 'word1 word2 word3 wo rd3', instead of 'word3' and 'wo rd3';

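Yes, because .{1,30} is greedy: it matches as many characters as it can, up to 30 (any character other than line terminators). The ? after the closing parenthesis only makes the group optional; it does not make the quantifier lazy. And since you have only 22 characters between the first w and the last 3, the match runs all the way to the last 3.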
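This can be checked outside mark.js with plain String.prototype.match. A minimal sketch, using the sample text from the question (the variable names are only illustrative):

```javascript
// Sample text from the question.
const text = "word1 word2 word3 wo rd3 word4";

// The ? after the group makes the group optional; .{1,30} itself stays greedy.
const greedy = text.match(/w(.{1,30})?3/i);

console.log(greedy[0]);        // "word1 word2 word3 wo rd3"
console.log(greedy[1].length); // 22 characters captured between the first w and the last 3
```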

w(\w{1,30})?3 - captures 'word3' as it should, but fails for 'wo rd3';

Yes, because \w only matches word characters, not whitespace.

w((\w| ){1,30})?3 - this behaves exactly like the 1st option above.

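Yes, because on this input (\w| ) matches exactly the same characters as . does: the text contains only word characters and spaces. (In general, . also matches \t, punctuation, and anything else except line terminators.)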
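The three attempts can be compared side by side with String.prototype.matchAll. This is only a sketch: the g flag is added here because matchAll requires it, and allMatches is a hypothetical helper name, not part of mark.js.

```javascript
// Sample text from the question.
const text = "word1 word2 word3 wo rd3 word4";

// Hypothetical helper: collect the full text of every match of a global regex.
const allMatches = (re) => [...text.matchAll(re)].map((m) => m[0]);

const anyChar = allMatches(/w(.{1,30})?3/gi);          // option 1
const wordOnly = allMatches(/w(\w{1,30})?3/gi);        // option 2
const wordOrSpace = allMatches(/w((\w| ){1,30})?3/gi); // option 3

console.log(anyChar);     // [ 'word1 word2 word3 wo rd3' ]
console.log(wordOnly);    // [ 'word3' ]  ("wo rd3" is missed: \w cannot cross the space)
console.log(wordOrSpace); // [ 'word1 word2 word3 wo rd3' ]
```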

If you want to match anything starting with a w and ending with 3, with at most one space in between, you can use:

w\w+?(\s\w+?)?3

The +? indicates the "non-greedy" match type you're looking for. However, this regex will also match word2 word3, because \w matches digits as well as letters. If a digit always marks the end of a potential match, you can allow only letters before the final 3 instead:

w[a-zA-Z]+?(\s[a-zA-Z]+?)?3

And since you've used the /i flag already, [a-zA-Z] can be further simplified to just [a-z].
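To make the difference between the two suggested patterns concrete, here is a small check against the sample text using plain JavaScript rather than mark.js (matchesOf is a hypothetical helper name used only for this sketch):

```javascript
// Sample text from the question.
const sample = "word1 word2 word3 wo rd3 word4";

// Hypothetical helper: collect the full text of every match of a global regex.
const matchesOf = (re) => [...sample.matchAll(re)].map((m) => m[0]);

// \w also matches digits, so the match can run from word2 through word3.
const withWordChars = matchesOf(/w\w+?(\s\w+?)?3/gi);

// Letters only: a digit can no longer appear before the final 3.
const lettersOnly = matchesOf(/w[a-z]+?(\s[a-z]+?)?3/gi);

console.log(withWordChars); // [ 'word2 word3', 'wo rd3' ]
console.log(lettersOnly);   // [ 'word3', 'wo rd3' ]
```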

var regex = /w[a-z]+?(\s[a-z]+?)?3/ig;
var instance = new Mark(document.querySelector("body"));
instance.markRegExp(regex, {
  className: "mark"
});

.mark {
  color: white;
  background: red;
}

<script src="https://cdn.jsdelivr.net/mark.js/8.8.3/mark.min.js"></script>

<div>word1 word2 word3 wo rd3 word4</div>
jdaz
  • To simplify things, I am looking for this RegEx: 'start_keyword (any character, but no more than 30, non-greedy) end_keyword'. The start/end keywords are usually regular words, although they might include non-alphanumeric characters too. To give you a specific example, let's say I have two sentences: 1) 'these flowers are beautiful', 2) 'these flowers are very beautiful'. In this case, I'd like to match both sentences with a single RegEx such as 'flowers (non-greedy RegEx to capture less than 30 characters) beautiful'. – Alex Aug 01 '20 at 08:33
  • In that case, how about simply `flowers.{1,30}?beautiful`? This is your first suggestion without the parentheses: once the ? directly follows the quantifier, it makes the match lazy instead of making the group optional. – jdaz Aug 01 '20 at 08:37
  • Oh, everything is clear now! I was under the impression that `.{1,30}?` doesn't act as non-greedy, but what I was missing was the fact that there were only 22 characters between the first w and the last 3 (I'm referring to my initial example now). You pointed that out to me in your initial answer, but I failed to quickly realize that was the main thing I was missing. All clear now, so thank you very much. Very helpful! – Alex Aug 01 '20 at 09:04
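For reference, the lazy, bounded pattern from the comments can be tried on the two example sentences with plain JavaScript. This is a sketch only; the third test string is made up here to show the 30-character limit. The last line is a reminder that a lazy quantifier shortens the match from the right but the regex still starts at the leftmost possible position, so on the original sample it begins at the first w.

```javascript
// Lazy, bounded gap between two keywords, as suggested in the comments.
const gap = /flowers.{1,30}?beautiful/i;

const short = "these flowers are beautiful".match(gap);
const longer = "these flowers are very beautiful".match(gap);
console.log(short[0]);  // "flowers are beautiful"
console.log(longer[0]); // "flowers are very beautiful"

// Made-up input with more than 30 characters between the keywords: no match.
const tooFar = ("flowers " + "x".repeat(40) + " beautiful").match(gap);
console.log(tooFar); // null

// Lazy matching still starts at the first w and stops at the first 3 it can reach.
const fromFirstW = "word1 word2 word3 wo rd3 word4".match(/w.{1,30}?3/i);
console.log(fromFirstW[0]); // "word1 word2 word3"
```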