Regular Expressions: Using a negative look ahead for the nonsupported negative look behind and capturing the look behind characters upon split

Question

I'm struggling again with regular expressions. I've been trying to add the use of an escape character to escape a custom tag such as <1> to <57> and </1> to </57>. With the help of Georg, here, the following expression produces the desired result prior to attempting an escape method.

('This is a <21>test</21> again.').split(/(<\/?(?:[1-9]|[1-4][0-9]|5[0-7])>)/);

generates 'This is a ', '<21>', 'test', '</21>', ' again.'

This question has one suggestion of using a negative look ahead and an OR to approximate the unsupported negative look behind. I modified that example for what I thought was my simpler problem; however, I'm stumped again.

('This is a <21>test</21> again.').split(/(?:(?!\\).|^)(<\/?(?:[1-9]|[1-4][0-9]|5[0-7])>)/);

generates 'This is a', '<21>', 'tes', '</21>', ' again.' So the character just before <21> or </21> is dropped from the output when it is not a \. I see why: the group uses ?: and is therefore non-capturing, so the character it consumes is discarded.

However, if it's removed then:

('This is a <21>test</21> again.').split(/((?!\\).|^)(<\/?(?:[1-9]|[1-4][0-9]|5[0-7])>)/);

generates 'This is a', ' ', '<21>', 'tes', 't', '</21>', ' again.' And the previous character generates a separate split.
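The results quoted above can be reproduced with the short sketch below, which also folds the separately captured previous character back into the text before it. This is only an illustration under assumptions: mergePrevious is a hypothetical helper name, and it relies on the split output arriving in groups of three (text, previous character, tag) because the pattern has exactly two capturing groups.

```javascript
const text = 'This is a <21>test</21> again.';
const escaped = 'This is a \\<21>test</21> again.';

// Pattern from the question with BOTH groups capturing:
// group 1 = the character before the tag (or '' at the start), group 2 = the tag.
const pattern = /((?!\\).|^)(<\/?(?:[1-9]|[1-4][0-9]|5[0-7])>)/;

// Hypothetical helper: glue each captured previous character back onto
// the text that precedes it, and drop empty text pieces.
function mergePrevious(parts) {
  const out = [];
  for (let i = 0; i < parts.length; i += 3) {
    const textPart = parts[i] + (parts[i + 1] || '');
    if (textPart) out.push(textPart);
    if (i + 2 < parts.length) out.push(parts[i + 2]);
  }
  return out;
}

const raw = text.split(pattern);
const merged = mergePrevious(raw);
const mergedEscaped = mergePrevious(escaped.split(pattern));

console.log(raw);
console.log(merged);
console.log(mergedEscaped);
```

One limitation of this sketch: a tag that directly follows another tag is not split, because the pattern needs an unconsumed character (or the start of the string) in front of every tag.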

Apart from this problem, the escaping works such that when the previous character is a \ the tag doesn't generate a split of the string.

Could you please let me know if there is a way to capture the previous character but include it with the text of the previous string rather than as its own split? And possibly exclude it only when it is a \?

When the string is 'This is a <21>test</21> again.', the desired result is 'This is a ', '<21>', 'test', '</21>', ' again.'

And when it is 'This is a \<21>test</21> again.', the desired result is 'This is a <21>', 'test', '</21>', ' again.'

Thank you.

Addition

After recently learning, from this MDN document, about passing an in-line function as a parameter to a replace operation with a regular expression, I started to wonder whether something similar could be done here. I don't know anything about measuring performance. However, three things motivated me to experiment with another approach: the complexity of the regular expression provided by revo below; his answer to my comment about efficiency, stating that a negative look behind would mean a significant improvement in efficiency and less work for the RegExp engine; and the fact that RegExp is something of a black-box mystery to me. The alternative is a couple more lines of code, but it produces the same result and uses a much shorter regular expression. All it really does is match the tags, both with and without an escape character, rather than trying to exclude those escaped with a \, and then ignore the escaped ones while building the array. Snippet below.

I don't know whether the times shown in the console log are indicative of performance. If they are, then in the examples I ran, the time between logging start and the result of a.split is considerably longer, as a percentage, than the time between the result of a.split and the final logging of array a under the exec approach.

Also, the innermost if block within the while statement is there to prevent a "" from being saved in the array when a tag is at the beginning or end of the string, or when there is no text between two tags.

I'd appreciate any insight you may be able to provide concerning why or why not to use one approach over the other, or introducing a better method for the case of not having access to a true negative look behind. Thank you.

let a, p = 0, r,
    x = /\\?<\/?(?:[1-9]|[1-4]\d|5[0-7])>/g,
    T = '<1>This is a <21>test<21> of \\<22>escaped and \\> </ unescaped tags.<5>';

console.log('start');

// Approach 1: split on a pattern that also consumes everything that is not a tag.
a = T.split(/((?:[^<\\]+|\\+.?|<(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>))+|<\/?(?:[1-9]|[1-4]\d|5[0-7])>)/).filter(Boolean);
console.log(a);

// Approach 2: match every tag, escaped or not, and skip the escaped ones
// while building the array.
a = [];
while ((r = x.exec(T)) !== null) {
  if (r[0].charAt(0) !== '\\') {
    // Save the text since the previous tag, unless there is none.
    if (r.index !== p) {
      a.push(T.substring(p, r.index));
    }
    a.push(r[0]);
    p = x.lastIndex; // position just past this tag
  }
}

// Save any text that follows the last unescaped tag.
if (p !== T.length) a.push(T.substring(p));
console.log(a);
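Console-log timestamps are a rough guide at best. A slightly more controlled comparison, offered only as a sketch, is to wrap each approach in a function and time many repetitions with performance.now(). The names viaSplit, viaExec and timeIt, and the run count of 10000, are assumptions made for this illustration; the figures will vary by engine and input and should not be read as a proper benchmark.

```javascript
const TAG_SPLIT = /((?:[^<\\]+|\\+.?|<(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>))+|<\/?(?:[1-9]|[1-4]\d|5[0-7])>)/;

// Approach 1: one split with the long pattern.
function viaSplit(text) {
  return text.split(TAG_SPLIT).filter(Boolean);
}

// Approach 2: exec loop with the short pattern, skipping escaped tags.
function viaExec(text) {
  const tag = /\\?<\/?(?:[1-9]|[1-4]\d|5[0-7])>/g;
  const parts = [];
  let last = 0;
  let m;
  while ((m = tag.exec(text)) !== null) {
    if (m[0].charAt(0) === '\\') continue; // escaped tag: treat as plain text
    if (m.index !== last) parts.push(text.substring(last, m.index));
    parts.push(m[0]);
    last = tag.lastIndex;
  }
  if (last !== text.length) parts.push(text.substring(last));
  return parts;
}

// Run fn repeatedly and return the elapsed milliseconds.
function timeIt(fn, text, runs = 10000) {
  const t0 = performance.now();
  for (let i = 0; i < runs; i++) fn(text);
  return performance.now() - t0;
}

const sample = '<1>This is a <21>test<21> of \\<22>escaped and \\> </ unescaped tags.<5>';
console.log('split:', timeIt(viaSplit, sample).toFixed(1), 'ms');
console.log('exec: ', timeIt(viaExec, sample).toFixed(1), 'ms');
```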

revo · Accepted Answer · 2019-04-13T19:05:08.363

You are splitting on the desired sub-strings and using a capturing group to keep them in the output. The same can be done for the undesired sub-strings: match them and enclose them in a capturing group so that they appear in the output too. The regex would be:

(undesired-part|desired-part)

The regex for undesired sub-strings should come first because desired ones can be found inside them; for example, <21> is contained in \<21>, so the latter must be matched first.

You wrote the desired part and it is known to us:

(undesired-part|<\/?(?:[1-9]|[1-4]\d|5[0-7])>)

So what about undesired? Here it is:

(?:[^<\\]+|\\.?|<(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>))+

Let's break it down:

(?: Start of non-capturing group
- [^<\\]+ Match one or more characters other than < and \
- | Or
- \\.? Match a backslash together with the character after it, if there is one
- | Or
- <(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>) Match a < that does not start a desired tag
)+ End of NCG, repeat as many times as possible and at least once

Overall it is:

((?:[^<\\]+|\\.?|<(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>))+|<\/?(?:[1-9]|[1-4]\d|5[0-7])>)

JS code:

console.log(
  'This is a \\<21>test</21> ag<ain\\.'.split(/((?:[^<\\]+|\\.?|<(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>))+|<\/?(?:[1-9]|[1-4]\d|5[0-7])>)/).filter(Boolean)
);
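For comparison, in an engine that implements ES2018 lookbehind assertions, the escape check can be written directly as a negative lookbehind. This is a sketch under that assumption only and will throw a SyntaxError in engines without lookbehind support; the variable names are illustrative.

```javascript
// Split on a tag only when it is NOT immediately preceded by a backslash.
const withLookbehind = /((?<!\\)<\/?(?:[1-9]|[1-4]\d|5[0-7])>)/;

const unescapedResult = 'This is a <21>test</21> again.'
  .split(withLookbehind)
  .filter(Boolean);

const escapedResult = 'This is a \\<21>test</21> again.'
  .split(withLookbehind)
  .filter(Boolean);

console.log(unescapedResult);
console.log(escapedResult);
```

Here the lookbehind consumes nothing, so the character before a tag stays with the preceding text, and an escaped tag such as \<21> remains inside its surrounding text.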

edited Apr 13 '19 at 19:05

answered Apr 13 '19 at 07:17

revo

Thank you for taking the time to figure this out and providing an explanation. I understand in part but not fully. The undesired part is every possibility other than tags 1 to 57. It could be no tag or escape, an escape, or a tag other than 1 to 57. The undesired is like the universe less the desired part; and since connected by an OR it appears that everything should be a match. I can't see the negative or NOT portion except the NCG. Nonetheless, I added to my code and it appears to work perfectly. I could not find an expression that caused it to provide undesired results. Thank you. – Gary Apr 13 '19 at 18:42
Hypothetically, if there were a negative look-back supported, would it be more efficient? Or, perhaps, I should ask does having everything match and not capturing the largest portion result in more work for the computer? I'm not implying that there is a better way than that which you provided, but am only trying to understand. I don't have a clue how the regular expressions are processed when searching through a string; so, as far as I know, this method could be just as or more efficient than a negative look-back. Thank you. – Gary Apr 13 '19 at 18:50
@Gary The undesired part doesn't relate to tags only. It consumes everything that isn't desired. And yes, a lookbehind would be much more efficient: it means less work for the regex engine to find a match or reject an attempt. ECMAScript 2018 supports lookbehinds and Chrome is the only browser that implements the standard. So you are able to use JS code in Chrome that invokes a lookbehind. I made a small modification to my answer. Please check. – revo Apr 13 '19 at 19:05
Thank you for the further explanation and informing of the modification to your answer. – Gary Apr 13 '19 at 19:10
I just noticed something. A code of <212> generates a split when it shouldn't. I think it is the \d before the |. – Gary Apr 13 '19 at 19:14
What do you get and what do you expect? – revo Apr 13 '19 at 19:16
Sorry, it has nothing to do with the \d. A code of <212> generates a split, such as 'This is a ', '<212>', ' test.'. I thought the [1-4]\d, etcetera would limit it to 1 to 57. I must be misunderstanding something. So, output would be 'This is a <212>test.'. – Gary Apr 13 '19 at 19:21
Please disregard that. I'm an idiot. I had the <212>< wrapped inside acceptable tags and it appeared as text but I thought it was a break. I apologize. It is still working just fine. – Gary Apr 13 '19 at 19:26
@Gary No problem. – revo Apr 13 '19 at 19:29
I did find the need for one modification although rather unimportant (because I don't know why anyone would ever type text in this manner) but perhaps necessary for completeness. If a tag is preceded by two escapes, such as '\\<21>', both are treated as regular text and the <21> generates a split in the string. Placing a '+' in the '|\\.?|' to make it '|\\+.?|' seems to have fixed it without messing up the rest of it. Thanks. – Gary Apr 14 '19 at 04:44
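The effect of that last change can be checked with a small sketch. The input below contains two literal backslashes before <21> (written as four in JavaScript source); the variable names are assumptions made for this illustration.

```javascript
// 'a ', two backslashes, '<21>', 'b'
const doubled = 'a \\\\<21>b';

// Pattern from the answer: \\.? pairs the first backslash with the second,
// leaving <21> free to be matched as a tag.
const original = /((?:[^<\\]+|\\.?|<(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>))+|<\/?(?:[1-9]|[1-4]\d|5[0-7])>)/;

// Pattern with the extra +: \\+.? takes the whole run of backslashes
// and then the < that follows it.
const modified = /((?:[^<\\]+|\\+.?|<(?!\/?(?:[1-9]|[1-4]\d|5[0-7])>))+|<\/?(?:[1-9]|[1-4]\d|5[0-7])>)/;

const before = doubled.split(original).filter(Boolean);
const after = doubled.split(modified).filter(Boolean);

console.log(before); // the tag is split out
console.log(after);  // the whole string stays in one piece
```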
