4

Hello, I am trying to work out a regular expression that replaces text in an innerHTML block to apply local text formatting, similar to how Google IM does it.

Where: 
_Italics_
!Underline!
*Bold*
-Strike-

Part of the conditions is that the text must be wrapped by the symbol, but if a space follows immediately after the opening symbol then the trigger condition is voided; so `* bold*` would not be bolded, and in `* notbold*but this is bold*` only the text after the second asterisk would be bold.

The innerHTML will contain URLs which have already been converted to anchor elements, so in order not to interfere with them I have added the following to the front of my regex.

    (?!(?!.*?<a)[^<]*<\/a>)

The following JavaScript does not catch every case, and the results vary depending on the order in which I run the replacements.

    var boldPattern          = /(?!(?!.*?<a)[^<]*<\/a>)\*([^\s]+[\s\S]?[^\s]+)\*([\s_!-]?)/gi;
    var italicsPattern       = /(?!(?!.*?<a)[^<]*<\/a>)_([^\s]+[\s\S]?[^\s]+)_([\s-!\*]?)/gi;
    var strikethroughPattern = /(?!(?!.*?<a)[^<]*<\/a>)-([^\s]+[\s\S]?[^\s]+)-([\s_!\*]?)/gi;
    var underlinePattern     = /(?!(?!.*?<a)[^<]*<\/a>)!([^\s]+[\s\S]?[^\s]+)!([\s-_\*]?)/gi;
    str = str.replace(strikethroughPattern, '<span style="text-decoration:line-through;">$1</span>$2');
    str = str.replace(boldPattern, '<span style="font-weight:bold;">$1</span>$2');
    str = str.replace(underlinePattern, '<span style="text-decoration:underline;">$1</span>$2');
    str = str.replace(italicsPattern, '<span style="font-style:italic;">$1</span>$2');

The test data, covering all 24 ordered arrangements of 3 of the 4 styles, looks like:

    1 _-*ISB*-_ 2 _-!ISU!-_ 3 _*-IBS-*_ 4 _*!IBU!*_
    5 _!-IUS-!_ 6 _!*IUB*!_ 7 -_*SIB*_- 8 -_!SIU!_-
    9 -*_SBI_*- 10 -*!SBU!*- 11 -!_SUI_!- 12 -!*SUB*!-
    13 *_-BIS-_* 14 *_!BIU!_* 15 *-_BSI_-* 16 *-!BSU!-*
    17 *!_BUI_!* 18 *!-BUS-!* 19 !_-UIS-_! 20 !_*UIB*_!
    21 !-_USI_-! 22 !-*USB*-! 23 !*_UBI_*! 24 !*-UBS-*!
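A list like this is easy to get wrong by hand, so here is a minimal sketch that generates it. The names `styles`, `permutations` and `buildCase` are made up for this illustration and are not part of my script.

```javascript
// Delimiter characters and the letters used to label them in the test data.
const styles = [
  { mark: "_", label: "I" }, // italics
  { mark: "-", label: "S" }, // strikethrough
  { mark: "*", label: "B" }, // bold
  { mark: "!", label: "U" }, // underline
];

// Return every ordered selection of `k` distinct items from `items`.
function permutations(items, k) {
  if (k === 0) return [[]];
  const result = [];
  items.forEach((item, i) => {
    const rest = items.slice(0, i).concat(items.slice(i + 1));
    for (const tail of permutations(rest, k - 1)) {
      result.push([item, ...tail]);
    }
  });
  return result;
}

// Wrap the concatenated labels in the delimiters, outermost delimiter first.
function buildCase(combo) {
  const open = combo.map((s) => s.mark).join("");
  const close = combo.map((s) => s.mark).reverse().join("");
  const label = combo.map((s) => s.label).join("");
  return open + label + close;
}

const testCases = permutations(styles, 3).map(buildCase);
console.log(testCases.map((c, i) => (i + 1) + " " + c).join("\n"));
```

Calling `permutations(styles, 4)` instead should give the 24 four-level cases in the same way.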

Can you even have a 4-level-deep nested style span, like any of the 24 permutations where all 4 modes are selected, such as:

    -!_*SUIB*_!-

Thanks I've been fighting this for about a week.

Bonus points for avoiding bad feedback from Mozilla for "Markup should not be passed to innerHTML dynamically." (I don't see how that might be possible when one is changing the formatting).
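One way the innerHTML warning might be avoided is to build the formatted nodes through the DOM API instead of assigning a markup string. The sketch below handles bold only and is an assumption about how that could look, not something I have run in the extension; `splitBold` and `applyBold` are hypothetical names, and `applyBold` needs a browser `document`.

```javascript
// Split plain text into segments, marking the parts wrapped in *...*.
// Opening asterisks followed by whitespace are not treated as delimiters.
function splitBold(text) {
  const pattern = /\*([^\s][^*]*)\*/g;
  const segments = [];
  let last = 0;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    if (match.index > last) {
      segments.push({ text: text.slice(last, match.index), bold: false });
    }
    segments.push({ text: match[1], bold: true });
    last = pattern.lastIndex;
  }
  if (last < text.length) {
    segments.push({ text: text.slice(last), bold: false });
  }
  return segments;
}

// Browser-only: replace each text node under `root` (except link text)
// with text nodes and <b> elements created through the DOM API.
function applyBold(root) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const textNodes = [];
  while (walker.nextNode()) textNodes.push(walker.currentNode);
  for (const node of textNodes) {
    // Leave anything inside an anchor element untouched.
    if (node.parentElement && node.parentElement.closest("a")) continue;
    const segments = splitBold(node.nodeValue);
    if (!segments.some((s) => s.bold)) continue;
    const fragment = document.createDocumentFragment();
    for (const segment of segments) {
      if (segment.bold) {
        const b = document.createElement("b");
        b.textContent = segment.text;
        fragment.appendChild(b);
      } else {
        fragment.appendChild(document.createTextNode(segment.text));
      }
    }
    node.parentNode.replaceChild(fragment, node);
  }
}
```

Because the links are skipped by checking the parent element, this variant would not need the anchor lookahead in the regex at all.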

Thanks a million regex wizards! I am in your debt.

mwolfe.

Update

Using the same href detection as above and @talemyn's help, we are now at:

    var boldPattern          = /(?!(?!.*?<a)[^<]*<\/a>)\*([^\s][^\*]*)\*/gi;
    var italicsPattern       = /(?!(?!.*?<a)[^<]*<\/a>)_([^\s][^_]*)_/gi;
    var strikethroughPattern = /(?!(?!.*?<a)[^<]*<\/a>)-([^\s][^-]*)-/gi;
    var underlinePattern     = /(?!(?!.*?<a)[^<]*<\/a>)!([^\s][^!]*)!/gi;
    str = str.replace(strikethroughPattern, '<s>$1</s>');
    str = str.replace(italicsPattern, '<span style="font-style:italic;">$1</span>');
    str = str.replace(boldPattern, '<strong>$1</strong>');
    str = str.replace(underlinePattern, '<u>$1</u>');

Which seems to cover an extreme example:

    _wow *a real* !nice *person! on -stackoverflow* figured- it out_ cool beans.

I think one could use the style spans and do a regex lookbehind to find the previous unclosed span, close it, open a new span with the old format plus the new attribute, close it where required, and open another span to finish the formatting. But that could get messy, or be impossible to do with regular expressions, as @NovaDenizen points out.
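The close-and-reopen idea can be sketched without regular expressions by scanning the text once and keeping a stack of open styles. This is only an illustration under some assumptions: the input is plain text with no markup or links in it, every `-` or `!` followed by a non-space counts as a possible delimiter, and the names `TAGS` and `formatOverlapping` are made up here.

```javascript
// Delimiter character -> tag name used for that style.
const TAGS = new Map([["*", "b"], ["_", "i"], ["-", "s"], ["!", "u"]]);

function formatOverlapping(text) {
  const open = []; // stack of delimiter characters whose tags are currently open
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (!TAGS.has(ch)) {
      out += ch;
      continue;
    }
    const depth = open.indexOf(ch);
    if (depth !== -1) {
      // Closing delimiter: close everything opened after this style,
      // close the style itself, then reopen the styles that were cut short.
      const removed = open.splice(depth);
      out += removed.slice().reverse().map((c) => "</" + TAGS.get(c) + ">").join("");
      for (const c of removed.slice(1)) {
        out += "<" + TAGS.get(c) + ">";
        open.push(c);
      }
      continue;
    }
    const next = text[i + 1];
    const canOpen =
      next !== undefined &&
      next !== ch &&
      !/\s/.test(next) &&                 // a space right after the symbol voids it
      text.indexOf(ch, i + 1) !== -1;     // there must be a later closing symbol
    if (canOpen) {
      out += "<" + TAGS.get(ch) + ">";
      open.push(ch);
    } else {
      out += ch; // not a delimiter, keep the character as typed
    }
  }
  return out;
}

console.log(formatOverlapping(
  "_wow *a real* !nice *person! on -stackoverflow* figured- it out_ cool beans."
));
```

With the overlapping test phrase, each style that is still open when another one closes gets closed and reopened, so the tags in the output stay properly nested.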

Thank you for all your help. If there are any improvements please let me know. NB: I was unable to use and as the CSS on the site would not render it. Can that be overloaded? [This is for a firefox/greasemonkey/chrome plugin]

UPDATE (almost) FINAL

Using my 'broken' test phrase (broken, as @MikeM correctly stated) as an example, it would render correctly (minus the underline) in Google IM whether it was nested properly or not. Looking at the HTML output from the text in Google IM, I noticed that it happily did not preformat the string but simply substituted the plain formatting tags as required.

So after looking at the site code, which was using a reset stylesheet to remove the default styling of those tags, I needed to insert the CSS formatting via JavaScript. Stack Overflow to the rescue: https://stackoverflow.com/questions/707565/how-do-you-add-css-with-javascript and https://stackoverflow.com/questions/20107/yui-reset-css-makes-strongemthis-not-work-em-strong

So my solution now looks like:

    ....
    var css = document.createElement("style");
    css.type = "text/css";
    css.innerHTML = "strong, b, strong *, b * { font-weight: bold !important; } \
                em, i, em *, i * { font-style: italic !important; }";
    document.body.appendChild(css);
    ....
    var boldPattern          = /(?!(?!.*?<a)[^<]*<\/a>)\*([^\s][^\*]*)\*/gi;
    var italicsPattern       = /(?!(?!.*?<a)[^<]*<\/a>)_([^\s][^_]*)_/gi;
    var strikethroughPattern = /(?!(?!.*?<a)[^<]*<\/a>)-([^\s][^-]*)-/gi;
    var underlinePattern     = /(?!(?!.*?<a)[^<]*<\/a>)!([^\s][^!]*)!/gi;
    str = str.replace(strikethroughPattern, '<s>$1</s>');
    str = str.replace(italicsPattern, '<i>$1</i>');
    str = str.replace(boldPattern, '<b>$1</b>');
    str = str.replace(underlinePattern, '<u>$1</u>');
    ....

And tada it mostly works!

UPDATE FINAL SOLUTION

After a last-minute simplification of the anchor element check from @MikeM, and combining the conditions from another Stack Overflow post, we have arrived at a complete working solution.

I also needed to add in a check for a one char style with closing symbol, since we were replacing trigger tokens side by side.

As @acheong87 reminded me, be careful with \w because it includes the _ (so \W does not match it); that is why _ was added to the wrapping conditionals for all but the italicsPattern.

    var boldPattern          = /(?![^<]*<\/a>)(^|<.>|[\s\W_])\*(\S.*?\S)\*($|<\/.>|[\s\W_])/g;
    var italicsPattern       = /(?![^<]*<\/a>)(^|<.>|[\s\W])_(\S.*?\S)_($|<\/.>|[\s\W])/g;
    var strikethroughPattern = /(?![^<]*<\/a>)(^|<.>|[\s\W_])-(\S.*?\S)-($|<\/.>|[\s\W_])/gi;
    var underlinePattern     = /(?![^<]*<\/a>)(^|<.>|[\s\W_])!(\S.*?\S)!($|<\/.>|[\s\W_])/gi;
    str = str.replace(strikethroughPattern, '$1<s>$2</s>$3');
    str = str.replace(italicsPattern, '$1<i>$2</i>$3');
    str = str.replace(boldPattern, '$1<b>$2</b>$3');
    str = str.replace(underlinePattern, '$1<u>$2</u>$3');
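To make the behaviour of these patterns easier to check, here is a small sketch that wraps the four replacements in a function and runs a few sample strings through it. The function name `applyFormatting` and the sample strings are made up for this illustration; the patterns are the ones above.

```javascript
var boldPattern          = /(?![^<]*<\/a>)(^|<.>|[\s\W_])\*(\S.*?\S)\*($|<\/.>|[\s\W_])/g;
var italicsPattern       = /(?![^<]*<\/a>)(^|<.>|[\s\W])_(\S.*?\S)_($|<\/.>|[\s\W])/g;
var strikethroughPattern = /(?![^<]*<\/a>)(^|<.>|[\s\W_])-(\S.*?\S)-($|<\/.>|[\s\W_])/gi;
var underlinePattern     = /(?![^<]*<\/a>)(^|<.>|[\s\W_])!(\S.*?\S)!($|<\/.>|[\s\W_])/gi;

// Apply the replacements in the same order as in the post.
function applyFormatting(str) {
  str = str.replace(strikethroughPattern, '$1<s>$2</s>$3');
  str = str.replace(italicsPattern, '$1<i>$2</i>$3');
  str = str.replace(boldPattern, '$1<b>$2</b>$3');
  str = str.replace(underlinePattern, '$1<u>$2</u>$3');
  return str;
}

var samples = [
  'say *hello there* now',          // plain bold
  '* bold* text',                   // space after the opening symbol: left alone
  'snake_case_name',                // underscores inside a word: left alone
  '_*both*_',                       // nested italics and bold
  '<a href="#"> _in_ </a> _out_'    // text inside a link: left alone
];
samples.forEach(function (s) {
  console.log(s + '  ->  ' + applyFormatting(s));
});
```

The `<.>` and `<\/.>` alternatives are what let an inner style match directly after an outer one has already been turned into a tag, as in the `_*both*_` sample.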

Thank you so much everyone (@MikeM, @talemyn, @acheong87, et al.)

mwolfe.

Mike Wolfe
  • [This famous answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) seems relevant. – NovaDenizen Mar 07 '13 at 18:45
  • 1
    Thanks @NovaDenizen, times like this you do need humor. I don't have a high ranking so unfortunately I can't vote up @talemyn's comment but they were MOST helpful to me. The nested span's just are not working, however by using `str = str.replace(strikethroughPattern, '$1');` `str = str.replace(italicsPattern, '$1');` `str = str.replace(boldPattern, '$1');` `str = str.replace(underlinePattern, '$1');` I was able to get it to work. Ugly though. – Mike Wolfe Mar 07 '13 at 19:35

3 Answers

2

Try these:

    var boldPattern          = /\*([^\s][^\*]*)\*/gi;
    var italicsPattern       = /_([^\s][^_]*)_/gi;
    var strikethroughPattern = /-([^\s][^-]*)-/gi;
    var underlinePattern     = /!([^\s][^!]*)!/gi;

Though, in the replacement strings, don't use `$2`, as there is no second capture group in those regex patterns.
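As a quick illustration of why the `$2` matters: as far as I know, JavaScript leaves a `$2` with no matching group in the output as literal text rather than dropping it. The sample sentence and variable names below are made up for the example.

```javascript
var boldPattern = /\*([^\s][^\*]*)\*/gi;
var sample = "make it *bold* now";

// Replacement that still refers to a second group that the pattern does not have.
var withExtraGroup = sample.replace(boldPattern, "<b>$1</b>$2");

// Replacement that only uses the single capture group.
var withoutExtraGroup = sample.replace(boldPattern, "<b>$1</b>");

console.log(withExtraGroup);
console.log(withoutExtraGroup);
```

The first result should keep a stray `$2` after the closing tag, while the second is clean.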

talemyn
  • 1
    I'll try that right now. My backup without looking for the href was: var boldPattern = /\*([^\s]+[\s\S]?[^\s]+)\*/gi; var italicsPattern = /_([^\s]+[\s\S]?[^\s]+)_/gi var strikethroughPattern = /-([^\s]+[\s\S]?[^\s]+)-/gi; var underlinePattern = /!([^\s]+[\s\S]?[^\s]+)!/gi; It also didn't work. – Mike Wolfe Mar 07 '13 at 18:29
  • Oh yeah . . . I missed the "href" part . . . that could potentially cause a problem, but start with those and get those patterns working first. Once you get that part working, you can build off of it to add functionality to catch the "href situation". Always start small and build from there. :) – talemyn Mar 07 '13 at 18:33
  • 1
    Wow - that is almost there (I did remove the `$2`s). Using the following phrase didn't quite render well: `_wow *a real* !nice *person! on -stackoverflow* figured- it out_ cool beans.` It looked like _wow **a real** (underline)nice **person** on (strike)stackoverflow(endstrike) figured(endunderline) it out_ cool beans. – Mike Wolfe Mar 07 '13 at 18:39
  • 1
    Yeah, this would screw it up: `!nice *person! on -stackoverflow* figured-` because you would end up with invalid HTML (essentially, `nice person on stackoverflow figured`). Before you could address that, you'd have to figure out what behavior that you would want for a situation like that. Then you could update the patterns (or JS logic) accordingly. – talemyn Mar 07 '13 at 18:45
  • 1
    Oh boy. I verified that your code is correct using the Chrome inspector on the element's contents, but like you said, each `</span>` closes the nearest open `<span>` instead of nesting. Am I trying to do something impossible? I have no idea where to begin with updating the JS logic or patterns. I am also forced to use style spans, as the site's CSS is not enabling the older em/b/u etc. – Mike Wolfe Mar 07 '13 at 18:59
  • 1
    This does render correctly: `wow a real nice person on stackoverflow figured it out cool beans.` so I guess the next step would be to find some solution to convert old formatting to `span's` – Mike Wolfe Mar 07 '13 at 19:16
  • 2
    If that works, it's because your browser is being kind. :D That is totally invalid HTML. – talemyn Mar 07 '13 at 19:25
  • 1
    I wish I could rank up your post @talemyn because it was really helpful. I ended up using , , and ... crazy but it actually rendered properly in Chrome. I have no idea on how I could nest spans. The proper way would be to had some sort of lookback regex to find out if there is an open span, find out what `style="XYZ"` is set, close the `` and open a new `` and at the next instance of I would add append `` and close as before. That is beyond me but I suspect it is possible. – Mike Wolfe Mar 07 '13 at 19:41
1

The following shouldn't create incorrectly nested spans:

    var old;
    var rx = /(?![^<]*(?:>|<\/a>))([!*_-])((?!\1)[^<>\s][^<>]*?)\1/g;

    while ( old != str ) {
        old = str;
        str = str.replace( rx, function ( $0, $1, $2 ) {
            var style = $1 == '!' ? "text-decoration:underline"
                      : $1 == '*' ? "font-weight:bold"
                      : $1 == '_' ? "font-style:italic"
                                  : "text-decoration:line-through";

            return  '<span style="' + style + ';">' + $2 + '</span>';
        } );
    }

Because it replaces the outer delimiters first, there should never be any spans inserted inside delimiters.

Further explanation on request.
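For anyone who wants to try the loop on a sample, here is the same code wrapped in a function. The wrapper name `nestSpans` and the sample inputs are made up for this illustration; the regex and the replacement logic are unchanged.

```javascript
function nestSpans(str) {
  var old;
  var rx = /(?![^<]*(?:>|<\/a>))([!*_-])((?!\1)[^<>\s][^<>]*?)\1/g;

  // Repeat until a pass makes no further change: each pass converts the
  // outermost remaining delimiter pair, so inner pairs end up in inner spans.
  while (old != str) {
    old = str;
    str = str.replace(rx, function ($0, $1, $2) {
      var style = $1 == '!' ? "text-decoration:underline"
                : $1 == '*' ? "font-weight:bold"
                : $1 == '_' ? "font-style:italic"
                            : "text-decoration:line-through";

      return '<span style="' + style + ';">' + $2 + '</span>';
    });
  }
  return str;
}

console.log(nestSpans('_-*ISB*-_'));
console.log(nestSpans('<a href="x">*no*</a> *yes*'));
```

The first sample takes three passes, one per nesting level, and the hyphens inside the generated `style` attributes are skipped because the lookahead rejects any position that is still inside a tag.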

MikeM
  • 1
    That seems to have a problem with the inner bold on the bolded stackoverflow part. `_wow *a real* !nice *person! on -stackoverflow* figured- it out_ cool beans.` – Mike Wolfe Mar 07 '13 at 22:17
  • 1
    @MikeWolfe. The "stackoverflow" part isn't bolded as its opening `*` is a singleton within a `!...!` and therefore not considered to be a delimiter. If you allow that you will produce invalid html no matter what the method. – MikeM Mar 07 '13 at 22:39
  • 1
    @Mike Wolfe. If you wanted "person on stackoverflow" bolded, you would use `_wow *a real* !nice *person*! *on -stackoverflow*`. HTML elements must be properly nested! – MikeM Mar 07 '13 at 22:56
  • 1
    Thanks - you are correct @MikeM. If I properly nest the phrase then your code will render correctly. The sample string is actually an invalid starting point and should be modified to `_wow *a real* !nice *person*! *on -stackoverflow-* -figured- it out_ cool beans.` Trying the original phrase out in google IM renders as expected (minus underline). They are not using the style span, just the old fashion etc, which perhaps is easier. Too bad I can't overload the CSS styling to enable the traditional formatting. – Mike Wolfe Mar 07 '13 at 23:32
1

I recommend that you remove the inner negative look-aheads from your negative look-aheads:

    /(?!(?!.*?<a)[^<]*<\/a>)_it_/.test( ' _it_ <a></a>' );         // true  (correct)
    /(?!(?!.*?<a)[^<]*<\/a>)_it_/.test( '<a> _it_ </a>' );         // false (correct)
    /(?!(?!.*?<a)[^<]*<\/a>)_it_/.test( '<a> _it_ </a> <a></a>' ); // true  (wrong)

    /(?![^<]*<\/a>)_it_/.test( ' _it_ <a></a>' );                  // true  (correct)
    /(?![^<]*<\/a>)_it_/.test( '<a> _it_ </a>' );                  // false (correct)
    /(?![^<]*<\/a>)_it_/.test( '<a> _it_ </a> <a></a>' );          // false (correct)

MikeM
  • 1
    Thank you MikeM. I had been using your earlier solution of: `(?![^<]*(?:>|<\/a>))` as it was easier to read. Your new solution looks even better. Thank you a million. – Mike Wolfe Mar 08 '13 at 16:59