BBCode regular expression parsing problem

Question

So I have some JavaScript that converts BBCode to HTML. It mostly works well, but I have a problem.

Here is one of the working expressions, which I use to convert the BB tags [b] and [/b] to <b> and </b>.

str = str.replace(/\[b\]((\s|\S)*?)\[\/b\]/ig, '<b>$1</b>');

This also converts consecutive tags. For example

[b]str1[/b] [b]str2[/b]

becomes

<b>str1</b> <b>str2</b>

Which is good; that's what I want it to do. However, when I try to match quote tags like so

str = str.replace(/\[quote\]((\s|\S)*?)\[\/quote\]/ig, '<span class="quotebox">$1</span>');

where str is

[quote]Nest level 1[quote]Nest level 2[/quote][/quote]

only the first tag is matched and converted, so I'll end up getting output looking like

Nest level 1 [quote]Nest level 2

[/quote]

With the last quote tag outside of the quote box - it should be nested within the other one. Help?

Also, if it's relevant, the quotebox class is as follows

.quotebox{
border:1px inset black;
display:block;
margin-bottom:5px;
margin-top:5px;
padding:2px 2px 2px 4px;
}

Odd question; why are you doing the conversion with Javascript (client-side), and when are you doing that conversion? — Andrew Barber, Dec 19 '10 at 02:01
This is why regular expressions *cannot* be used to parse irregular languages. — Ignacio Vazquez-Abrams, Dec 19 '10 at 02:05
I'm using it as a kind of "post previewer" for another site through a Greasemonkey script. It works similarly to how stackoverflow functions when you ask a question. — sam, Dec 19 '10 at 02:06
@Ignacio It's not all bad. The way I've got it to work so far is just to test whether the match exists then do the replacement. I'm just wondering if there's a way to do it with a single expression and the global modifier (which is how I thought it worked). — sam, Dec 19 '10 at 02:09

score 1 · Answer 1 · edited May 23 '17 at 12:26

You've just been bitten by the fact that (real) regular expressions can only describe regular languages. The salient feature regular expressions cannot describe is recursion. The canonical example of this is the Dyck language, the language which consists of all strings of balanced parentheses, such as (), (())()((())), ((((())))), etc. This is non-regular, and is essentially the problem you're trying to solve: matching appropriately-nested [b][/b]s, [quote][/quote]s, and the like. In other words, it's literally impossible to do what you want with a regular expression.

However, you may have noticed that I said "real". The regexes provided in languages like JavaScript aren't true regular expressions; they have extra power, mostly (entirely?) stemming from backreferences. The regex (.*)\1, for instance, describes a non-regular language. Even given this, though, I don't think you can match the Dyck language.¹
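To make the failure concrete, here is a small sketch that runs the expression from the question on the nested input, followed by the backreference example. The variable names are only illustrative, and the `^`/`$` anchors on the second regex are my addition so that it tests the whole string.

```javascript
// The quote regex from the question, applied once with the global flag.
const quoteRe = /\[quote\]((\s|\S)*?)\[\/quote\]/ig;
const nested = '[quote]Nest level 1[quote]Nest level 2[/quote][/quote]';

// The lazy quantifier stops at the FIRST [/quote], so the outer opening
// tag gets paired with the inner closing tag.
const onePass = nested.replace(quoteRe, '<span class="quotebox">$1</span>');
console.log(onePass);
// <span class="quotebox">Nest level 1[quote]Nest level 2</span>[/quote]

// A backreference matches "some string repeated twice", which is a
// non-regular language. Anchors added here so the whole string is tested.
const doubled = /^(.*)\1$/;
console.log(doubled.test('abab')); // true
console.log(doubled.test('aba'));  // false
```

The global flag only makes the engine continue scanning after each match; it never goes back to re-examine text it has already consumed, which is why the leftover [/quote] stays unconverted.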

So, then, what's the solution? Find a pre-existing BBCode to HTML converter written in JavaScript! This is definitely going to make your life the simplest. I don't know of one off the top of my head, unfortunately, since I don't do much JavaScript programming. This StackOverflow question indicates that such a thing might not exist, in which case your only option is to roll your own parser. More complicated, of course, but certainly doable.

Off the top of my head (I am not an expert), you'd probably want to scan through the string until you find a tag. (Recognizing a tag may well be a good task for a regular expression.) If it's an opening tag, push that on a stack. If it's a closing tag, pop the stack, make sure that the closing tag matches the opening tag, and wrap the string you've seen so far in the appropriate HTML. This might not work, or it might be too complicated; it's just my 2¢ after thinking about the problem quickly.
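The stack idea described above could look roughly like the following. This is a minimal sketch rather than a tested converter: the `TAGS` table and the `bbcodeToHtml` function are hypothetical names, only [b] and [quote] are handled, unmatched tags are simply left as literal text, and no HTML escaping is done on the input.

```javascript
// Hypothetical tag table: BBCode tag name -> [opening HTML, closing HTML].
const TAGS = {
  b: ['<b>', '</b>'],
  quote: ['<span class="quotebox">', '</span>'],
};

function bbcodeToHtml(input) {
  // A regex is used only to recognize a single tag, not to pair tags up.
  const tagPattern = /\[(\/?)(b|quote)\]/gi;
  // Each frame collects the HTML seen since its opening tag.
  // The root frame (name: null) collects the final output.
  const stack = [{ name: null, raw: '', html: '' }];
  let last = 0;
  let m;

  while ((m = tagPattern.exec(input)) !== null) {
    // Text between the previous tag and this one belongs to the current frame.
    const top = stack[stack.length - 1];
    top.html += input.slice(last, m.index);
    last = tagPattern.lastIndex;

    const closing = m[1] === '/';
    const name = m[2].toLowerCase();

    if (!closing) {
      // Opening tag: push a new frame, remembering the raw tag text.
      stack.push({ name, raw: m[0], html: '' });
    } else if (top.name === name) {
      // Matching closing tag: pop and wrap the collected content.
      stack.pop();
      const [open, close] = TAGS[name];
      stack[stack.length - 1].html += open + top.html + close;
    } else {
      // Closing tag that does not match the open one: keep it as text.
      top.html += m[0];
    }
  }

  // Trailing text after the last tag.
  stack[stack.length - 1].html += input.slice(last);

  // Any tags left open are folded back into their parent as literal text.
  while (stack.length > 1) {
    const frame = stack.pop();
    stack[stack.length - 1].html += frame.raw + frame.html;
  }
  return stack[0].html;
}

console.log(bbcodeToHtml('[quote]Nest level 1[quote]Nest level 2[/quote][/quote]'));
// <span class="quotebox">Nest level 1<span class="quotebox">Nest level 2</span></span>
```

Keeping one frame per open tag is what gives the nesting: the inner quote is fully wrapped before its HTML is appended to the outer quote's frame. For a post previewer you would probably also want to escape `<`, `>` and `&` in the input before converting.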

1: I'm not 100% sure, but the only example of a regex matching balanced parentheses I've ever seen was in Perl, and it embedded Perl code, which JavaScript can't do. Either way, it's inadvisable: you're trying to use a tool which will make your task much more complicated.
