Confused with Regex JS pattern

Question

OK, I have the following data in my div:

<div id="mydiv">
<!--
 what is your present
 <code>alert("this is my present");</code>
 where?
 <code>alert("here at my left hand");</code>
 oh thank you! i love you!! hehe
  <code>alert("welcome my honey ^^");</code>
-->
</div>

What I need to do is get all the scripts inside the <code> blocks, plus the HTML text nodes between them, without removing the HTML comment around them. It's homework given by my professor, and I can't modify that div block.

I need to use regular expressions for this, and this is what I did:

var block = $.trim($("div#mydiv").html()).replace("<!--","").replace("-->","");
var htmlRegex = new RegExp(""); //I don't know what to do here
var codeRegex = new RegExp("^<code(*n)</code>$","igm");

var code = codeRegex.exec(block);
var html = "";

It really doesn't work. Please don't give the exact answer; please teach me. Thank you.

I need to have the following blocks in the variable code:

alert("this is my present");
alert("here at my left hand");
alert("welcome my honey ^^");

And these are the blocks I need in the variable html:

 what is your present
     where?
     oh thank you! i love you!! hehe

My question is: what regex pattern gets the results above?

Looks like you have an extra parenthesis in your code regex. `...$)"...` — sachleen, Jul 07 '12 at 15:59
Tell your teacher we told you to please refrain from parsing HTML with RegEx as it will [drive you į̷̷͚̤̤̖̱̦͍͗̒̈̅̄̎n̨͖͓̹͍͎͔͈̝̲͐ͪ͛̃̄͛ṣ̷̵̞̦ͤ̅̉̋ͪ͑͛ͥ͜a̷̘͖̮͔͎͛̇̏̒͆̆͘n͇͔̤̼͙̩͖̭ͤ͋̉͌͟eͥ͒͆ͧͨ̽͞҉̹͍̳̻͢](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Use an [HTML parser](http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php) instead. — Madara's Ghost, Jul 07 '12 at 16:34
I still don't understand what you're asking. You say don't give the exact answer, but then ask for a regex pattern to solve your problem. What is the thing that confuses you? — Brad Koch, Jul 07 '12 at 16:43
Wait, what is *n? That doesn't look right to me. @Truth Parsing HTML with regEx is not at all an uncommon problem in JS. — Erik Reppen, Jul 07 '12 at 17:12
I'm glad you got your answer @Mahan! Be sure to check out the [homework question guidelines](http://meta.stackexchange.com/a/10812/180500) for next time. FWIW, the assignment really isn't a good application of regex in the first place. — Brad Koch, Jul 07 '12 at 17:37
well my professor loves me to give this kind of headaches haha thank you ^^ — Netorica, Jul 07 '12 at 17:49

score 5 · Answer 1 · answered Jul 07 '12 at 17:35

Parsing HTML with a regular expression is not something you should do.

I'm sure your professor thinks he/she was really clever, that there's no way to access the DOM API, and can wave a banner around to justify some minor corner case where using regex to parse the DOM is sometimes okay.

Well, no, it isn't. If you have complex code in there, what happens? Your regex breaks, and perhaps becomes a security exploit if this is ever in production.

So, here:

http://jsfiddle.net/zfp6D/

1. Walk the DOM and get the text value out of the nodeType 8 (comment) node.
2. Invoke the HTML parser (the thing browsers use to parse HTML, rather than regex; why you wouldn't use the HTML parser to parse HTML is totally beyond me. It's like saying "Yeah, I could nail in this nail with a hammer, but I think I'm going to just stomp on the nail with my foot until it goes in").
3. Find all the CODE elements in the newly parsed HTML.
4. Log them to the console, or do whatever you want with them.
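The steps above might be sketched roughly as follows. The fiddle itself is not reproduced here, so this is an assumption about its general shape, not a copy of it. The helper names `findCommentText` and `codeTextsFromHtml` are hypothetical; the sketch relies only on standard DOM features (`childNodes`, `nodeType`, `nodeValue`, `innerHTML`, `getElementsByTagName`, `textContent`).

```javascript
// Hypothetical helper: return the text of the first comment node
// (nodeType 8) that is a direct child of `container`, or null if none.
function findCommentText(container) {
    var nodes = container.childNodes;
    for (var i = 0; i < nodes.length; i++) {
        if (nodes[i].nodeType === 8) {
            return nodes[i].nodeValue;
        }
    }
    return null;
}

// Hypothetical helper: hand the markup to the browser's HTML parser by
// assigning it to a detached element, then collect the text of every
// <code> element the parser produced. `doc` is the page's document.
function codeTextsFromHtml(markup, doc) {
    var holder = doc.createElement('div');
    holder.innerHTML = markup;
    var found = holder.getElementsByTagName('code');
    var texts = [];
    for (var i = 0; i < found.length; i++) {
        texts.push(found[i].textContent);
    }
    return texts;
}

// In a browser this would be used along these lines:
// var markup = findCommentText(document.getElementById('mydiv'));
// console.log(codeTextsFromHtml(markup, document));
```

Nothing here involves a regular expression, which is the point of this answer; whether that satisfies the homework is a separate matter.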

Incognito

That's all well and good, but when homework is set on regular expressions, the point is usually for you to learn how to use regular expressions, rather than when to. – OrangeDog Jul 07 '12 at 18:04
@OrangeDog Sounds like a lousy way to learn, to be blunt about the whole thing. I'd demand better real-world examples that will actually come in useful, especially if I were paying. It makes me sad to see homework like this :( – Incognito Jul 07 '12 at 18:11
I'm sorry. What's wrong with using regEx again? Also, this is not a DOM problem. It's a comment in the DOM containing XML-formatted data. Why do people freak about regEx? If you know what you're doing, it's both efficient and reliable. Also it's much less of a PITA to just use innerHTML. It's de facto spec and not going anywhere any time soon. – Erik Reppen Jul 07 '12 at 18:48
@ErikReppen I suggest you read the most upvoted answer on SO. The one about parsing HTML with regex. – Florian Margaine Jul 07 '12 at 21:01
@FlorianMargaine it's only so high because it's also funny and only applies in the general HTML-parsing case. When you know you're dealing with a small subset that does form a regular language, there's nothing wrong with using regular expressions. – OrangeDog Jul 08 '12 at 17:21
Why are you re-inventing a parser when one exists and is not prone to errors? – Incognito Jul 08 '12 at 18:35
**+1** for stomping on the nail. :-) – ghoti Aug 22 '12 at 00:42

score 1 · Accepted Answer · answered Jul 07 '12 at 16:45

First of all, you should be aware that because HTML is not a regular language, you cannot do generic parsing using regular expressions that will work for all valid inputs (generic nesting in particular cannot be expressed with regular expressions). Many parsers do use regular expressions to match individual tokens, but other algorithms need to be built around them.

However, for a fixed input such as this, it's just a case of working through the structure you have (though it's still often easier to use different parsing methods than just regular expressions).

First, let's get all the code:

var code = '', match = [];
var regex = new RegExp("<code>(.*?)</code>", "g");
while (match = regex.exec(content)) {
    code += match[1] + "\n";
}

I assume content contains the content of the div that you've already extracted. Here the "g" flag says this is for "global" matching, so we can reuse the regex to find every match. The brackets indicate a capturing group, . means any character, * means repeated 0 or more times, and ? means "non-greedy" (see what happens without it to see what it does).
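To see what the `?` and the "g" flag each contribute, here is a small self-contained sketch. The `sample` string is made up for illustration and is not the div content from the question.

```javascript
// Made-up sample with two <code> blocks on one line.
var sample = '<code>a();</code> and <code>b();</code>';

// Greedy: .* runs as far as it can, so it stops at the LAST </code>
// and both blocks end up inside a single capture.
var greedy = /<code>(.*)<\/code>/.exec(sample)[1];

// Non-greedy: .*? stops at the FIRST </code> it can reach.
var lazy = /<code>(.*?)<\/code>/.exec(sample)[1];

// With the "g" flag, each exec() call resumes from regex.lastIndex,
// so a loop visits every match in turn and ends when exec() returns null.
var regex = /<code>(.*?)<\/code>/g;
var all = [];
var match;
while ((match = regex.exec(sample)) !== null) {
    all.push(match[1]);
}

console.log(greedy); // a();</code> and <code>b();
console.log(lazy);   // a();
console.log(all);    // [ 'a();', 'b();' ]
```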

Now we can do a similar thing to get all the other bits, but this time the regex is slightly more complicated:

new RegExp("(<!--|</code>)(.*?)(-->|<code>)", "g")

Here | means "or". So this matches all the bits that start with either "start comment" or "end code" and end with "end comment" or "start code". Note also that we now have 3 sets of brackets, so the part we want to extract is match[2] (the second set).
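A sketch of how that pattern could be put to work follows. Two things in it are assumptions rather than part of the answer: the `content` string is a hand-built stand-in for what `.html()` might return, and `.` has been swapped for `[\s\S]`, because in JavaScript `.` does not match line breaks (unless the `s` flag is used) and the text pieces here span several lines.

```javascript
// Stand-in for the div's inner HTML (an assumption, built by hand).
var content = [
    '<!--',
    ' what is your present',
    ' <code>alert("this is my present");</code>',
    ' where?',
    ' <code>alert("here at my left hand");</code>',
    ' oh thank you! i love you!! hehe',
    '  <code>alert("welcome my honey ^^");</code>',
    '-->'
].join('\n');

// Same alternation as above, with [\s\S] in place of . so that the
// middle group may cross line breaks.
var textRegex = /(<!--|<\/code>)([\s\S]*?)(-->|<code>)/g;

var html = [];
var match;
while ((match = textRegex.exec(content)) !== null) {
    // match[2] is the text between the two delimiters.
    var piece = match[2].trim();
    if (piece !== '') { // skip whitespace-only gaps, e.g. before -->
        html.push(piece);
    }
}

console.log(html);
```

Each match consumes its closing `<code>` but not the following `</code>`, so the next search can start from that `</code>` and pick up the next piece of text.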

An alternative using `indexOf` and `slice` would probably be easier to read and understand, but presumably that's not the point of the homework. And for the general case you'd need a proper HTML parser. — OrangeDog, Jul 07 '12 at 16:48
And here's a handy reference for regular expressions in JavaScript: https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/RegExp — OrangeDog, Jul 07 '12 at 16:51
Ick. Don't use '.'. Highly inefficient. It checks for every character regEx is capable of matching until you get a match. Use negative character classes. [^<]* (if it's not a '<' you get a match - much, much faster). RegEx is perfectly fast when optimized. — Erik Reppen, Jul 07 '12 at 18:40
Using `.` a) gives a simpler to understand expression where speed optimisations are not important (tiny input, few calls) and b) won't break if there are occurrences of `<` within the code (it's a common mathematical operator so not unexpected). — OrangeDog, Jul 08 '12 at 17:18
No it won't, it's in a comment. That's how the doesn't break it either. — OrangeDog, Jul 08 '12 at 17:52

score 1 · Answer 3 · answered Jul 07 '12 at 18:08

You're doing a lot of unnecessary stuff. .html() gives you the inner contents as a string. You should be able to use regEx to grab exactly what you need from there. Also, try to stick with regEx literals (e.g. /^regexstring$/). With new RegExp the pattern is a string, so every backslash has to be escaped again, which gets really messy. You generally only want to use new RegExp when you need to put a string var into a regEx.
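A short sketch of the difference. The patterns and the `tag` variable are made up for illustration.

```javascript
// Literal: each backslash is written once.
var literal = /<\/code>\s*\d+/;

// Constructor: the pattern is a string first, so every backslash
// has to be doubled to survive string parsing.
var built = new RegExp("<\\/code>\\s*\\d+");

// Where the constructor earns its keep: part of the pattern lives
// in a variable (hypothetical here).
var tag = "code";
var dynamic = new RegExp("<" + tag + ">([^<]*)</" + tag + ">", "g");

console.log(literal.test('</code> 42'));      // true
console.log(built.test('</code> 42'));        // true
console.log('<code>x</code>'.match(dynamic)); // [ '<code>x</code>' ]
```

If the variable could contain regex metacharacters, it would need escaping before being concatenated; that is left out here to keep the sketch short.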

The match function of strings accepts regEx and returns a collection of every match when you add the global flag (e.g. /^regexstring$/g <-- note the 'g'). I would do something like this:

var block = $('#mydiv').html(), // you can declare multiple vars in one statement with commas
    matches = block.match(/<code>[^<]*<\/code>/g);

// [^<]* <-- 0 or more characters that aren't '<' - google 'negated character class'

matches = matches.join('_') // lazy way of avoiding a loop - join into a string with a character the code doesn't contain
    .replace(/<\/*code>/g, '') // \/* 0 or more forward slashes, so both <code> and </code> are stripped
    .split('_'); // split the string back into an array

// Now do what you want with matches. Eval (ew) or append in a script tag (ew).
// You have no control over the 'ew'. I just prefer data to scripts in strings
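Two caveats with the approach above are worth seeing in action: the `_` separator is only safe if no code block contains an underscore, and `[^<]*` refuses any code that contains a `<`. The following sketch sidesteps the separator by using a capture group, and compares the two inner patterns. The `block` string and the helper name `codeContents` are made up for illustration.

```javascript
// Made-up input: one block contains '_', the other contains '<'.
var block = '<code>alert("a_b");</code> text <code>if (1 < 2) go();</code>';

// Hypothetical helper: collect capture group 1 of every match.
// `pattern` must carry the "g" flag, otherwise the loop never advances.
function codeContents(source, pattern) {
    var out = [];
    var m;
    while ((m = pattern.exec(source)) !== null) {
        out.push(m[1]);
    }
    return out;
}

// Negated class: fast, but skips the block containing '<'.
var strict = codeContents(block, /<code>([^<]*)<\/code>/g);

// Non-greedy dot: accepts '<' inside the code (single-line code only).
var loose = codeContents(block, /<code>(.*?)<\/code>/g);

console.log(strict); // [ 'alert("a_b");' ]
console.log(loose);  // [ 'alert("a_b");', 'if (1 < 2) go();' ]
```

Which trade-off is acceptable depends on what the code blocks are allowed to contain; for the three `alert` calls in the question, both patterns give the same result.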
