Fighting with regex....

I'm using this to find pieces of HTML-string between certain elements:

 for (i = 0; i < 2; i += 1) {
   target = block[i];   // like BODY or HEAD
   regex = RegExp('<' + target + '>(.)+</' + target + '>');
   // in case string passed includes breaks/spaces
   data = data.replace(/(\r\n|\n|\r)/gm,"").replace(/\s+/g," ")
             .match(regex);
   entry = data[0].replace(/<!-- [\s\S]*? -->/g, '');
   console.log(entry);
 }

While this works fine, it returns something like this:

<head>....everything I want ....</head>

Question:
How do I need to modify the regex so that I can still specify the element whose content I need, but get back only the content rather than the content plus the surrounding tags (like <head></head>)?

Thanks!

frequent
  • Use Amber's solution and also move the parens to include the `+`, like this: `'<' + target + '>(.+)</' + target + '>'` – Dehalion Feb 16 '13 at 22:49
  • Is there anything wrong with `$(target).each(function(){ console.log($(this).html()); })` apart from the comment nodes? – Fabrício Matté Feb 16 '13 at 22:51
  • @FabrícioMatté: actually no. I had some templates that contained comments, but this one does not, so I'm also trying this. – frequent Feb 16 '13 at 22:53
  • Of course, spaces still have to be collapsed with regex, and comment nodes can be removed with either `contents().filter()` or regex but yes, I'm still unsure of what you're trying to achieve. – Fabrício Matté Feb 16 '13 at 22:55
  • @Fabricio: I'm working on a plugin that pulls in snippets of code, which I prefer to be snippets, but which come as (uncompressed) HTML pages (think of a page with a button). I have to extract the bits and pieces of the snippet page that I need, because I cannot append the full snippet as is. So I created the regex to filter for script/css, which I'm appending to the page head, plus what's in the body (e.g. the solo button), which goes into the page. I solved it with Amber's answer, so I'm a happy camper. Thanks! – frequent Feb 16 '13 at 22:59
  • No problem. `=]` Though you know, regex is not really [suitable for parsing HTML](http://stackoverflow.com/a/1732454/1331430). That means your regex will fail to match if the tag has any HTML attribute e.g. ``, or if `` is inside a comment node, and so forth. Hopefully you aren't using that regex in the wild. `:P` – Fabrício Matté Feb 16 '13 at 23:01
  • @FabrícioMatté: well... technically it's being loaded as a string (requirejs text plugin). So not really sure, but this is a temp patch anyway until I find a better solution. – frequent Feb 16 '13 at 23:10
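
To illustrate the attribute caveat raised in the comments above, here is a minimal sketch of a pattern that tolerates attributes on the opening tag and content spanning several lines. `extractInner` is a hypothetical helper, not something from the thread, and it is still a regex: it will be fooled by a matching tag inside a comment or by a `>` inside an attribute value.

```javascript
// Hypothetical helper (an assumption for illustration, not from the thread).
// Returns the text between <tag ...> and </tag>, or null if there is no match.
function extractInner(html, tag) {
  // \b       stops 'head' from matching '<header>'
  // [^>]*    allows attributes on the opening tag
  // [\s\S]*? matches any character, including newlines, non-greedily
  const re = new RegExp('<' + tag + '\\b[^>]*>([\\s\\S]*?)</' + tag + '>', 'i');
  const m = html.match(re);
  return m ? m[1] : null;
}

console.log(extractInner('<body class="page">\n<button>Go</button>\n</body>', 'body'));
```

Because `[\s\S]` already matches line breaks, this variant does not need the line-break stripping step from the question.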

1 Answer

Use the first matching group instead of the whole match.

regex = RegExp('<' + target + '>(.+)</' + target + '>');

and then...

entry = data[1].replace(/<!-- [\s\S]*? -->/g, '');
Amber
  • Note the slight edit - you need `(.+)` (one matching group of repeated characters) rather than `(.)+` (repeated matching groups of one character each). – Amber Feb 16 '13 at 22:50
  • nice! I was looking at my `[1]` returning `>` wondering what to make of it :-) Thanks a lot! – frequent Feb 16 '13 at 22:55
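
The difference between `(.)+` and `(.+)` discussed above can be seen in a small standalone sketch. The sample string and the variable names `html`, `repeatedGroup` and `singleGroup` are assumptions made for illustration; only the two patterns come from the thread.

```javascript
const html = '<head><title>Demo</title></head>';
const target = 'head';

// (.)+ : the group itself is repeated, so it only remembers its last iteration.
const repeatedGroup = html.match(new RegExp('<' + target + '>(.)+</' + target + '>'));

// (.+) : the repetition is inside the group, so the group captures the whole run.
const singleGroup = html.match(new RegExp('<' + target + '>(.+)</' + target + '>'));

console.log(repeatedGroup[0]); // <head><title>Demo</title></head>
console.log(repeatedGroup[1]); // >
console.log(singleGroup[1]);   // <title>Demo</title>
```

Index `0` of the match array is always the whole match, which is why the original code returned the tags as well; index `1` is the first capture group.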