What regular expression would match this data?

Question

I have the following within an XHTML document:

<script type="text/javascript" id="JSBALLOONS">
    function() {
        this.init = function() {
            this.wAPI = new widgetAPI('__BALLOONS__');
            this.getRssFeed();
        };
    }
</script>

I'm trying to select everything in between the two script tags. The id will always be JSBALLOONS if that helps. I know how to select that including the script tags, but I don't know how to select the contents excluding the script tags. The result of the regular expression should be:

    function() {
        this.init = function() {
            this.wAPI = new widgetAPI('__BALLOONS__');
            this.getRssFeed();
        };
    }

Hello, my thanks was removed by a moderator!?! FYI, the end of this post used to include: Thanks, Pete. I dislike moderators nitpicking my posts especially removing my courtesy. — slypete, Jun 23 '09 at 18:27

molf · Accepted Answer · 2009-06-23T18:47:48.553

8

(Updated post specifically for a Javascript solution.)

In Javascript, your code might look like this:

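var inner_script;
// The parenthesized group captures everything between the two tags.
if (data.match(/<script[^>]+id="JSBALLOONS">([\S\s]*?)<\/script>/)) {
    inner_script = RegExp.$1; // text matched by the first group
}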

The part between parentheses, ([\S\s]*?), is saved by the regex engine and is accessible to you after a match is found. In Javascript, you can use RegExp.$1 to refer to the matched part inside the script tags. If you have more than one such group, each surrounded by (), you can refer to them with RegExp.$2, and so on, up to RegExp.$9.

In Javascript, the dot (.) does not match newline characters by default, which is why we have to use ([\S\s]*?) rather than (.*?), which might otherwise seem the more natural choice. Just to be complete, in other languages this is not necessary if you use the s modifier (/.../s); Javascript itself only gained that flag later, in ES2018.
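As a minimal sketch of the same idea without the legacy RegExp.$1 property: match() returns an array whose element at index 1 holds the first capture group. The names data, pattern and innerScript are only illustrative, and the sample string is a shortened stand-in for the markup in the question.

```javascript
// Sample input: a shortened version of the markup from the question.
const data = [
  '<script type="text/javascript" id="JSBALLOONS">',
  '    function() {',
  '        this.getRssFeed();',
  '    }',
  '</script>',
].join('\n');

// [\S\s]*? matches any character, newlines included, as few times as possible.
const pattern = /<script[^>]+id="JSBALLOONS">([\S\s]*?)<\/script>/;

// match() returns null when nothing matches; otherwise index 0 is the
// whole match and index 1 is the first capture group.
const result = data.match(pattern);
const innerScript = result ? result[1] : null;

console.log(innerScript);
```

Checking for null before indexing matters here, because a document without the JSBALLOONS element would otherwise throw.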

(I have to add that regexes are typically very fragile when scraping content from HTML pages like this. You may be better off using the jQuery framework to extract the contents.)

edited Jun 23 '09 at 18:47

answered Jun 23 '09 at 18:08

molf


Hi, thanks. This is exactly what I have, but it includes the script tags. Can you explain what you mean by $1? I'm unfamiliar. Thanks! – slypete Jun 23 '09 at 18:22
@slypete, which language or tool are you using to execute the regex? – molf Jun 23 '09 at 18:24
@molf, I'm using javascript and jQuery. var javascript = this.data.match(/ – slypete Jun 23 '09 at 18:28
@slypete, updated with an example in Javascript. In Javascript, groups are saved in RegExp.$1, RegExp.$2, etc, up to RegExp.$9. – molf Jun 23 '09 at 18:40

score 2 · Answer 2 · answered Jun 23 '09 at 18:37

What the gentleman means by $1 is "the value of the first capture group". When you enclose part of your regular expression in parentheses, it defines capture groups. You count them from the left to the right. Each opening parenthesis starts a new capture group. They can be nested.

(There are ways to define sub expressions without defining capture groups - I forget the syntax.)

In Perl, $1 is the magic variable holding the string matched by the first capture group, $2 is the string matched by the second, etc. Other languages may require you to call a method on the returned match object to get the Nth capture group.
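To make the numbering concrete, here is a small sketch in JavaScript (the language the asker is using). The date string and the variable names are made up for illustration. The syntax for grouping without capturing, mentioned above, is (?:...).

```javascript
// Groups are numbered by the position of their opening parenthesis,
// counted from left to right, so nested groups get their own numbers.
const nested = /((\d{4})-(\d{2}))-(\d{2})/;
const m = '2009-06-23'.match(nested);
console.log(m[1]); // "2009-06" (outer group)
console.log(m[2]); // "2009"    (first nested group)
console.log(m[3]); // "06"      (second nested group)
console.log(m[4]); // "23"

// (?:...) groups a sub-expression without creating a capture group,
// so the numbering starts at the first plain (...) group instead.
const nonCapturing = /(?:\d{4})-(\d{2})-(\d{2})/;
const n = '2009-06-23'.match(nonCapturing);
console.log(n[1]); // "06"
console.log(n[2]); // "23"
```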

But back to molf's solution. Suppose he said to use this pattern instead:

/<script[^>]+id="JSBALLOONS">(.*)<\/script>/

In this case, if you have more than one script element, this incorrect pattern will gobble them all up because it is greedy. The pattern starts at the first opening tag, matches up to its closing tag, keeps going, and finally stops at the last </script>. The magic in molf's solution is the question mark after the star, as in (.*?), which makes the quantifier non-greedy. It returns the shortest string that matches the pattern and hence does not gobble up extra script elements.
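The difference is easy to see on a string with two script elements. This is only a sketch: the sample markup and the second id, OTHER, are invented for the demonstration, and [\S\s] is used in place of the dot so that the comparison would also hold across newlines in JavaScript.

```javascript
// Two script elements back to back; only the first one is wanted.
const html =
  '<script id="JSBALLOONS">first();</script>' +
  '<script id="OTHER">second();</script>';

// Greedy: * takes as much as it can, so the match runs to the LAST </script>.
const greedy = /<script[^>]+id="JSBALLOONS">([\S\s]*)<\/script>/;
// Non-greedy: *? takes as little as it can, so it stops at the FIRST </script>.
const lazy = /<script[^>]+id="JSBALLOONS">([\S\s]*?)<\/script>/;

const greedyCapture = html.match(greedy)[1];
const lazyCapture = html.match(lazy)[1];

console.log(greedyCapture); // first();</script><script id="OTHER">second();
console.log(lazyCapture);   // first();
```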

Svante · Answer 3 · 2009-06-23T22:22:09.870

2

Don't try to use regular expressions for non-regular languages. The right way is to use an XML parser or the DOM:

document.getElementById("JSBALLOONS")

edit: Regarding your comment, I have no experience with JavaScript or jQuery, but after some searching, I think that something along these lines should work:

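$.ajax({
  type: "GET",
  url: "test.xml",
  dataType: "xml",
  success: function(xml) {
    // Use the text inside this callback; a value returned from it is discarded.
    var contents = $(xml).find("#JSBALLOONS").text();
    console.log(contents);
  }
});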

Can someone more qualified correct this?

edited Jun 23 '09 at 22:22

answered Jun 23 '09 at 18:44

Svante


The document is remotely loaded into a string that I need to extract select things from. I'm aware regex is not the best solution. Please do let me know if you know of other working solutions. Thanks! – slypete Jun 23 '09 at 19:01
Again, it will not work. I've tried this. Please see my other more general question for the reason: http://stackoverflow.com/questions/1034881/what-is-the-best-practice-for-parsing-remote-content-with-jquery Hopefully someone will be able to come up with an answer for this question. – slypete Jun 24 '09 at 00:30
I always like to post a link to this when anyone mentions regexing a tagged language http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Joel Berger Jan 09 '11 at 23:08

Christoph · Answer 4 · 2009-06-23T19:28:03.980

0

Let foo be the string containing the code. Then, you can strip the enclosing tags via

foo = foo.substring(foo.indexOf('>') + 1, foo.lastIndexOf('<'))
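A quick sketch of what that line does, assuming foo holds nothing but the script element itself (the sample string is a shortened stand-in for the markup in the question, and contents is just an illustrative name):

```javascript
// Assumption: foo contains only the <script> element, nothing before or after.
const foo =
  '<script type="text/javascript" id="JSBALLOONS">\n' +
  '    this.getRssFeed();\n' +
  '</script>';

// indexOf('>') finds the end of the opening tag;
// lastIndexOf('<') finds the start of the closing tag.
const contents = foo.substring(foo.indexOf('>') + 1, foo.lastIndexOf('<'));

console.log(contents.trim()); // this.getRssFeed();
```

Because it relies on the first '>' and the last '<' in the string, this only works once the element has already been isolated from the rest of the document.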

edited Jun 23 '09 at 19:28

answered Jun 23 '09 at 19:00

Christoph

