
I'm trying to get some page details (page title, images on the page, etc.) of an arbitrarily entered URL/page. I have a back-end proxy script that I use via an ajax GET in order to return the full HTML of the remote page. Once I get the ajax response back, I'm trying to run several jQuery selectors on it to extract the page details. Here's the general idea:

$.ajax({
    type: "GET",
    url: base_url + "/Services/Proxy.aspx?url=" + url,
    success: function (data) {
        //data is now the full html string contained at the url

        //generally works for images
        var potential_images = $("img", data);

        //doesn't seem to work even if there is a title in the HTML string
        var name = $(data).filter("title").first().text();

        var description = $(data).filter("meta[name='description']").attr("content");
    }
});

Sometimes using $("selector", data) seems to work while other times $(data).filter("selector") seems to work. Sometimes, neither works. When I just inspect the contents of $(data), it seems that some nodes make it through, but some just disappear. Does anyone know a consistent way to run selectors on a full HTML string?

Ender
Ben

1 Answer


Your question is somewhat vague, especially about which input causes which code to fail, and how. Malformed HTML could be the culprit, but I can only guess.

That said, your best bet is to parse the string once with $(data) and work with the resulting jQuery object rather than the raw data string:

$.ajax({
    type: "GET",
    // encode the target URL so characters such as & and # survive the query string
    url: base_url + "/Services/Proxy.aspx?url=" + encodeURIComponent(url),
    success: function(data) {
        // data is the full HTML string contained at the url; parse it once
        var $data = $(data);

        // $("img", $data) is the same as $data.find("img"): it matches descendants only
        var potential_images = $("img", $data);

        // .filter() tests only the top-level elements of $data, not their descendants
        var name = $data.filter("title").first().text();

        var description = $data.filter("meta[name='description']").attr("content");
    }
});
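
If $(data) still drops nodes, one alternative worth trying is to parse the string into a separate document first. The jQuery documentation notes that when a full page is passed as an HTML string, some browsers filter out elements such as <html>, <head>, or <title>, which may be why the title goes missing. The sketch below assumes a browser where DOMParser supports "text/html"; extractPageDetails and its parseHtml parameter are hypothetical names used for illustration, not part of jQuery or of your proxy.

```javascript
// Sketch: parse the full HTML string into its own Document so that
// <head>, <title> and <meta> are kept, then query that document.
// extractPageDetails and parseHtml are hypothetical names.
function extractPageDetails(html, parseHtml) {
    // Default to the browser's DOMParser (assumed to support "text/html");
    // a different parser function can be passed in other environments.
    parseHtml = parseHtml || function (markup) {
        return new DOMParser().parseFromString(markup, "text/html");
    };
    var doc = parseHtml(html);

    var titleEl = doc.querySelector("title");
    var metaEl = doc.querySelector("meta[name='description']");
    var imgEls = doc.querySelectorAll("img");

    // Collect the src attribute of every image found in the document
    var images = [];
    for (var i = 0; i < imgEls.length; i++) {
        images.push(imgEls[i].getAttribute("src"));
    }

    return {
        name: titleEl ? titleEl.textContent : "",
        description: metaEl ? metaEl.getAttribute("content") : null,
        images: images
    };
}
```

Inside the success callback this would be called as extractPageDetails(data). If you prefer jQuery selectors, wrapping the parsed document and calling .find() on it should behave the same way, since every element is then a descendant of the document.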
Matt Ball
  • Unfortunately the input could potentially be the HTML of any arbitrary page. I've tried with numerous popular websites including cnn.com, twitter.com, and espn.go.com -- all of which seem to have the same problems, especially with extracting the title. – Ben Dec 15 '10 at 23:27