Get all URLs from an external URL

Question

I'm trying to get all URLs from a page using jQuery so that I can request them later with $.get(). If the links were on the same page that includes the script, it would be no problem; I could call something like

var links = document.getElementsByTagName("a");
for(var i=0; i<links.length; i++) {
    alert(links[i].href);
}

Here I just use alert to check that the links were actually parsed. But how can I do the same thing with a URL that is not the current page? Any help would be appreciated. Maybe I'm missing something ridiculously simple, but I am really stumped when it comes to anything JavaScript/jQuery related.

To access the content of a page on a different domain, that page must be written to allow you to do so; it's not possible (in the client) by default (Same Origin Policy) — Alex K., May 22 '17 at 15:10
You would have to 1. `$.get()` the other page 2. use an HTML parser to parse the HTML source into a DOM-object 3. search that for links — , May 22 '17 at 15:11
Given that an arbitrary URL won't allow you to, see [jquery .load() page then parse html](https://stackoverflow.com/questions/3856590/jquery-load-page-then-parse-html) — Alex K., May 22 '17 at 15:13

score 2 · Answer 1 · answered May 22 '17 at 15:17

Blatantly copying this answer by Nick Craver (go upvote it), but modifying it for your use case:

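$.get("page.html", function(data) {
  // Wrap the parsed markup in a detached <div> so that .find() also
  // matches <a> elements at the top level of the response.
  var $page = $('<div>').append($.parseHTML(data));
  var links = $page.find('a');
  // do stuff with links
});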

Note that this will only work if the page you're hitting is set up for cross-origin requests (CORS). If it isn't, you'll need to do the same with a DOM parser on a backend server. Node.js has some great options there, including jsdom.
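Since the goal is to request the collected links later with $.get(), relative hrefs from the fetched page need to be resolved against that page's URL rather than the current one. A minimal sketch of that step, assuming the hrefs have already been collected as raw strings (for example via .attr('href')); the helper name toAbsoluteUrls and the example.com URLs are hypothetical:

```javascript
// Hypothetical helper: resolve raw href strings against the URL of the
// page they were taken from, keeping only unique http(s) URLs.
function toAbsoluteUrls(hrefs, pageUrl) {
  var result = [];
  hrefs.forEach(function (href) {
    try {
      var url = new URL(href, pageUrl); // resolves relative paths against pageUrl
      if (url.protocol !== 'http:' && url.protocol !== 'https:') {
        return; // skip mailto:, javascript:, etc.
      }
      url.hash = ''; // a fragment does not change which resource is requested
      if (result.indexOf(url.href) === -1) {
        result.push(url.href);
      }
    } catch (e) {
      // new URL() throws on hrefs it cannot parse; skip those
    }
  });
  return result;
}

// Example usage (assumed page URL):
// toAbsoluteUrls(['/a', 'b.html#top'], 'https://example.com/dir/page.html');
```

The resulting strings can be passed to $.get() one by one, subject to the same cross-origin restrictions mentioned above.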

varbrad · Answer 2 · 2017-05-22T15:30:14.587

1

You will have to get the other page via an HTTP request ($.get in jQuery achieves this), and then either convert that HTML into a DOM that jQuery can traverse to find the <a> tags for you, ~~or use another method such as a regular expression to find all the links within the returned markup.~~

edit: Probably don't actually use a regex unless you have a guaranteed HTML format and can guarantee the format of all <a> tags on the page. By this point, it's probably just easier to parse the HTML for real.
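As a rough illustration of parsing the HTML for real without jQuery: browsers provide a DOMParser that turns a string of markup into a detached document, which can then be queried like any other. This is only a sketch under assumptions. The helper name extractHrefs and its optional parser argument are hypothetical; the argument exists so that an equivalent parser can be supplied outside the browser (jsdom is one possible source).

```javascript
// Hypothetical helper: parse an HTML string and return the raw href
// attribute of every link in it.
// `parser` is optional; in a browser it falls back to the built-in DOMParser.
function extractHrefs(html, parser) {
  parser = parser || new DOMParser();
  var doc = parser.parseFromString(html, 'text/html');
  var anchors = doc.querySelectorAll('a[href]');
  return Array.prototype.map.call(anchors, function (a) {
    // getAttribute returns the value exactly as written in the markup
    return a.getAttribute('href');
  });
}

// Example usage in a browser, with the markup returned by $.get():
// $.get('page.html', function (data) {
//   var hrefs = extractHrefs(data);
// });
```

The values come back exactly as written in the markup, so relative paths still need to be resolved against the fetched page's URL before they are requested.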

edited May 22 '17 at 15:30

answered May 22 '17 at 15:15


Please do not parse HTML with a regex! https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – Pevara May 22 '17 at 15:19
It can be done if you can guarantee the markup format, but you are right, it's probably just easier to parse the HTML for real and go from there. – varbrad May 22 '17 at 15:31

score 0 · Answer 3 · answered May 22 '17 at 15:20

Collect the current page's URL using window.location.href, then compare it with the href of each "a" tag in the loop:

var links = document.getElementsByTagName("a");
var thisHref = window.location.href;
for(var i=0; i<links.length; i++) {
    var templink = links[i].href; // declared with var to avoid an implicit global
    if (templink !== thisHref) { // skip links that point to the current page URL
        alert(links[i].href);
    }
}
