11

I have a JavaScript variable containing the HTML source code of a page (not the source of the current page), and I need to extract all links from it. Any clues as to the best way of doing this?

Is it possible to create a DOM for the HTML in the variable and then walk that?

Gavin Miller
Hinchy
  • I think your best approach will be using some kind of JS HTML document parser. Alternatively you could use regular expressions, but I don't think that's the best way of doing this. – Waleed Amjad Sep 28 '09 at 15:11

5 Answers

12

I don't know if this is the recommended way, but it works (plain JavaScript only):

var rawHTML = '<html><body><a href="foo">bar</a><a href="narf">zort</a></body></html>';

// Build a detached <html> element and let the browser parse the markup into it.
var doc = document.createElement("html");
doc.innerHTML = rawHTML;

// Collect the href attribute of every <a> element found.
var links = doc.getElementsByTagName("a");
var urls = [];

for (var i = 0; i < links.length; i++) {
    urls.push(links[i].getAttribute("href"));
}
alert(urls);
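
The `href` values collected this way come back exactly as written in the markup, so relative links such as `foo` stay relative. A minimal sketch of resolving them against a base URL with the standard `URL` constructor might look like the following. The base `https://example.com/dir/` and the helper name `resolveLinks` are assumptions made for illustration, not part of the original answer.

```javascript
// Hypothetical helper: resolve raw href strings against an assumed base URL.
function resolveLinks(hrefs, base) {
  return hrefs
    .filter((href) => href !== null) // getAttribute returns null when href is absent
    .map((href) => new URL(href, base).href);
}

// Assumed sample input mirroring the hrefs in the snippet above.
const resolved = resolveLinks(["foo", "narf", null], "https://example.com/dir/");
console.log(resolved);
```

The base URL has to come from wherever the HTML string was fetched, since the string itself usually does not carry it.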
andre-r
  • If you are in Node, this answer can be extended with JSDOM: `const { JSDOM } = require("jsdom");` `const dom = new JSDOM(rawHTML);` `const { document: doc } = dom.window;` then skip to `var links = doc.getElementsByTagName("a")` – Nth.gol Oct 17 '18 at 17:22
7

If you're using jQuery, I believe you can do this really easily:

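// Wrap the parsed nodes in a container so top-level <a> elements are found too.
var doc = $('<div>').append($.parseHTML(rawHTML));
var links = $('a', doc);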

http://docs.jquery.com/Core/jQuery#htmlownerDocument

brianreavis
2

This is especially useful if you need to replace links:

var linkReg = /(<[Aa]\s(.*)<\/[Aa]>)/g;

var linksInText = text.match(linkReg);
Wosis
  • When I tested this with this HTML string `var text = 'bar dfgdfg ghghhkhkhzort';` it returned the whole HTML text string and not just matches for A links. – JasonDavis May 07 '16 at 01:50
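
The behavior described in this comment is consistent with the greedy `(.*)` in the pattern, which can run from the first `<a ` to the last `</a>` on a line. A minimal sketch of a non-greedy variant follows. The sample `text` and the name `anchorReg` are made up for illustration, and like any regex approach it will not handle every valid HTML document.

```javascript
// Assumed sample input; not taken from the original answer or comment.
const text = '<p><a href="foo">bar</a> dfgdfg <a href="narf">zort</a></p>';

// [^>]* stays inside the opening tag; [\s\S]*? stops at the first closing </a>.
const anchorReg = /<a\s[^>]*>[\s\S]*?<\/a>/gi;

// match() returns null when nothing matches, so fall back to an empty array.
const linksInText = text.match(anchorReg) || [];
console.log(linksInText);
```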
1

If you're running Firefox, YES YOU CAN! It's called DOMParser, check it out:

DOMParser is mainly useful for applications and extensions based on Mozilla platform. While it's available to web pages, it's not part of any standard and level of support in other browsers is unknown.
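
For what it's worth, `DOMParser` has since been standardized, and `parseFromString` accepts `"text/html"` in current browsers. A minimal sketch of how it could be applied to the question follows. `extractHrefs` and its `Parser` parameter are hypothetical names introduced here, and the call at the bottom only runs where a global `DOMParser` exists (a browser, not plain Node).

```javascript
// Hypothetical helper: parse an HTML string into a detached document and
// return the href attribute of every <a> element that has one.
// The Parser parameter is an assumption that lets a different parser be passed in.
function extractHrefs(html, Parser = globalThis.DOMParser) {
  const doc = new Parser().parseFromString(html, "text/html");
  return Array.from(doc.querySelectorAll("a[href]"), (a) => a.getAttribute("href"));
}

// Only run the demo where DOMParser is actually available.
if (typeof DOMParser !== "undefined") {
  console.log(extractHrefs('<a href="foo">bar</a><a href="narf">zort</a>'));
}
```

Because the document is detached, scripts in the string are not executed during parsing.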
xxxxxxx
1

If you are running outside a browser context and don't want to pull in an HTML parser dependency, here's a naive approach:

var html = `
<html><body>
  <a href="https://example.com">Example</a>
  <p>text</p>
  <a download href='./doc.pdf'>Download</a>
</body></html>`

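// Group 1 captures the optional quote character, group 2 the URL;
// \1 requires the same quote (if any) to close the value.
// Note: unquoted values (href=foo) are cut to their first character by the lazy match.
var anchors = /<a\s[^>]*?href=(["']?)([^\s]+?)\1[^>]*?>/ig;
var links = [];
// replace() is used only to visit each match; its return value is ignored.
html.replace(anchors, function (_anchor, _quote, url) {
  links.push(url);
});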

console.log(links);
Alex Gyoshev