How would you get the URLs of all objects embedded into a webpage (or just the hostnames)? Which tag-attribute combination would you use? (or something else?)
For example, the Stack Overflow page starts like this:
<!DOCTYPE html>
<html>
<head>
<title>Stack Overflow</title>
<link rel="shortcut icon" href="//cdn.sstatic.net/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
<link rel="apple-touch-icon image_src" href="//cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
[...]
<meta property="og:image" itemprop="image primaryImageOfPage" content="http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded&a" />
Here, the URLs //cdn.sstatic.net/stackoverflow/img/favicon.ico?v=4f32ecc8f43d and //cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a are in a href attribute, while http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded&a is in a content attribute. Additionally, images and scripts have a src attribute.
The images HTMLCollection (document.images) would be a starting point, but the DOM specification recommends using getElementsByTagName rather than this attribute to find the images in a document.
Tag-attribute combinations to consider: a.href, img.src, link.href, script.src, and meta.content. Which others?
Here is an approach that works once the right tag-attribute combinations are known, shown for the anchor tag:
var urls = [];
var allA = document.getElementsByTagName("A");
for (var i = 0; i < allA.length; i++) {
    // Keep only anchors that actually carry a non-empty href
    if (typeof allA[i].href === "string" && allA[i].href !== "") {
        urls.push(allA[i].href);
    }
}
This could be repeated for all tag-attribute combinations.
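The repetition could be sketched as a loop over a table of tag-attribute pairs. This is only a rough sketch: TAG_ATTRIBUTES, collectUrls and toHostnames are hypothetical names, the table holds just the incomplete list of combinations from above, and the doc argument is assumed to be the browser's document (or anything else that offers getElementsByTagName).

```javascript
// Tag/attribute pairs from the list above; assumed to be incomplete.
var TAG_ATTRIBUTES = [
    ["a", "href"],
    ["img", "src"],
    ["link", "href"],
    ["script", "src"],
    ["meta", "content"]
];

// Hypothetical helper: collects every non-empty string value found for
// the given tag/attribute pairs. `doc` is expected to provide
// getElementsByTagName, as the browser's `document` does.
function collectUrls(doc, pairs) {
    var urls = [];
    pairs.forEach(function (pair) {
        var elements = doc.getElementsByTagName(pair[0]);
        for (var i = 0; i < elements.length; i++) {
            var value = elements[i][pair[1]];
            if (typeof value === "string" && value !== "") {
                urls.push(value);
            }
        }
    });
    return urls;
}

// Hypothetical helper: reduces the collected values to unique hostnames.
// Relative and protocol-relative values are resolved against baseUrl;
// values that cannot be parsed at all are skipped.
function toHostnames(urls, baseUrl) {
    var hosts = [];
    urls.forEach(function (url) {
        try {
            var host = new URL(url, baseUrl).hostname;
            if (host !== "" && hosts.indexOf(host) === -1) {
                hosts.push(host);
            }
        } catch (e) {
            // Not parseable as a URL; ignore this value.
        }
    });
    return hosts;
}
```

In a browser this might be called as toHostnames(collectUrls(document, TAG_ATTRIBUTES), document.baseURI). One caveat: meta.content often holds text that is not a URL at all, and such values would be resolved as relative paths and so attributed to the page's own host, so that pair probably needs extra filtering.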
Which tags with which attributes did I miss?