URLs of all embedded objects in an HTML page

Question

How would you get the URLs of all objects embedded into a webpage (or just the hostnames)? Which tag-attribute combination would you use? (or something else?)

For example, the Stack Overflow page starts like this:

<!DOCTYPE html>
<html>
<head>

<title>Stack Overflow</title>
    <link rel="shortcut icon" href="//cdn.sstatic.net/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
    <link rel="apple-touch-icon image_src" href="//cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
[...]   
    <meta property="og:image" itemprop="image primaryImageOfPage" content="http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded&a" />

Here, the URLs //cdn.sstatic.net/stackoverflow/img/favicon.ico?v=4f32ecc8f43d and //cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a are in a href attribute, while http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded&a is in a content attribute. Additionally, images and scripts have a src attribute.

The images HTMLCollection would be a starting point, but the DOM specification recommends

not to use this attribute to find the images in the document but getElementsByTagName

Tag-attribute combinations to consider: a.href, img.src, link.href, script.src, and meta.content. Which others?

Here is an approach given the right tag combinations. It is an example for the anchor tag:

var urls = [];
var allA = document.getElementsByTagName("A");
for (var i = 0; i < allA.length; i++) {
    if (typeof allA[i].href === "string" && allA[i].href !== "") {
        urls.push(allA[i].href);
    }
}

This could be repeated for all tag-attribute combinations.

Which tags with which attributes did I miss?

Stackoverflow is not a free coding service. What have you tried? What was your effort? Where did you fail or encounter problems? — RononDex, Feb 12 '16 at 11:47
@RononDex: i was mostly unsure whether the above tags would suffice. Seems like you would like some code with `getElementsByTagName`, right? Give me a bit of time. — serv-inc, Feb 12 '16 at 11:49
No I am asking what have you tried by your own to solve this? Right now you are just asking us to do your job for free. — RononDex, Feb 12 '16 at 11:50
@RononDex: Just to make sure that people do not think as you did, there is some code now. Repeat the loop for each tag combination to get the URLs. Just to make sure that **this is not about code, but about the approach === the tags** (or any other approach that is more fitting). — serv-inc, Feb 12 '16 at 12:04
One suggestion from my side would be to use a regex to extract urls. — RononDex, Feb 12 '16 at 12:05
However, your question still does not conform to the Stackoverflow requirements for a valid question: You are asking us "what ways are there", which does not have 1 clear correct answer but instead may have 1000s of valid answers. Questions need to fulfill the criterion that they can be answered with 1 clear correct answer. — RononDex, Feb 12 '16 at 12:07
@RononDex: How would you check that they do not appear in text? Re2: I could not find the "can be answered with 1 clear correct answer" in http://stackoverflow.com/help/on-topic and http://stackoverflow.com/help/dont-ask. Do you think this is a question that has many opinionated answers? Granted, it is an implementation problem with different approaches, but those are allowed, right? — serv-inc, Feb 12 '16 at 12:11
Possible duplicate of [Detect URLs in text with JavaScript](http://stackoverflow.com/questions/1500260/detect-urls-in-text-with-javascript) — Asons, Feb 12 '16 at 12:38
@LGSon: thank you for the link. If nothing better shows up, that is a possible solution. Yet the DOM already differentiates, so the question remains: *"How would you check that they do not appear in text?"* — serv-inc, Feb 12 '16 at 12:43

score 0 · Accepted Answer · edited May 23 '17 at 11:46

The tags <a> and <meta> turned out to be too broad: <a> elements link to other pages rather than embedding anything, and <meta> turned up some URLs, but those are not embedded either. Thus, an attempt would look like this:

// Collect the non-empty values of `attribute` from every <tag> element.
function getAttributeFromTags(tag, attribute) {
  var out = [];
  var elements = document.getElementsByTagName(tag);
  for (var i = 0; i < elements.length; i++) {
    // Read the DOM property; it is an empty string when the attribute is unset.
    var value = elements[i][attribute];
    if (typeof value === 'string' && value !== '') {
      out.push(value);
    }
  }
  return out;
}
// Append the URLs of each tag/attribute pair that embeds a resource.
var urls = [];
Array.prototype.push.apply(urls, getAttributeFromTags('AUDIO', 'src'));
Array.prototype.push.apply(urls, getAttributeFromTags('EMBED', 'src'));
Array.prototype.push.apply(urls, getAttributeFromTags('IMG', 'src'));
Array.prototype.push.apply(urls, getAttributeFromTags('LINK', 'href'));
Array.prototype.push.apply(urls, getAttributeFromTags('OBJECT', 'data'));
Array.prototype.push.apply(urls, getAttributeFromTags('SCRIPT', 'src'));
Array.prototype.push.apply(urls, getAttributeFromTags('SOURCE', 'src'));
Array.prototype.push.apply(urls, getAttributeFromTags('VIDEO', 'src'));
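
The question also allows for "just the hostnames". A minimal sketch of that reduction follows, assuming the `URL` constructor is available (it is newer than the ES5-style code above, so older browsers may lack it). The function name `toHostnames` and its `base` parameter are hypothetical; `base` is only needed when the input strings are relative or protocol-relative, as in the `//cdn.sstatic.net/...` values from the question.

```javascript
// Reduce a list of URL strings to the unique hostnames they point at.
// `base` (hypothetical parameter) is the page address used to resolve
// relative and protocol-relative inputs.
function toHostnames(urls, base) {
  var seen = {};
  var hosts = [];
  for (var i = 0; i < urls.length; i++) {
    var host;
    try {
      host = new URL(urls[i], base).hostname;
    } catch (e) {
      continue; // skip strings that cannot be parsed as URLs
    }
    // data: URLs and similar have an empty hostname; leave them out.
    if (host !== '' && !Object.prototype.hasOwnProperty.call(seen, host)) {
      seen[host] = true;
      hosts.push(host);
    }
  }
  return hosts;
}

// Sample input taken from the markup quoted in the question.
var sample = [
  '//cdn.sstatic.net/stackoverflow/img/favicon.ico?v=4f32ecc8f43d',
  'http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded&a',
  '/questions'
];
var hosts = toHostnames(sample, 'https://stackoverflow.com/');
```

In a browser, something like `toHostnames(urls, document.baseURI)` would apply this to the array collected above.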

Caveat

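Using link.href includes too many URLs, presumably because <link> also carries relations such as canonical or alternate that point to other documents rather than embedding anything (for example, have a look at view-source:https://www.youtube.com/watch?v=kPUglMKGXRM; SO does not allow view-source links).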
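
One way to narrow this down might be to keep only those `<link>` elements whose `rel` value suggests that the browser fetches the target. The sketch below is an assumption, not a verified rule: the `EMBEDDED_RELS` list and the function name `embeddedLinkHrefs` are made up for illustration, and the list would need to be adjusted to whatever counts as "embedded" for the task at hand.

```javascript
// rel tokens assumed (for this sketch only) to mean "fetched by the browser".
var EMBEDDED_RELS = ['stylesheet', 'icon', 'apple-touch-icon', 'manifest', 'preload'];

// `links` is any array-like of objects exposing `rel` and `href` strings,
// e.g. document.getElementsByTagName('LINK') in a browser.
function embeddedLinkHrefs(links) {
  var out = [];
  for (var i = 0; i < links.length; i++) {
    var href = links[i].href;
    if (typeof href !== 'string' || href === '') {
      continue;
    }
    // rel is a space-separated list of case-insensitive tokens.
    var tokens = String(links[i].rel || '').toLowerCase().split(/\s+/);
    for (var j = 0; j < tokens.length; j++) {
      if (EMBEDDED_RELS.indexOf(tokens[j]) !== -1) {
        out.push(href);
        break; // one matching token is enough
      }
    }
  }
  return out;
}

// Plain objects stand in for <link> elements here.
var sampleLinks = [
  { rel: 'shortcut icon', href: 'https://cdn.sstatic.net/stackoverflow/img/favicon.ico' },
  { rel: 'canonical', href: 'https://example.com/page' },
  { rel: 'apple-touch-icon image_src', href: 'https://cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png' },
  { rel: 'alternate', href: 'https://example.com/feed' }
];
var embeddedHrefs = embeddedLinkHrefs(sampleLinks);
```

With this filter, the LINK line of the collection above could be replaced by a call to `embeddedLinkHrefs`, leaving the other tag-attribute pairs unchanged.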

Implementation

HTMLCollection does not offer forEach (except with a weird syntax), and the workarounds are not widely supported. This is why the code above uses a plain for loop.
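
For reference, the "weird syntax" presumably means borrowing the array method via `Array.prototype.forEach.call`, which works on any array-like object, that is, anything with numeric indices and a `length`. The sketch below uses a plain object as a stand-in for an HTMLCollection so it runs outside a browser; the name `collectAttribute` is hypothetical.

```javascript
// Iterate an array-like (such as an HTMLCollection) with a borrowed
// Array.prototype.forEach and collect non-empty string values.
function collectAttribute(arrayLike, attribute) {
  var out = [];
  Array.prototype.forEach.call(arrayLike, function (el) {
    if (typeof el[attribute] === 'string' && el[attribute] !== '') {
      out.push(el[attribute]);
    }
  });
  return out;
}

// A plain object with indices and length stands in for an HTMLCollection.
var fakeCollection = {
  0: { src: 'a.png' },
  1: { src: '' },
  2: { src: 'b.js' },
  length: 3
};
var srcs = collectAttribute(fakeCollection, 'src');
```

In a browser, `collectAttribute(document.getElementsByTagName('IMG'), 'src')` should behave like `getAttributeFromTags('IMG', 'src')` from the answer.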
