
I'm holding HTML source code in a variable, which I obtained from a DB. I'd like to search this content for all the "a href" attributes and list them in a table.

I've found here how to search for them in the DOM (like below), but how can I use that to search within a variable?

var links = document.getElementsByTagName("a").getElementsByAttribute("href");

This is what I have currently; it searches with a RegEx, but it doesn't work very well:

matches_temp = result_content.match(/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’&quote]))/ig);

result_content holds that HTML source.

ZeRoberto
  • Not all A elements have a href attribute. Have you considered using a [*DOMparser*](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser) and using the [*links*](https://developer.mozilla.org/en-US/docs/Web/API/Document/links) property? – RobG Feb 25 '19 at 07:29

2 Answers

getElementsByTagName returns a collection (an HTMLCollection in current browsers) that has no method called getElementsByAttribute, and it is only available if you have DOM access in the first place.

Without a DOM (for example in Node.js):

const hrefRe = /href="(.*?)"/g;
const urlRe = /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’&quote]))/ig;

 
const stringFromDB = `<a href="http://000">000</a>
Something something <a href="http://001">001</a> something`

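// match() returns null when nothing matches, so fall back to an empty array
(stringFromDB.match(hrefRe) || []).forEach((href) => {
  const url = href.match(urlRe);
  if (url) console.log(url[0]);
});

// old-school equivalent:
// (stringFromDB.match(hrefRe) || []).forEach(function (href) { var url = href.match(urlRe); if (url) console.log(url[0]); });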
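A possible variation on the snippet above, as a minimal sketch: a capture group returns the attribute value directly, so the `href="` prefix never has to be stripped afterwards, and `matchAll` simply yields nothing when there is no match. The helper name `extractHrefs` and the sample string are made up for illustration. The pattern only handles quoted values, and it will also pick up things like `data-href="..."` or `href=` inside scripts and comments, which is the usual weakness of regex-based HTML handling.

```javascript
// Hypothetical helper: collect every quoted href value from an HTML string.
function extractHrefs(html) {
  // "href", optional whitespace around "=", then a quoted value.
  // Group 1 is the quote character, group 2 is the attribute value.
  const hrefRe = /\bhref\s*=\s*(["'])(.*?)\1/gi;
  const hrefs = [];
  // matchAll needs the g flag and yields nothing when there is no match,
  // so no null check is required here.
  for (const m of html.matchAll(hrefRe)) {
    hrefs.push(m[2]);
  }
  return hrefs;
}

const sampleHtml = `<a href="http://000">000</a>
Something something <a name="x">no link</a> <a href='http://001'>001</a>`;

console.log(extractHrefs(sampleHtml));
```

`String.prototype.matchAll` needs a reasonably recent runtime (Node.js 12 or later); on older ones a `while ((m = hrefRe.exec(html)) !== null)` loop does the same job.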

With DOM access (in a browser), the code below first creates a DOM snippet from the string and then selects only the anchors that actually have an href.

Note the use of getAttribute, which returns the attribute value as written rather than the URL as resolved by the browser.

With the regex, if you only want to match specific types of href:

const re = /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’&quote]))/ig;

const stringFromDB = `<a href="http://000">000</a>
<a href="http://001">001</a>`

let doc = document.createElement("div");
doc.innerHTML = stringFromDB

doc.querySelectorAll("a[href]").forEach(
  (x) => console.log(x.getAttribute("href").match(re)[0])
);
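If the long regex is only there to keep "real" web links, one alternative is to let the built-in `URL` constructor do the validation. This is a sketch of a different technique, not what the answer above does, and it assumes that "specific types" means absolute `http:` or `https:` URLs. The name `isHttpUrl` and the sample values are hypothetical.

```javascript
// Hypothetical filter: true only for absolute http(s) URLs.
function isHttpUrl(value) {
  try {
    // Without a base argument, the URL constructor throws on relative URLs.
    const url = new URL(value);
    return url.protocol === "http:" || url.protocol === "https:";
  } catch (err) {
    return false;
  }
}

const candidates = [
  "http://000",
  "https://example.com/a?b=1",
  "/relative/path",
  "mailto:someone@example.com",
  "#top",
];

// Keeps the two http(s) entries and drops the rest.
console.log(candidates.filter(isHttpUrl));
```

The same predicate can be applied to each `x.getAttribute("href")` value, and since `URL` is a global in both browsers and Node.js it does not depend on having a DOM.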

Without the regex

const stringFromDB = `<a href="http://000">000</a>
<a href="http://001">001</a>`

let doc = document.createElement("div");
doc.innerHTML = stringFromDB

doc.querySelectorAll("a[href]").forEach(
 (x) => console.log(x.getAttribute("href")) 
);
mplungjan
  • @RobG OP is not parsing HTML, he is matching HTML in an HREF! I am using the DOM to get the href – mplungjan Feb 25 '19 at 07:31
  • Ah.. I of course assume he creates a DOM snippet first – mplungjan Feb 25 '19 at 07:33
  • Sorry, but I'm struggling to understand how did you connect that RegEx I wanted to skip entirely into that matching? – ZeRoberto Feb 25 '19 at 08:11
  • I thought you wanted to only list the hrefs that matched your very long RegEx. If you do not need it, don't use it, just use `x.getAttribute("href")` – mplungjan Feb 25 '19 at 08:15
  • Thanks @mplungjan , but now I'm getting "document.createElement is not a function." error on compilation – ZeRoberto Feb 25 '19 at 08:25
  • "Compilation"? Are you running this in a browser or on node or something? – mplungjan Feb 25 '19 at 08:29
  • If on the server, you need https://www.npmjs.com/package/jsdom or similar – mplungjan Feb 25 '19 at 08:31
  • On a server. Is there any other way I could do this without using "querySelectorAll" – ZeRoberto Feb 25 '19 at 10:22
  • If there’s no DOM then you need your regex – mplungjan Feb 25 '19 at 10:58
  • How do I modify the regex so that before it matches anything it checks whether it starts with href=" (I can later substring that) [(/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’&quote]))/ig);] – ZeRoberto Feb 25 '19 at 11:24
  • Have a look at the last example – mplungjan Feb 25 '19 at 11:51
  • Thanks. I struggle to comprehend that unusual (in my mind) forEach syntax. I got this working in a RegEx tester: href=\"\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’&quote])) but when compiled it doesn't like the " character – ZeRoberto Feb 25 '19 at 12:05
  • `stringFromDB.match(hrefRe).forEach(function(href) { console.log(href.match(urlRe)[0] ) });` – mplungjan Feb 25 '19 at 12:07

Firstly, you shouldn't be using RegEx to parse HTML. This answer explains why.

Secondly, getElementsByAttribute is not a standard DOM method, so that chained call cannot work. Instead, use querySelectorAll with an attribute selector (here, anchors whose href contains "http"), and then map out the hrefs:

var hrefs = document.querySelectorAll("a[href*=http]");
var test = Array.prototype.slice.call(hrefs).map(e => e.href);
console.log(test);
<a href="http://example.com">Example</a>
<a href="http://example1.com">Example 1</a>
<a href="http://example2.com">Example 2</a>
<a href="http://example3.com">Example 3</a>
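Because the HTML in the question lives in a string rather than in the page, the same selector can be run against a document parsed from that string. The following is a minimal sketch assuming a browser-like environment: `listHrefs` is a hypothetical helper, and the parser constructor is passed in as a parameter (defaulting to the global `DOMParser`) so that the function returns `null` instead of throwing where no DOM exists, such as plain Node.js.

```javascript
// Hypothetical helper: list href values from an HTML string via DOMParser.
// DOMParserCtor defaults to the global DOMParser when one exists (browsers).
function listHrefs(html, DOMParserCtor = globalThis.DOMParser) {
  if (typeof DOMParserCtor !== "function") {
    // No DOM available (e.g. plain Node.js): signal that instead of guessing.
    return null;
  }
  const doc = new DOMParserCtor().parseFromString(html, "text/html");
  // getAttribute returns the value as written; the .href property would
  // return the URL resolved against the document's base instead.
  return Array.from(doc.querySelectorAll("a[href]"), (a) => a.getAttribute("href"));
}

const htmlFromDb = `<a href="http://example.com">Example</a>
<a name="anchor-only">No href</a>
<a href="/relative">Relative</a>`;

// In a browser this logs the two href values; without a DOM it logs null.
console.log(listHrefs(htmlFromDb));
```

Scripts in a document produced by `DOMParser` are not executed, which is one reason it is often preferred over assigning to `innerHTML` when the source is not fully trusted. On a server, a package such as jsdom would have to supply the parser.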
Jack Bashford
  • `use querySelectorAll on all elements with a href, and then map out the hrefs:` is what I do in my answer – mplungjan Feb 25 '19 at 08:13
  • The OP mentioned *getElementsByTagName* as an example, they aren't using it, so *getElementsByAttribute* is a red herring. They're thinking of parsing HTML, but as you note, that's not a good idea. Better to use *innerHTML* or, if the source is not 100% trustworthy, a *DOMParser* and then use DOM methods to get the links. – RobG Feb 25 '19 at 08:37