
I'm trying to understand this code:

function extractLinks(input) {
    // input is an array of lines; join them so tags split across lines still match
    var html = input.join('\n');
    var regex = /<a\s+([^>]+\s+)?href\s*=\s*('([^']*)'|"([^"]*)"|([^\s>]+))[^>]*>/g;
    var match;
    while ((match = regex.exec(html)) !== null) {
        var hrefValue = match[3];
        if (hrefValue === undefined) {
            hrefValue = match[4];
        }
        if (hrefValue === undefined) {
            hrefValue = match[5];
        }
        console.log(hrefValue);
    }
}

This is a simple function that extracts all href values, but only the real ones: an href that appears as class="href", or outside an <a> tag, and so on, is not included. What puzzles me is that the regex I came up with for this task was (<a[\s\S]*?>), but when I couldn't get to a solution and looked at the original one, I found this very long regex. I tried the solution with my own regex, and it doesn't work.

Can someone please explain how to interpret this long regex? Also, match is an array. Let me see if I understand the idea of the while loop:

while ( match = the regex is present in the string) { something = match[3] / why 3???/ and then if undefined something = match[4], if undefined again something = match[5]; }

I really struggle to understand the mechanism behind all of this, as well as the logic of the regex.

The input is generated by a system that will pass in 10 different arrays of strings, but let's take the one I use for testing. The code below is passed as an array of strings, one element per line, and this array is the input argument of the function.

<!DOCTYPE html>
<html>
<head>
  <title>Hyperlinks</title>
  <link href="theme.css" rel="stylesheet" />
</head>
<body>
<ul><li><a   href="/"  id="home">Home</a></li><li><a
 class="selected" href=/courses>Courses</a>
</li><li><a href = 
'/forum' >Forum</a></li><li><a class="href"
onclick="go()" href= "#">Forum</a></li>
<li><a id="js" href =
"javascript:alert('hi yo')" class="new">click</a></li>
<li><a id='nakov' href =
http://www.nakov.com class='new'>nak</a></li></ul>
<a href="#empty"></a>
<a id="href">href='fake'<img src='http://abv.bg/i.gif' 
alt='abv'/></a><a href="#">&lt;a href='hello'&gt;</a>
<!-- This code is commented:
  <a href="#commented">commentex hyperlink</a> -->
</body>
Alex Kulinkovich
Sineastra
    [**ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes **](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – adeneo Nov 14 '14 at 23:25

1 Answer


To help you understand what this regex is doing, I have put inline comments on this page that you can review. I'm also copying them here:

<a\s+            # Look for '<a' followed by whitespace
([^>]+\s+)?      # Group 1 (optional): any attributes before 'href',
                 # such as 'class=' or 'id=', followed by whitespace
href\s*=\s*      # Locate 'href=' with optional whitespace around the '=' character
(                # Group 2: the whole value, including any quotes
  '([^']*)'      # Group 3: the text inside '...'
|                # ...or...
  "([^"]*)"      # Group 4: the text inside "..."
|                # ...or...
  ([^\s>]+)      # Group 5: an unquoted value, anything that is NOT '>' or whitespace
)
[^>]*>           # Match anything else up to the closing '>'

This is just to break it apart so you can see what each portion is doing. As for your question about match, I don't fully understand what you are asking.
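As for why the loop reads match[3], then match[4], then match[5]: capture groups are numbered by the position of their opening parenthesis, from left to right. Group 1 is the optional run of attributes before href, group 2 is the whole value including its quotes, and groups 3, 4 and 5 are the three alternatives (single-quoted, double-quoted, unquoted). Only one alternative takes part in any given match, so the other two come back as undefined. Here is a minimal sketch of that behaviour. The helper name `showGroups` and its property names are made up for illustration, and the pattern is written with a closing quote on the double-quoted branch.

```javascript
// Hypothetical helper (not from the original code): for every
// <a ... href=...> tag it records the groups the while loop inspects.
function showGroups(html) {
    var regex = /<a\s+([^>]+\s+)?href\s*=\s*('([^']*)'|"([^"]*)"|([^\s>]+))[^>]*>/g;
    var rows = [];
    var match;
    // With the g flag, exec() resumes from regex.lastIndex on each call
    // and returns null once there are no more matches.
    while ((match = regex.exec(html)) !== null) {
        rows.push({
            rawValue: match[2],      // value with its quotes, if any
            singleQuoted: match[3],  // set only for href='...'
            doubleQuoted: match[4],  // set only for href="..."
            unquoted: match[5]       // set only for href=value
        });
    }
    return rows;
}

var sample = "<a href='/forum' >Forum</a> " +
             "<a class=\"x\" href=\"#\">X</a> " +
             "<a href=/courses>Courses</a>";

showGroups(sample).forEach(function (row) {
    console.log(row.singleQuoted, row.doubleQuoted, row.unquoted);
});
// /forum undefined undefined
// undefined # undefined
// undefined undefined /courses
```

So the loop is not walking through three matches; for each single match it checks which of the three alternatives was the one that actually captured the value.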

OnlineCop
  • Well, thanks for the regex, I will have a look. The part about the while loop is: why do we take the third element of the match array, and if it is undefined go for the 4th, then the 5th? – Sineastra Nov 14 '14 at 23:41
  • I think what is happening here is that parts of the tag are being "captured" that don't need to be kept. [This one](http://regex101.com/r/qQ3nA9/3) has a few changes so that it ONLY captures the `href=` portion. In that case, you can see the replacement at the bottom of that page. – OnlineCop Nov 14 '14 at 23:49
  • You sir, have my gratitude. – Sineastra Nov 14 '14 at 23:54
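
The idea from the comments above, capturing only the href value and nothing else, can be sketched with non-capturing groups `(?:...)`. This is not the regex behind the regex101 link, only an illustration under that assumption; the function name `extractHrefs` is hypothetical, and it returns the values instead of logging them.

```javascript
// Hypothetical variant: groups that are only needed for structure are
// made non-capturing, so the three value alternatives become groups 1-3.
function extractHrefs(lines) {
    var html = lines.join('\n');
    var regex = /<a\s+(?:[^>]+\s+)?href\s*=\s*(?:'([^']*)'|"([^"]*)"|([^\s>]+))[^>]*>/g;
    var hrefs = [];
    var match;
    while ((match = regex.exec(html)) !== null) {
        // Exactly one of the three groups is defined for each match.
        var value = match[1];
        if (value === undefined) { value = match[2]; }
        if (value === undefined) { value = match[3]; }
        hrefs.push(value);
    }
    return hrefs;
}

var lines = [
    '<ul><li><a   href="/"  id="home">Home</a></li><li><a',
    ' class="selected" href=/courses>Courses</a>',
    '</li><li><a href = ',
    "'/forum' >Forum</a></li>"
];
console.log(extractHrefs(lines));
// [ '/', '/courses', '/forum' ]
```

The group indices shift from 3, 4, 5 to 1, 2, 3, but the fallback logic stays the same, because the three quoting styles still need three separate alternatives.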