Accurately count number of links in PHP and Javascript

Question

I have a form that I am validating with JS on the front-end and PHP on the server side. What I need is a way to reliably count the number of links in an HTML string. The best way that I could think of was to count the closing tags. However, simply searching for this tag will not work, because the user could circumvent the validation by adding spaces, like so: </a >.

I am fairly new to regex and this is the pattern that I have been able to come up with so far:

<[ \n\t]*\/[ \n\t]*a[ \n\t]*>

In Javascript:

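function link_count(s){
    // match() takes only the pattern, and returns null when nothing matches
    var matches = s.match(/<[ \n\t]*\/[ \n\t]*a[ \n\t]*>/g);
    return matches ? matches.length : 0;
}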

In PHP:

function count_links($str){
    // Explicit / delimiters: without them PCRE treats the outer < > as the delimiters
    return preg_match_all('/<[ \n\t]*\/[ \n\t]*a[ \n\t]*>/', $str, $matches);
}

Is this the best approach? Will it affect the performance of my form (the html string could be very long)? I am looking for the most efficient and reliable solution.

Thanks in advance.

A sample of the string you are trying to regex would be very useful — RiggsFolly, Jul 25 '14 at 15:32
regex is not a good way to parse html, but it will probably work for what you're doing. that said, not all `a` tags are links - maybe try looking for the strings `href="` and `href='` — user428517, Jul 25 '14 at 15:33
regexp is fine for some/most html: it's recursive repeating tags that blow its mind. since <a> doesn't nest, regexp would be fine. you should be able to search for /<\/a\b/, because anything else after that won't be an anchor tag — dandavis, Jul 25 '14 at 15:47
I feel this MUST be linked...http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — MasNotsram, Jul 25 '14 at 15:55
i purposely don't link that on every regex html question. regex will work just fine in many simple cases where people want to parse html (such as this one) - i do it all the time for quick tasks like this. there's nothing wrong with it, as long as you know the limitations. — user428517, Jul 25 '14 at 15:56
Thanks for all your comments. Like I said, I am pretty new to regex so I don't know all its pros and cons yet. — Hasan Akhtar, Jul 25 '14 at 18:52
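
The closing-tag suggestion from the comments above can be sketched roughly as follows. This is only an illustration: the function name countClosingAnchors is made up here, and the i flag is an added assumption so that upper-case tags are counted too.

```javascript
// Sketch only: countClosingAnchors is a hypothetical helper name.
function countClosingAnchors(html) {
    // \b keeps the pattern from also matching tags such as </abbr> or </address>;
    // the i flag accepts </A>, the g flag collects every occurrence.
    var matches = html.match(/<\/a\b/gi);
    // match() returns null when there are no matches at all
    return matches ? matches.length : 0;
}
```

Because the pattern stops right after the tag name, trailing whitespace as in </a > does not need special handling. It still counts every closing a tag, whether or not the element had an href.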

lcoderre · Answer 1 · 2014-07-25T16:01:06.840

0

So, like @sgroves said, not every </a> belongs to a link. Checking for href might be more useful.
Also, why not check the opening tag directly? I tried searching for <a .... href>

You might use the 's' modifier so that . also matches newlines...

/<\s*\ba\b.*?href/gs

http://regex101.com/r/bG8lN1/3
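
As a rough sketch of the same idea in JavaScript, here is one way to count opening a tags that carry an href. Note the assumptions: countHrefAnchors is a made-up name, and the pattern is a variation on the one above that uses [^>]* instead of .*? so that a match cannot run from one tag into the next.

```javascript
// Sketch only: countHrefAnchors is a hypothetical helper name.
function countHrefAnchors(html) {
    // <a\b       : an opening "a" tag, but not <abbr>, <area>, ...
    // [^>]*      : stay inside this one tag (newlines included, so no s flag is needed)
    // \bhref\s*= : require an href attribute
    var matches = html.match(/<a\b[^>]*\bhref\s*=/gi);
    // match() returns null when there are no matches at all
    return matches ? matches.length : 0;
}
```

This is still only an approximation: an attribute value containing > before the href, or an attribute such as data-href, would throw the count off.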

edited Jul 25 '14 at 16:01

answered Jul 25 '14 at 15:52

lcoderre


`[\w\W\s]` seems... odd. "Anything that is a word character or not a word character or a space"... – Niet the Dark Absol Jul 25 '14 at 15:55
Right, I didn't realize I ended up with that. At this point, with the ungreedy operator, we could change that for `.*`. I will update. – lcoderre Jul 25 '14 at 15:57
Does \s take care of all the white space characters? When I press enter after ' – Hasan Akhtar Jul 25 '14 at 19:23
