Detecting dofollow backlinks using regular expression

Question

The objective of this regular expression is to determine whether a web page contains backlink(s) to a given domain and whether all of them have a rel="nofollow" attribute on the a tag. The result is True if every such link has rel="nofollow", and False if any of them lacks it.

From any web page I want to check whether anything like this is present:

<a ... href="http://www.mysite.com/xyz...." ... >

Additionally, I want to detect whether any of the links found lacks the "rel=nofollow" attribute.

The domain www.mysite.com is known, and I want to check for it anywhere in the page, even inside comments.

I could do the above myself, but I can't think of an optimized way to do it using a single pattern.

One unoptimized way is to find all occurrences of a tags with href="mysite.com" and check whether even a single match lacks rel=nofollow.
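The two-step check described above could be sketched roughly as follows in Python. This is only an illustration under assumptions: the helper name `all_backlinks_nofollow` is hypothetical, `www.mysite.com` is the placeholder domain from the question, and accepting `https` as well as `http` is an added assumption. It scans the raw text, so links inside HTML comments are picked up too.

```python
import re

# Step 1: grab every opening <a ...> tag from the raw text (comments included).
A_TAG = re.compile(r"<a\s[^>]*>", re.IGNORECASE)
# Step 2a: does the tag link to the known domain? (https is an assumption)
HREF_TO_SITE = re.compile(
    r"""href\s*=\s*["']?https?://www\.mysite\.com(?![\w.-])""", re.IGNORECASE
)
# Step 2b: does the tag carry a rel value containing the nofollow token?
REL_NOFOLLOW = re.compile(r"""rel\s*=\s*["']?[^"'>]*\bnofollow\b""", re.IGNORECASE)


def all_backlinks_nofollow(html: str) -> bool:
    """Return True if every <a> tag linking to the domain has rel=nofollow.

    A page without any backlink also returns True (nothing to flag).
    """
    for tag in A_TAG.findall(html):
        if HREF_TO_SITE.search(tag) and not REL_NOFOLLOW.search(tag):
            return False  # at least one dofollow backlink
    return True
```

Splitting the work into small patterns keeps each one readable, at the cost of a loop instead of a single expression.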

Is there a smart, single-line regular expression pattern for this?

PS: I don't want to parse the DOM, since there is a risk of missing a backlink due to a parsing error, and Google's DOM parser could be different. I want human attention only on those pages whose links can cause a backlink penalty from search engines. If a link within a comment is flagged as a backlink and takes away some human attention, no problem. But links from, say, a porn site must be caught at any cost. Finally, I want to prepare a list of spam links that I can submit to Google Webmaster's Disavow tool. This exercise is a must for every webmaster about once a month for every site. And I can't afford this kind of paid service: www.linkdetox.com

_“Don't want to parse DOM since it's risky to miss a backlink due to parsing error and Google's DOM parser could be different.”_ – when you just do _string_ parsing, there is even more risk that you will not notice when the link is not really in the document, but only placed inside an HTML comment or something. (And Google is on to people who are just trying to build stupid link farms anyway, just recently they punished two big German “SEO” agencies for that … and since you are trying to verify that a backlink is set, it smells a lot like a “forced” or paid one …) — CBroe, Mar 19 '14 at 09:42
I do want to discount those URLs if a backlink to my site isn't there. You tell me: you have 1000 backlinks per site and a dozen sites. Should I manually go and check whether each one is spam? When only dofollow links can incur a penalty, it behooves us to check only such links. Mine is not a paid one. But now site owners are completely responsible for the quality of their backlinks. Also, if a backlink in a comment is reported, no problem; such links will be manually checked by me whether or not they are in comments. A link in a comment will only waste my time. Imagine a missed link from a porn site. — AgA, Mar 19 '14 at 10:19

score 2 · Answer 1 · edited May 23 '17 at 10:26

Usually, parsing HTML with regex is a bad idea (here's the famous reason why). You risk weird bugs, as regexes aren't able to fully parse HTML.

However, if your input is "safe" (i.e. not changing a lot, or you're prepared for weird errors), and to answer your question: once you're on the a tag, you can use something like this to catch links with the href you want and without rel="nofollow":

<a\s+(?![^>]*rel\s*=\s*(['"])\s*nofollow\s*\1)[^>]*href\s*=\s*(["'])http://www\.mysite\.com[][\w-.~:/%?#@!$&'()*+,;=]*\2[^>]*>
<a\s+                        # start of the a tag followed by at least one whitespace character
(?!                          # negative look-ahead: continue only if there isn't...
    [^>]*                    # anything except the tag's closing bracket
    rel\s*=\s*               # 'rel=', with spaces allowed
    (['"])                   # capture the opening quote
    \s*nofollow\s*           # nofollow
    \1                       # closing quote is the same as the captured opening one
)                            # end of negative look-ahead
[^>]*                        # anything but the tag's closing bracket
href\s*=\s*                  # 'href=', with spaces allowed
(["'])                       # capture the opening quote
http://www\.mysite\.com      # the fixed part of your url (dots escaped)
[][\w-.~:%/?#@!$&'()*+,;=]*   # url-allowed characters
\2                           # closing quote
[^>]*>                       # "checks" that the tag is ending

Demo: http://regex101.com/r/hC8lV9
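For readers who want to try the pattern outside regex101, here is a hedged Python sketch. It is an adaptation, not the answer's exact expression: the dots in the host are escaped, the URL character class is rewritten with an escaped hyphen (Python's `re` module rejects `\w-.` as a bad range), and case-insensitive matching is an added assumption. The function name `has_dofollow_backlink` is hypothetical.

```python
import re

# Adapted from the pattern above for Python's re module (see notes in the lead-in).
DOFOLLOW_LINK = re.compile(
    r"""<a\s+
        (?![^>]*rel\s*=\s*(['"])\s*nofollow\s*\1)   # tag must not contain rel='nofollow'
        [^>]*href\s*=\s*
        (["'])http://www\.mysite\.com               # fixed part of the URL
        [\w\-.~:/%?\#@!$&'()*+,;=\[\]]*             # URL-allowed characters
        \2                                          # same quote closes the href
        [^>]*>""",
    re.VERBOSE | re.IGNORECASE,  # IGNORECASE is an assumption, not in the answer
)


def has_dofollow_backlink(html: str) -> bool:
    """Return True if at least one link to the domain lacks rel='nofollow'."""
    return DOFOLLOW_LINK.search(html) is not None
```

Note that the look-ahead only recognizes a rel value that is exactly nofollow, so a tag with rel="external nofollow" is still reported as dofollow. For the use case in the question that errs on the side of extra manual review rather than a missed link.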

Disclaimer

This isn't meant to check whether your input is well-formed; it assumes that it is. It won't account for things like an escaped > or escaped quotes, and you will very probably need to adapt it to your needs. Basically, no regex will give a complete answer.
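To make the limitation concrete, here is a small Python sketch of one such blind spot: a literal > inside an attribute value that precedes href. The pattern is a deliberately simplified stand-in (an assumption for illustration, not the full expression above), but it relies on the same `[^>]*` scanning, so the same failure applies.

```python
import re

# Simplified stand-in: like the full pattern, it cannot skip past a '>' before href.
LINK = re.compile(r'<a\s+[^>]*href\s*=\s*["\']http://www\.mysite\.com[^"\'>]*["\']')

plain = '<a class="x" href="http://www.mysite.com/a">'
tricky = '<a title="a > b" href="http://www.mysite.com/a">'  # '>' inside title

found_plain = LINK.search(plain) is not None    # link is detected
found_tricky = LINK.search(tricky) is not None  # link is missed
```

A browser treats both tags as links to the same URL, yet only the first one is found by the regex.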

If you need to deal with varied input or with potentially malformed HTML, a parser will do a much safer and better job than regex.

However I'm putting this one here to give you an idea of what can be done on this subject, since in very strict and narrowly defined context regex can actually be a relevant solution.

score 1 · Answer 2 · answered Mar 19 '14 at 08:59


First of all, do not use regular expressions to parse the DOM of a web page. PHP has its own Document Object Model implementation, which does the whole job. Just have a look at http://de1.php.net/manual/en/class.domdocument.php and http://de1.php.net/manual/en/class.domxpath.php.
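As a rough illustration of the parser-based route, here is a sketch that swaps in Python's standard `html.parser` module for the PHP classes linked above; it is not the PHP API. The class and function names (`BacklinkFinder`, `dofollow_backlinks`) are hypothetical, and `www.mysite.com` is the placeholder domain from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def _host_of(url: str):
    """Return the host part of a URL, or None if it cannot be parsed."""
    try:
        return urlsplit(url).hostname
    except ValueError:
        return None


class BacklinkFinder(HTMLParser):
    """Collects hrefs of <a> tags that point to `host` without rel=nofollow."""

    def __init__(self, host: str):
        super().__init__()
        self.host = host
        self.dofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        if _host_of(href) != self.host:
            return
        # rel is a space-separated token list, e.g. "external nofollow"
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" not in rel_tokens:
            self.dofollow.append(href)


def dofollow_backlinks(html: str, host: str = "www.mysite.com"):
    finder = BacklinkFinder(host)
    finder.feed(html)
    finder.close()
    return finder.dofollow
```

Unlike a raw text scan, this approach does not report links that appear only inside HTML comments, which may or may not be what the asker wants.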

answered by Marcel

It does not matter whether the DOM is valid or not. DomDocument::loadHTMLFile() does not care whether the HTML is well formed. – Marcel Mar 19 '14 at 09:06
Well it does try to backtrack and correct in case of error. This backtrack & correction intelligence would definitely be different for a search engine. – AgA Mar 19 '14 at 09:08

score 1 · Answer 3 · Vasili Syrakis · 2014-03-25T22:23:43.383

Regular Expression

<a(?=[^>]*?rel=nofollow)(?=[^>]*?href="http:\/\/www\.mysite\.com\/.*?")[^>]*?>

How it works

It uses two positive lookaheads to check that the a tag contains both rel=nofollow and an href attribute pointing at http://www.mysite.com/.
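Because both lookaheads are positive, the expression matches links that do carry rel=nofollow, which is the opposite of a dofollow check. The Python sketch below contrasts it with a variant whose first lookahead is negated. The negated variant and its optional-quote handling are assumptions added for illustration, not part of the answer.

```python
import re

# The answer's expression: matches tags that HAVE unquoted rel=nofollow.
NOFOLLOW_LINK = re.compile(
    r'<a(?=[^>]*?rel=nofollow)(?=[^>]*?href="http://www\.mysite\.com/.*?")[^>]*?>'
)
# Hypothetical variant: first lookahead negated, optional quote before nofollow.
DOFOLLOW_LINK = re.compile(
    r'<a(?![^>]*?rel=["\']?nofollow)(?=[^>]*?href="http://www\.mysite\.com/.*?")[^>]*?>'
)

unquoted = '<a rel=nofollow href="http://www.mysite.com/x">'
quoted = '<a rel="nofollow" href="http://www.mysite.com/x">'
followed = '<a href="http://www.mysite.com/x">'

results = {
    "answer_on_unquoted": NOFOLLOW_LINK.search(unquoted) is not None,
    "answer_on_quoted": NOFOLLOW_LINK.search(quoted) is not None,
    "variant_on_followed": DOFOLLOW_LINK.search(followed) is not None,
    "variant_on_quoted": DOFOLLOW_LINK.search(quoted) is not None,
}
```

The answer's expression finds the unquoted form but not the quoted rel="nofollow", while the negated variant flags only the link without nofollow.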

Online demo: `http://regex101.com/r/pX0yF5`

edited Mar 25 '14 at 22:23

answered Mar 25 '14 at 09:50


Matches `` in ` rel=nofollow href="http://www.mysite.com/"`. You should try replacing your `.` with `[^>]` – Robin Mar 25 '14 at 22:16
The nofollow rel is between "" (rel="nofollow"). The correct answer is <a(?=[^>]*?rel="nofollow")(?=[^>]*?href="http:\/\/www\.mysite\.com\/.*?")[^>]*?> – Iñaki Soria Mar 31 '14 at 21:50

score -1 · Answer 4 · answered Jul 29 '17 at 14:23

If you’ve been doing any kind of reading about link building, then you’ve probably seen people mentioning nofollow and dofollow links. These are very important terms to understand when you are trying to build great links back to your site in order to increase your search engine rankings. But, to the person who is new to all of this, it may be kind of confusing. I am going to help break it down for you.

To tell the spiders to crawl a link, you don’t have to do anything. Simply using the format shown above, the search engine spiders will crawl the link provided.
