
On my website I have 1000 products, each with its own page, accessible at a URL like product.php?id=PRODUCT_ID.

Each of these pages contains a link of the form action.php?id=PRODUCT_ID&referer=CURRENT_PAGE_URL. So if I am visiting product.php?id=100, the link becomes action.php?id=100&referer=/product.php?id=100, and clicking it returns the user back to the referer.

Now, the problem I am facing is that I keep getting false hits from spiders. Is there any way to avoid these? I know I can disallow this URL in robots.txt, but there are still bots that ignore it. What would you recommend? Any ideas are welcome. Thanks

Kay

5 Answers


Currently, the easiest way of making a link inaccessible to 99% of robots (even those that choose to ignore robots.txt) is with JavaScript. Add some unobtrusive jQuery:

<script type="text/javascript">
$(document).ready(function() {
  // Copy each link's data-href attribute into its real href once the DOM is ready.
  $('a[data-href]').each(function() {
    $(this).attr('href', $(this).attr('data-href'));
  });
});
</script>

Then construct your links in the following fashion:

<a href="" rel="nofollow" data-href="action.php?id=PRODUCT_ID&referer=CURRENT_PAGE_URL">Click me!</a>

Because the href attribute is only written after the DOM is ready, robots won't find anything to follow.

cantlin
  • this solution requires JavaScript for the website to work, introduces invalid markup and will hurt your SEO very *very* badly. (Google needs those URLs as well) – Jacco Mar 25 '11 at 12:11
  • @Dae I think this is the only option left for me. @Jacco I understand, but as I already have a Disallow for action.php in robots.txt, I don't think it's going to hurt SEO. I agree on the invalid markup issue, but I think I can get away with that!! – Kay Mar 25 '11 at 12:16
  • @Jacco It is [valid HTML5](http://dev.w3.org/html5/spec/elements.html#embedding-custom-non-visible-data-with-the-data-attributes). I have no idea why you think it will hurt SEO, from the question Google certainly does not need these URLs. Javascript is not required for the website to work, it is required for this link to work. – cantlin Mar 25 '11 at 12:17
  • Search engines use the `href="/somewhere.html"` values to build their rankings and discover new pages on the internet. In your example, the `href=""` is empty, so all those important links to the different product pages are missing. It will stop (most) robots, including Google, from following the hidden URLs. Each page/product should have a *unique* URL; the proposed solution is the wrong fix for a bad design decision. – Jacco Mar 25 '11 at 12:22
  • My reasonable assumption from the question was that links to product.php handle _displaying content_, while links to action.php _act on content_. It is common practice to restrict crawler access to pages of the latter type. That said, any page that creates a database row on every unauthenticated page view is a denial of service waiting to happen; tracking this kind of inconsequential information in $_COOKIE is a much better idea. – cantlin Mar 25 '11 at 12:31

Your problem consists of two separate issues:

  1. multiple URLs lead to the same resource
  2. crawlers don't respect robots.txt

The second issue is hard to tackle; see Detecting 'stealth' web-crawlers.

The first one is easier. You seem to need an option to let the user go back to the previous page.

I'm not sure why you don't just let the browser's history take care of this (via the back button and JavaScript's history.back();), but there are enough valid reasons out there.

Why not use the referrer header?
Almost all common browsers send information about the referring page with every request. It can be spoofed, but for the majority of visitors this should be a working solution.
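
In PHP that could look something like this (a sketch; the /product-list.php fallback is an assumption, not something from the question):

<?php
// action.php -- sketch: send the user back to the page they came from.
// The '/product-list.php' fallback is an assumption for requests without a referrer.
$back = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '/product-list.php';

// ... record the user's interest here ...

// In production, verify $back points at your own host before redirecting.
header('Location: ' . $back);
exit;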

Why not use a cookie?
If you store the CURRENT_PAGE_URL in a cookie, you can keep a single unique URL for each page and still dynamically create breadcrumbs and back links based on the referrer stored in the cookie, instead of depending on the HTTP referrer header.
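
A rough sketch of the cookie variant (the cookie name 'last_page' and the one-hour lifetime are assumptions):

<?php
// In product.php -- remember the page the visitor is currently on.
setcookie('last_page', $_SERVER['REQUEST_URI'], time() + 3600, '/');

// In action.php -- read it back when building the return link.
$back = isset($_COOKIE['last_page']) ? $_COOKIE['last_page'] : '/product-list.php';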

Jacco
  • Actually, action.php stores users' interests in products. If I am not logged in, I can show interest in some products, and if I later log in or register, those interests get added to my user profile. If only humans visited this page without logging in, I wouldn't get large database tables, but the spiders visit multiple times every day, so the tables keep growing, which I want to avoid. – Kay Mar 25 '11 at 12:12
  • @Kay, seems like you should use the cookie option. Maybe empty the tables after a week or so. – Jacco Mar 25 '11 at 12:15
  • The cookie idea is good. Yes, I have just set up a cron job to empty the tables; in the last week alone there were more than 50,000 spider entries!!! – Kay Mar 25 '11 at 12:23

Another option is to use PHP to detect bots visiting your page.

You could use this PHP function to detect the bot (this gets most of them):

// Returns true when the User-Agent header matches common crawler keywords.
function bot_detected() {
  return (
    isset($_SERVER['HTTP_USER_AGENT'])
    && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
  );
}

Then echo the href links only when you find that the visitor is not a bot:

if (bot_detected() === false) {
  echo "http://example.com/yourpage";
}
jjj

You can use the robots.txt file to keep out bots that comply with it.
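
For example, assuming action.php sits at the site root as in the question:

User-agent: *
Disallow: /action.php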

Once robots.txt is configured, the next step is to check your server logs and look for any user agents that seem suspicious.

Say you find evil_webspider_crawling_everywhere as a user agent. You can check for it in the request headers and deny it access.
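
A minimal sketch of that check in PHP (the user-agent string is the made-up example from above):

<?php
// Sketch: deny access to a known bad crawler by its User-Agent string.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (stripos($ua, 'evil_webspider_crawling_everywhere') !== false) {
    header('HTTP/1.1 403 Forbidden');
    exit;
}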

Martin

I don't believe you can stop user agents that don't obey your advice.

Before going down this route I would really want to ascertain that bots/spiders are actually a problem; doing anything that prevents natural navigation of your site should be seen as a last resort.

If you really want to stop spiders, consider using JavaScript in your links so that navigation only happens after the link is clicked.

Personally I'm not fussed about spiders or bots.

Xhalent