0

I have a problem with someone (using many IP addresses) browsing all over my shop using:

example.com/catalog/category/view/id/$i

I have URL rewrite turned on, so the usual human browsing looks "friendly":

example.com/category_name.html

So the question is: how can I prevent the shop from being browsed through the "old" (non-rewritten) URLs, leaving only the "friendly" URLs allowed?

This is pretty important, since the crawler is using hundreds of threads, which makes the shop really slow.

Cleankod

3 Answers

1

Since there are many random IP addresses, you clearly can't just block access from a single address or a small group of addresses. You may need to implement some logging that identifies this crawler uniquely (maybe by user agent, or possibly with some clever use of the Modernizr JavaScript library).

Once you've been able to distinguish some unique identifiers of this crawler, you could probably use a rule in .htaccess (if it's a user agent thing) to redirect or otherwise prevent them from consuming your server's oomph.

This SO question provides details on rules for user agents.

Block all bots/crawlers/spiders for a special directory with htaccess
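As a rough illustration of the user-agent approach, here is a minimal sketch assuming Apache with mod_rewrite enabled. "BadCrawler" is a hypothetical placeholder, not something taken from the question; it would have to be replaced with whatever string your logs actually show for this crawler.

```apache
# Place this above any existing rewrite rules in the shop's root .htaccess.
RewriteEngine On

# "BadCrawler" is a hypothetical user-agent fragment; substitute the one
# found in your own access logs. [NC] makes the match case-insensitive.
RewriteCond %{HTTP_USER_AGENT} BadCrawler [NC]

# Leave the URL unchanged ("-") and answer with 403 Forbidden ([F]).
RewriteRule ^ - [F]
```

This only helps if the crawler sends a distinctive user agent. If it imitates a normal browser, the rule will either miss it or catch real visitors.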

pspahn
  • Depending on how @Spyro manages his ecommerce company, blocking all bots/crawlers could be a bad idea ... We have crawlers from our vendors that crawl our site to check for many things, from page availability (no 404's) and correct stock status etc ... If you are going to sell 2500 of their products, they *should* be able to check on them. Blocking traffic like that is semi dangerous as it's difficult to weed out the "friendlies" ... – Zak Feb 26 '13 at 20:28
  • 2
    Of course, which is why I suggested implementing some logs that check for a unique identifier of this particular crawler. If this is a legitimate crawler, it might have some benefits, but if it is slowing his site down to the point where it is not usable by actual people, then that crawler should be blocked because of its intrusiveness. If the crawler is illegitimate, and it has a unique identifier, then it needs to be shown the door. – pspahn Feb 26 '13 at 20:37
1

If the spider crawls all the urls of the given pattern:

example.com/catalog/category/view/id/$i

then you can just block these URLs in .htaccess. The rewrite from category.html to /catalog/category/view/id/$i is done internally, so human visitors never request the old URLs directly and you only block the bots.
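A minimal sketch of such a rule, assuming Apache with mod_rewrite and that it is placed in the shop's root .htaccess above the rules that pass requests to the front controller. It matches on THE_REQUEST, the raw request line sent by the client, so it only fires when a visitor literally asks for the old URL and not when a friendly URL is resolved internally. The optional index.php/ prefix is an assumption about how these URLs might also be requested.

```apache
RewriteEngine On

# THE_REQUEST looks like: GET /catalog/category/view/id/5 HTTP/1.1
# It reflects what the client sent and is not changed by internal rewrites.
RewriteCond %{THE_REQUEST} \s/+(index\.php/)?catalog/category/view/id/ [NC]

# Leave the URL unchanged ("-") and answer with 403 Forbidden ([F]).
RewriteRule ^ - [F]
```

Before relying on this, it is worth checking that no module or template in the shop links to the old-style URLs, since those links would start returning 403 as well.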

Fabian Blechschmidt
  • I already tried that, but what rule would block the URL when the directory does not exist and another rule is rewriting those requests to actual invocations of the front controller? – Cleankod Feb 27 '13 at 06:33
  • If you don't have badly written modules, every module should use the category.html link for a redirect (301, 302). So you can just block this URL, I think. You can try something like this: deny all – Fabian Blechschmidt Feb 27 '13 at 08:34
0

Once the rewrites are there ... They are there. They are stored in the Mage database for many reasons. One is crawlers like the one crawling your site. Another is users that might have the old page bookmarked. There are a number of methods individuals have come up with to go through and clean up your redirects (Google) ... But as it stands, in Magento, once they are there, they are not easily managed using Magento.

I might suggest generating a new site map and submitting it to the crawler affecting your site. Not only is this crawler going to be crawling tons of pages it doesn't need to, it's going to see duplicate content (bad ju ju).

Zak
  • I've checked the IPs and most of them belong to the OVH hosting company in France. Now correct me if I'm wrong, but I doubt that legitimate crawlers run on rented hosting servers... ;) – Cleankod Feb 27 '13 at 06:30
  • Moreover, my shop has had URL rewrite turned on from the very first day of its life. Therefore, I doubt that anyone has an "old" link in their bookmarks. As for the sitemap, I already have one, and Google is not the problem here. – Cleankod Feb 27 '13 at 06:32