How to check if my website is being accessed by a crawler?

Question

How can I check whether a certain page is being accessed by a crawler or by a script that fires continuous requests? I need to make sure that the site is only being accessed from a web browser. Thanks.

"make sure that the site is only being accessed from a web browser" could translate to "make sure that the site is only being accessed by a human". This Turing test (http://en.wikipedia.org/wiki/Turing_test) seemed almost impossible to solve, but nowadays you can call IBM (http://en.wikipedia.org/wiki/Watson_(artificial_intelligence_software)). – rene Feb 27 '11 at 19:09

score 2 · Answer 1 · edited May 23 '17 at 11:55

This question is a great place to start: Detecting 'stealth' web-crawlers

Original post:

It would take some work to engineer a solution.

I can think of three things to look for right off the bat:

One, the user agent. If the spider is Google, Bing or any other legitimate crawler, it will identify itself.

Two, if the spider is malicious, it will most likely emulate the headers of a normal browser. Fingerprint it: if it claims to be IE, use JavaScript to check for an ActiveX object.

Three, take note of what it's accessing and how regularly. If the content takes the average human X seconds to view, you can use that as a starting point for deciding whether it's humanly possible to consume the data that fast. This is tricky: you'll most likely have to rely on cookies, because an IP address can be shared by multiple users.
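
The first and third checks can be sketched together. This is only a rough illustration in JavaScript, not a complete defence: the token list in `DECLARED_BOT_PATTERN`, the names `isDeclaredBot` and `RequestRateTracker`, and any thresholds are assumptions chosen for the example. The client id is assumed to come from a cookie, as suggested above.

```javascript
// Hypothetical, incomplete list of tokens that well-behaved crawlers
// put in their User-Agent header.
const DECLARED_BOT_PATTERN = /googlebot|bingbot|slurp|duckduckbot|baiduspider|yandexbot/i;

// True when the User-Agent header openly identifies a known crawler.
// A malicious bot can spoof this header, so a false result proves nothing.
function isDeclaredBot(userAgent) {
  return DECLARED_BOT_PATTERN.test(userAgent || "");
}

// Sliding-window request counter keyed by a client id (e.g. a cookie value).
class RequestRateTracker {
  constructor(maxRequests, windowMs) {
    this.maxRequests = maxRequests; // requests allowed per window
    this.windowMs = windowMs;       // window length in milliseconds
    this.hits = new Map();          // clientId -> timestamps of recent requests
  }

  // Records one request; returns true if the client is now over the limit.
  record(clientId, now = Date.now()) {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(clientId) || []).filter((t) => t > cutoff);
    recent.push(now);
    this.hits.set(clientId, recent);
    return recent.length > this.maxRequests;
  }
}
```

For instance, `new RequestRateTracker(20, 10000)` would flag a client that sends more than 20 requests within 10 seconds. A sensible limit depends on how long a human needs to read a page on your site.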

score 1 · Answer 2 · answered Feb 27 '11 at 18:57


You can use the robots.txt file to block access to crawlers, or you can use JavaScript to detect the browser agent and switch based on that. If I understood correctly, the first option is more appropriate, so:

User-agent: *
Disallow: /

Save that as robots.txt at the site root, and no automated system should check your site.

Venatu

Presumably any code that spoofed the user agent would be able to bypass this? – Jon Egerton Feb 27 '11 at 18:58

Note that there is absolutely _no_ guarantee that spiders/bots listen to the `robots.txt`. – Bart Kiers Feb 27 '11 at 18:58
I believe most crawlers don't process JavaScript so any solution using JavaScript (e.g. Google Analytics) will not track crawlers. – geaw35 Feb 27 '11 at 19:01
Thank you, but I guess if I create a loop from 0 to 1000 and initiate a request to some website inside the loop, that will be a problem. This is what I would like to avoid. – Ali Tarhini Feb 27 '11 at 19:05
I was unaware that crawlers mostly did not process JavaScript. Thanks for the heads-up! As for them ignoring the robots file, most mainstream crawlers should observe it, but you're right, it's not foolproof. Spoofing the user agent would not affect this, as it is up to the crawler itself whether it respects the file or not. – Venatu Feb 27 '11 at 19:10

score 1 · Answer 3 · answered Sep 12 '11 at 18:34

I had a similar issue in my web application: I created some bulky data in the database for each user that browsed the site, and crawlers were causing loads of useless data to be created. However, I didn't want to deny access to crawlers because I wanted my site indexed and found; I just wanted to avoid creating useless data and reduce the time taken to crawl.

I solved the problem in the following ways:

First, I used the HttpBrowserCapabilities.Crawler property from the .NET Framework (since 2.0), which indicates whether the browser is a search engine Web crawler. You can access it from anywhere in the code:
- ASP.NET C# code behind:
```
bool isCrawler = HttpContext.Current.Request.Browser.Crawler;
```
- ASP.NET HTML:
```
Is crawler? = <%=HttpContext.Current.Request.Browser.Crawler %>
```
- ASP.NET JavaScript:
```
<script type="text/javascript">  
var isCrawler = <%=HttpContext.Current.Request.Browser.Crawler.ToString().ToLower() %>;
</script>
```
The problem with this approach is that it is not 100% reliable against unidentified or masked crawlers, but it may be useful in your case.

After that, I had to find a way to distinguish between automated robots (crawlers, screen scrapers, etc.) and humans, and I realised that the solution required some kind of interactivity, such as clicking on a button. Some crawlers do process JavaScript, and they would obviously trigger the onclick event of a button element, but they are less likely to do so for a non-interactive element such as a div. The following is the HTML / JavaScript code I used in my web application www.so-much-to-do.com to implement this feature:
```
<div  
class="all rndCorner"  
style="cursor:pointer;border:3px groove;text-align:center;font-size:medium;font-weight:bold"
onclick="$TodoApp.$AddSampleTree()">  
Please click here to create your own set of sample tasks to do  
</div>
```
This approach has been working impeccably until now, although crawlers could be changed to be even more clever, maybe after reading this article :D
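
The same idea can also be wired up from script instead of an inline onclick attribute. The following is only a sketch under assumptions: `runOnTrustedClick` is a hypothetical helper, not code from the application above. It relies on the standard `Event.isTrusted` property, which is false for events dispatched from page script, and it runs the expensive action at most once. Automation tools that drive a real browser can still produce trusted clicks, so this raises the bar rather than guaranteeing a human.

```javascript
// Hypothetical helper: run an expensive action once, and only in response
// to a click that the browser marks as trusted (not dispatched by script).
function runOnTrustedClick(element, action) {
  function handler(event) {
    // Ignore synthetic clicks created and dispatched from page script.
    if (!event.isTrusted) return;
    // Detach first so the action cannot run a second time.
    element.removeEventListener("click", handler);
    action(event);
  }
  element.addEventListener("click", handler);
}

// Possible usage in a page (the element id is an assumption):
// runOnTrustedClick(document.getElementById("sample-tasks"), function () {
//   $TodoApp.$AddSampleTree();
// });
```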
