
Is the following property reliable enough to identify search engine web crawlers?

Request.Browser.Crawler

My site creates a new guest user on page request if the visitor hasn't been to the site before, and I'm getting more hits than my analytics suggest - a lot more.

I use the snippet above to only create guest accounts for legitimate users, but I think some crawlers are getting through.

Perhaps I could use the HttpRequest UserAgent property to identify them. If so, can someone please suggest a list of current crawler names? I believe the Bing bot, for instance, is called bingbot, as mentioned here.

Request.UserAgent
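A fallback when `Request.Browser.Crawler` misses bots is a hand-maintained token check against `Request.UserAgent`. A minimal sketch - the bot tokens below are common examples I'm aware of, not an exhaustive or authoritative list, and would need to be kept up to date:

```csharp
using System;
using System.Linq;

public static class CrawlerDetector
{
    // Illustrative crawler tokens that appear in User-Agent strings.
    // This list is an assumption/example; maintain it yourself.
    private static readonly string[] BotTokens =
    {
        "googlebot", "bingbot", "slurp", "duckduckbot",
        "baiduspider", "yandexbot"
    };

    public static bool IsCrawler(string userAgent)
    {
        // Real browsers always send a User-Agent; treat a missing one as suspect.
        if (string.IsNullOrEmpty(userAgent))
            return true;

        string ua = userAgent.ToLowerInvariant();
        return BotTokens.Any(token => ua.Contains(token));
    }
}
```

You could then gate guest-account creation on something like `if (!CrawlerDetector.IsCrawler(Request.UserAgent)) { /* create guest */ }`.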

UPDATE:

I know for sure that they are not being identified by Request.Browser.Crawler, because requests from 65.52.110.143 - which I believe is bingbot - are a serial offender.

Christo
    This may be of interest to you: [Asp.net Request.Browser.Crawler - Dynamic Crawler List?](http://stackoverflow.com/questions/431765/asp-net-request-browser-crawler-dynamic-crawler-list) – Jonathon Reinhart Aug 03 '12 at 04:23
  • You might also want to take a look at [Detecting 'stealth' web-crawlers](http://stackoverflow.com/questions/233192/detecting-stealth-web-crawlers) There's no accepted answer, but there are a lot of great ideas. – Jason Kulatunga Aug 03 '12 at 04:26

1 Answer


Request.Browser.Crawler is sadly out of date.

You could manually add detection of other user agents as bots. Use the browser element rather than browserCaps, which has been deprecated since .NET 2.0.

Example:

<browsers>
    <browser id="Googlebot" parentID="Mozilla">
        <identification>
            <userAgent match="^Googlebot(\-Image)?/(?'version'(?'major'\d+)(?'minor'\.\d+)).*" />
        </identification>
        <capabilities>
            <capability name="crawler" value="true" />
        </capabilities>
    </browser>
    .
    .
    .
</browsers>

This must be saved with a .browser extension under the App_Browsers directory in your application.
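The `userAgent` match attribute takes a standard .NET regular expression, so you can sanity-check a pattern outside ASP.NET before deploying the .browser file. A quick sketch using the Googlebot pattern from the example above (the sample User-Agent strings are just illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

public static class BrowserRegexCheck
{
    // Same pattern as the <userAgent match="..."/> attribute above;
    // (?'name'...) is .NET's named-group syntax.
    private const string GooglebotPattern =
        @"^Googlebot(\-Image)?/(?'version'(?'major'\d+)(?'minor'\.\d+)).*";

    public static bool Matches(string userAgent)
    {
        return Regex.IsMatch(userAgent, GooglebotPattern);
    }
}
```

Note the pattern is anchored at the start of the string, so it only matches UAs that begin with `Googlebot/`, not ones that merely contain it.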

(List of Regexes to Match)

Anirudh Ramanathan
  • As I'm not familiar with this schema, can you provide an example? I'm guessing it would look something along the lines of: – Christo Aug 03 '12 at 04:56
  • Also, is this schema only configurable in the machine config? I believe adding it to web.config was deprecated at the same time. NB: I'm deploying to Azure, so this could be problematic. – Christo Aug 03 '12 at 05:02
  • Web.config was for browserCaps, which is deprecated. This can be saved in the `App_Browsers` directory with a `.browser` extension. See updated answer. You can use [this file](http://owenbrady.net/browsercaps/CodeProject.xml) for reference – Anirudh Ramanathan Aug 03 '12 at 05:11
  • If you want this for every website, you can add a *.browser file (same format as the example above) under c:\Windows\Microsoft.NET\Framework64\v4.0.30319\Config\Browsers. You can use parentID="Default" and avoid any Mozilla-specific settings. Once you create the file, you will need to run aspnet_regbrowsers.exe /i, which compiles a DLL and registers it in your GAC - now all websites on the machine will have identical crawler recognition. And I would guess your site would spin up faster, too. The downside is that this will cause all your application pools to reset. – Larry Dukek Aug 29 '12 at 19:07