
We have a small issue with IIS 8.

We have been trying to make a cfheader tag return a 503 status, but each time IIS replaces our output with its own plain text error page.

We have given up on trying to make that page look nice and have come up with a workaround, or at least part of one.

The idea is simply to return a 503 to bots and a clean, presentable page to humans.

Below is the code.

<cfif findNoCase("googlebot", cgi.HTTP_USER_AGENT)>
    <cfset today = dateFormat(now(), 'dd/mm/yy')&timeFormat(now(), 'HH:mm:ss')>
    <cfset urlString = "http://"&cgi.SERVER_NAME>
    <cfif len(trim(cgi.QUERY_STRING))>
      <cfset urlString = urlString&"?"&cgi.QUERY_STRING>
    </cfif>   
    <cfmail to="david.imrie@pistachiomedia.com.au" from="noreply@pistachiomedia.com.au" subject="Google Has Indexed the website #cgi.SERVER_NAME#">
      Google Detected @ #urlString#
    </cfmail>
    <!--- eventually alert the search engine --->
    <cfheader statuscode="503" statustext="Service Temporarily Unavailable"/>
    <cfheader name="retry-after" value="3600" />
<cfelse>

Beautiful page content here

</cfif>

What I'm wondering is: does anyone know of a UDF that detects a wider variety of search engines? I would like the site to notify me whenever a search engine is browsing it.

thanks

user125264
  • Webtrends, a web analytics package, currently has 592 "search engines" in its keyword.ini file. Maybe http://stackoverflow.com/questions/677419/how-to-detect-search-engine-bots-with-php can help you, even though it is not CFML – da_didi Nov 14 '13 at 12:24
  • _"I would like the site to notify me whenever a search engine is browsing it."_ - then you're soon going to be getting **a lot** of emails. Much better to use existing software like Webtrends/[Awstats](http://www.awstats.org/)/[Piwik](http://piwik.org/)/etc to see logged and aggregate data. – Peter Boughton Nov 14 '13 at 12:30
  • Why are you returning a 503 error when bots try to index your site? Are you not wanting the search engines to index your web site? There are other things you can do to prevent that. – Miguel-F Nov 14 '13 at 13:37
  • Wait, I just re-read this - the key thing is the 503; **what the OP actually wants is to display a temporary maintenance page**, but IIS is screwing with their ability to display a human-friendly page _and_ send it as status 503, so they're trying to workaround by using a 200 for humans and a 503 for bots, and now instead of asking for help with the real problem they're asking about the workaround. http://mywiki.wooledge.org/XyProblem – Peter Boughton Nov 15 '13 at 12:59
  • Thanks Peter Boughton, yes you have summed it up correctly. Sorry if my question was a bit misleading. – user125264 Nov 16 '13 at 01:43
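
For reference, the IIS behaviour identified in the comments (IIS substituting its own plain text page whenever the application returns a non-200 status) can usually be switched off in web.config. A minimal sketch, assuming IIS 7 or later and that the httpErrors section is not locked at the server level:

<!-- Hypothetical web.config fragment: existingResponse="PassThrough" tells IIS to
     send the application's own 503 body instead of its default error page. -->
<configuration>
  <system.webServer>
    <httpErrors existingResponse="PassThrough" />
  </system.webServer>
</configuration>

With that in place, the page ColdFusion renders alongside the 503 status should reach the browser, which addresses the original problem rather than the workaround.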

2 Answers


A few things here:

  1. I'm not sure why you would want to return a 503 error. The bot still takes up some of the same server resources.

  2. You should consider disabling session management (or at least minimizing session timeouts) for bots (see the sketch after this list).

  3. If you are trying to block bots, you should also be using robots.txt ( see http://www.robotstxt.org/ for good information about that).

    Very likely you are already using robots.txt, but that should be noted for anyone coming to this page later.
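
Regarding point 2, below is a minimal sketch of giving suspected bots a very short session so they do not accumulate server memory. It assumes a tag-based Application.cfm and that the UDFs shown further down have been saved in a (hypothetical) botDetection.cfm include:

<!--- Hypothetical Application.cfm fragment: suspected bots get a 2-second session,
      humans get the normal 20 minutes. Assumes isBot() from botDetection.cfm. --->
<cfinclude template="botDetection.cfm">
<cfif isBot()>
    <cfapplication name="MySite" sessionManagement="true"
        sessionTimeout="#createTimeSpan(0, 0, 0, 2)#">
<cfelse>
    <cfapplication name="MySite" sessionManagement="true"
        sessionTimeout="#createTimeSpan(0, 0, 20, 0)#">
</cfif>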

The UDFs below are based on Ben Nadel's work. The data in them should be kept up to date, though.

I might eventually do that following the pattern I used in my own SpamFilter.cfc. For now, though, the following pair of UDFs should get you started.

Note that my UDF treats CFSCHEDULE as a bot because I don't want to use sessions for it. If you want to block all bots, then you should remove that from the list.

<cffunction name="hasCFCookies" access="public" returntype="boolean">
    <cfreturn ( StructKeyExists(Cookie,"CFID") AND StructKeyExists(Cookie,"CFTOKEN") )>
</cffunction>
<!--- Expose the function via the request scope so it can be called from included templates and custom tags --->
<cfset request.hasCFCookies = hasCFCookies>

<cffunction name="isBot" access="public" returntype="boolean">
    <!---

    Based on code by Ben Nadel:
    http://www.bennadel.com/blog/154-ColdFusion-Session-Management-Revisited-User-vs-Spider-III.htm
    --->

    <cfset var UserAgent = "">

    <!--- If the user has cookies, this is at least a second request from a real user --->
    <cfif hasCFCookies()>
        <cfreturn false>
    </cfif>

    <!--- Real users have user-agent strings --->
    <cfset UserAgent = LCase( CGI.http_user_agent )>
    <cfif NOT Len(UserAgent)>
        <cfreturn true>
    </cfif>


    <!---
    High-probability checks
    If the user agent has bot or spider in it, it is a bot
    Some specific high-volume spiders listed individually
    --->
    <cfif
            REFind( "bot\b", UserAgent )
        OR  Find( "spider", UserAgent )
        OR  REFind( "search\b", UserAgent )
        OR  UserAgent EQ "CFSCHEDULE"
    >
        <cfreturn true>
    </cfif>

    <!---
    If we haven't yet tagged it as a bot and it is on Windows or Mac (including iOS devices), call it a real user.
    If this results in a few spiders showing as real users, that is OK.
    --->
    <cfif REFind( "\bwindows\b", UserAgent ) OR REFind( "\bmac", UserAgent )>
        <cfreturn false>
    </cfif>

    <!--- If we still don't know, only flag spiders that appear on a short known list --->
    <cfif
            REFind( "\brss", UserAgent )
        OR  Find( "slurp", UserAgent )
        OR  Find( "xenu", UserAgent )
        OR  Find( "mediapartners-google", UserAgent )
        OR  Find( "zyborg", UserAgent )
        OR  Find( "emonitor", UserAgent )
        OR  Find( "jeeves", UserAgent )
        OR  Find( "sbider", UserAgent )
        OR  Find( "findlinks", UserAgent )
        OR  Find( "yahooseeker", UserAgent )
        OR  Find( "mmcrawler", UserAgent )
        OR  Find( "jbrowser", UserAgent )
        OR  Find( "java", UserAgent )
        OR  Find( "pmafind", UserAgent )
        OR  Find( "blogbeat", UserAgent )
        OR  Find( "converacrawler", UserAgent )
        OR  Find( "ocelli", UserAgent )
        OR  Find( "labhoo", UserAgent )
        OR  Find( "validator", UserAgent )
        OR  Find( "sproose", UserAgent )
        OR  Find( "ia_archiver", UserAgent )
        OR  Find( "larbin", UserAgent )
        OR  Find( "psycheclone", UserAgent )
        OR  Find( "arachmo", UserAgent )
    >
        <cfreturn true>
    </cfif>

    <cfreturn false>
</cffunction>
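
As a rough usage sketch for the original question, the Googlebot-only check could be swapped for isBot(), keeping the email address and 503 headers from the question's own code (and bearing in mind Peter Boughton's warning about email volume):

<!--- Hypothetical usage of isBot(): 503 for anything flagged as a bot, not just Googlebot --->
<cfif isBot()>
    <cfmail to="david.imrie@pistachiomedia.com.au"
            from="noreply@pistachiomedia.com.au"
            subject="A bot has requested #cgi.SERVER_NAME#">
        Bot detected (#cgi.HTTP_USER_AGENT#) @ http://#cgi.SERVER_NAME##cgi.SCRIPT_NAME#
    </cfmail>
    <cfheader statuscode="503" statustext="Service Temporarily Unavailable" />
    <cfheader name="Retry-After" value="3600" />
    <cfabort>
<cfelse>
    <!--- Beautiful page content here --->
</cfif>
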
Steve Bryant

You can use BrowscapCFC. I use my own detection library to specifically identify good & bad spiders, but I use this for additional detection; it returns a "Crawler" parameter (as well as "isMobileDevice"). Using it allows you to identify (and optionally stop) the web request prior to creating any session variables.

http://browscapcfc.riaforge.org/

NOTE: Get the newest browscap.ini files by going to http://browscap.org/
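
A rough usage sketch follows; the component path, the init() argument and the getBrowser() method name are assumptions made for illustration (check the RIAForge project for the actual API), but "Crawler" is one of the values listed below:

<!--- Hypothetical call: the component name, init() argument and getBrowser() method
      are assumptions; see http://browscapcfc.riaforge.org/ for the real API. --->
<cfset browscap = createObject("component", "Browscap").init("browscap.ini")>
<cfset browserInfo = browscap.getBrowser(cgi.HTTP_USER_AGENT)>
<cfif browserInfo.Crawler>
    <!--- Flagged as a crawler before any session variables have been created --->
    <cfheader statuscode="503" statustext="Service Temporarily Unavailable" />
    <cfabort>
</cfif>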

Here are the other values that are returned:

  • ActiveXControls
  • Alpha
  • AolVersion
  • BackgroundSounds
  • Beta
  • Browser
  • Comment
  • Cookies
  • Crawler
  • CssVersion
  • Device_Maker
  • Device_Name
  • Frames
  • IFrames
  • JavaApplets
  • JavaScript
  • MajorVer
  • MinorVer
  • Platform
  • Platform_Description
  • Platform_Version
  • RenderingEngine_Description
  • RenderingEngine_Name
  • RenderingEngine_Version
  • Tables
  • VBScript
  • Version
  • Win16
  • Win32
  • Win64
  • isMobileDevice
  • isSyndicationReader
James Moberg