
Apparently Bingbot is getting caught in an infinite loop on my site. It downloads pages like http://www.htmlcodetutorial.com/quicklist.html/applets/applets/applets/applets/applets/applets/applets/applets/applets/applets/applets/applets/applets/applets/sounds/forms/linking/frames/document/linking/images/_AREA_onMouseOver.html . Since I set my server to interpret .html as PHP the page is simply a copy of http://www.htmlcodetutorial.com/quicklist.html . How do I stop Bingbot from looking for these bogus copies?

Why is Bingbot looking for those pages to begin with?

I'd like to return a 404 for these URLs, as described in "Redirect to Apache built-in 404 page with mod_rewrite?", using a rule in place of the last line of the .htaccess file shown below. But when I try `RewriteRule ^.*\.html\/.*$ - [R=404]`, the entire site shows a 500 error.

Even with the last line as shown below, the request is redirected to http://www.htmlcodetutorial.com/home/htmlcode/public_html/help.html , which is not what I want.

AddType application/x-httpd-php .php .html

RewriteEngine on 
Options +FollowSymlinks

RewriteRule ^help\/.* help.html [L]

RewriteCond %{HTTP_HOST} ^example.com
RewriteRule (.*) http://www.htmlcodetutorial.com/$1 [R=301,L]

ErrorDocument 404 /404.html

RewriteRule ^.*\.html\/.*$ help.html [R=301]

P.S. I know the site is way out of date.

zylstra

2 Answers


Change your last rule to this:

RewriteRule ^(.+?\.html)/.+$ - [R=404,L,NC]
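
As a side note on the redirect to /home/htmlcode/public_html/help.html seen in the question: in a per-directory (.htaccess) context, a relative substitution combined with the R flag is typically prefixed with the directory's filesystem path unless RewriteBase is set. A minimal sketch, assuming help.html sits at the site root and that a redirect rather than a 404 is what you want:

```apache
# Assumption: help.html lives at the document root.
# The leading slash makes the target a URL-path, so the external
# redirect goes to /help.html instead of the filesystem path.
RewriteRule ^.+\.html/.+$ /help.html [R=301,L]
```

Setting `RewriteBase /` and keeping the relative target should have a similar effect.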
anubhava
  • Thanks that works. I'd like to serve a 404 code though, but R=404 causes a 500 error throughout the site. Any idea why that may be? Also, why is Bingbot looking for those pages to begin with? – zylstra Nov 27 '13 at 01:29
  • Have you verified that this works? Apache docs state that it does not. – zylstra Nov 29 '13 at 08:59
  • Of course I tested it thoroughly before posting here. I'd like to know which Apache document says it doesn't work. – anubhava Nov 29 '13 at 10:24
  • @zylstra: Another thing to keep in mind: with `R=404` you will not see a changed URL in your browser, since Apache just does an internal rewrite. But if you check the request in Firebug, you will see a proper 404 status coming back. – anubhava Nov 29 '13 at 10:34
  • For Apache 2.2 the doc is here, http://httpd.apache.org/docs/current/rewrite/flags.html#flag_r , "...if a status code is outside the redirect range (300-399) then the substitution string is dropped entirely..." For 2.0 the doc is here, http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html#rewriteflags , "If you want to use other response codes in the range 300-400, simply specify the appropriate number..." – zylstra Nov 30 '13 at 03:11
  • Yes, I have already read those sections and made my comment accordingly. See my previous comments about the status code. – anubhava Nov 30 '13 at 04:27

The problem here is that either you have MultiViews turned on, or Apache is treating requests like /quicklist.html/blah/blah as PATH_INFO-style requests, which it accepts as valid.

So turn off MultiViews by changing your Options line to:

Options +FollowSymlinks -Multiviews

Then replace your last rule with:

RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-f
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-d
RewriteRule ^ - [L,R=404]
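
For the PATH_INFO case described above, Apache also has a core directive, AcceptPathInfo, that controls whether trailing path segments after a real file are accepted. A minimal sketch, assuming the server is Apache 2.0.30 or later and that the host allows this directive in .htaccess (it requires AllowOverride FileInfo); whether it takes effect for .html files handled by PHP may depend on how PHP is installed:

```apache
# Assumption: AllowOverride FileInfo is in effect for this directory.
# Reject requests such as /quicklist.html/applets/applets/... that
# carry extra path segments after an existing file; Apache answers
# them with a 404 instead of serving quicklist.html again.
AcceptPathInfo Off
```

This does not rely on mod_rewrite, so it sidesteps the question of which versions accept `R=404`.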
Jon Lin
  • The `R=404` causes a 500 error throughout the site. Any idea why that may be? Also, why is Bingbot looking for those pages to begin with? – zylstra Nov 27 '13 at 01:26
  • @zylstra what version of Apache is your server running? `R=404` works fine for me. No idea why Bingbot is trying to request URLs with chained paths – Jon Lin Nov 27 '13 at 03:22
  • `Server built: Feb 28 2012 21:55:00 Cpanel::Easy::Apache v3.9.2 rev9999` I guess this is another question though. If you say 404 should work I'll try to debug that part of it. – zylstra Nov 27 '13 at 17:43
  • @zylstra that's the version of cPanel. There are [a number of ways to find the version of apache](http://stackoverflow.com/questions/166607/how-do-i-find-the-version-of-apache-running-without-access-to-the-command-line) but honestly, I'm not sure which versions support `R=404` and which don't. Apache 2.2 definitely does – Jon Lin Nov 27 '13 at 17:53
  • I must not have copied all three lines, only the last two. My Apache version is: Apache/2.0.64. However, the official Apache reference documents for both 2.0 and 2.2 state that "if a status code is outside the redirect range (300-399) then the substitution string is dropped entirely, and rewriting is stopped as if the L were used." So are you stating that you can see the page with a 200 response without the rewrite, a 301 response with a R=301 rewrite, and a code 404 with a R=404? – zylstra Nov 29 '13 at 08:59
  • @zylstra yes, with those rules in a blank htaccess file, and Apache 2.2, [it works for me](http://i.stack.imgur.com/rvNkB.png) – Jon Lin Nov 29 '13 at 22:07
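
Since the server in question turned out to be Apache/2.0.64, where the documentation quoted above describes R only for codes in the 300-400 range, a fallback that avoids `R=404` may be worth trying. A minimal sketch, assuming a 410 Gone (or 403 Forbidden) response is acceptable in place of a 404; the G and F flags are documented for mod_rewrite in 2.0:

```apache
# Answer nested bogus URLs like /quicklist.html/applets/... with
# 410 Gone; no R=404 is needed. Swap G for F to send 403 Forbidden.
RewriteRule ^.+\.html/.+$ - [G,L]
```

The pattern only matches paths that continue past a `.html` segment, so ordinary pages such as quicklist.html should be left alone.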