
I currently have a mess of an Apache .htaccess file that (attempts to) ensure that every page on my site has a single canonical URL, and that URLs do not reflect the underlying technology or whether the page is a file or a directory. For example:

  • recipes/frigidaire.html has the canonical URL of example.com/recipes/frigidaire/; requests for recipes/frigidaire.html, recipes/frigidaire.php, and recipes/frigidaire/index.html (for example) all redirect to the canonical URL.
  • recipes/index.php has the canonical URL of example.com/recipes/ and requests for recipes/index.php, recipes/index.html, etc., all redirect to the canonical URL.

This means that if I later change frigidaire.html to frigidaire.php, for example, the URL does not change. Or, if I add some subpages to frigidaire, so that it goes from being a file to a directory, its URL also does not change.

Any request for a page, if it is not already the canonical URL, is redirected to the canonical URL.

Specifically, I do not want the page to be served at every URL that could plausibly refer to it. I want the page to be served only in response to a request for the canonical URL; other requests that are likely meant for that page should be redirected to the canonical URL.
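
As a rough illustration of the mapping described above, here is a small Python sketch of which URL a request should end up at. It is only a model of the intended behaviour, not part of the Apache configuration; the name `canonical_path` and the short extension list are assumptions made for this example, and the real rule set covers many more extensions.

```python
import posixpath

# Assumed subset of the extensions the real rule set strips; not the full list.
KNOWN_EXTENSIONS = {".html", ".xhtml", ".shtml", ".php", ".txt"}
# Assumed name of the directory index file, without its extension.
INDEX_NAMES = {"index"}


def canonical_path(path: str) -> str:
    """Map a request path to its extensionless, slash-terminated form."""
    # Drop trailing slashes so the last path segment can be inspected.
    head, tail = posixpath.split(path.rstrip("/"))
    stem, ext = posixpath.splitext(tail)
    if ext.lower() in KNOWN_EXTENSIONS:
        tail = stem  # frigidaire.html -> frigidaire
    if tail in INDEX_NAMES:
        tail = ""  # recipes/index -> recipes
    joined = posixpath.join(head, tail) if tail else head
    return joined.rstrip("/") + "/"


for request in ("/recipes/frigidaire.html",
                "/recipes/frigidaire/index.html",
                "/recipes/index.php"):
    print(request, "->", canonical_path(request))
```

In this model a request is already canonical when `canonical_path(path) == path`; anything else would receive a redirect to the returned value.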

This is an example of what is in the .htaccess file at the top level of most of my domains. I’ve cut it down considerably; the real file covers many more cases. For example, there are a lot more file extensions in the real set of rewrites, and the complete set handles issues with the content-disposition of some types of files.

#If the request is for an existing directory, we're done
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.*)$      -       [L]

#Apache lets dynamic files accept slashes at the end
#redirect dynamic file endings to their "directory" equivalent
RewriteCond %{ENV:REDIRECT_STATUS}      ^$
RewriteRule ^(.*)/index\.php/?$    {% page SELF -relative %}$1/   [R=301,L]
RewriteCond %{ENV:REDIRECT_STATUS}      ^$
RewriteRule ^(.*)\.php/?$    {% page SELF -relative %}$1/   [R=301,L]
RewriteCond %{ENV:REDIRECT_STATUS}      ^$
RewriteRule ^(.*)/index/?$    {% page SELF -relative %}$1/   [R=301,L]

#redirect known file endings to their "directory" equivalent
RewriteCond %{ENV:REDIRECT_STATUS}      ^$
RewriteRule ^(.*)\.(au|bin|cpt|doc|dot|dvi|eps|exe|gif|html|jpe?g|manifest|php|pdf|png|ps|ps\.Z|ps\.z|ps\.gz|rss|rtf|rtf\.gz|shtml|sit|sit\.hqx|tar|tar\.gz|tar\.Z|txt|TXT|xhtml|zip\.uu)$        {% page SELF -relative %}$1/    [R=301,L]
#need to put .Z on its own, or it takes precedence over ps.Z, etc.
RewriteCond %{ENV:REDIRECT_STATUS}      ^$
RewriteRule ^(.*)\.(hqx|Z|zip)$        {% page SELF -relative %}$1/    [R=301,L]

#if it has no slash after it, add a slash and start over
#we should not have any .html, etc., at this point because we redirected them above
RewriteCond %{ENV:REDIRECT_STATUS}      ^$
#if there is a query string, we can't do this, because the slash will be appended after the query string, not after the filepath
RewriteCond %{QUERY_STRING}      ^$
RewriteCond %{THE_REQUEST}      ({% page SELF -relative %}[^\ ]+)
RewriteRule ^(.*[^/])$  %1/ [R=301,L,NE]

#Get the base path without the slash
RewriteRule ^(.*)/$     $1      [E=NEWCHECK:$1]

#If there is a .html file, use it
RewriteCond {% page SELF -fullpath %}%{ENV:NEWCHECK}.html  -f
RewriteCond %{THE_REQUEST}      ({% page SELF -relative %}[^\ ]+)/
RewriteRule ^(.*)/$     %1.html [L,NE]

#If there is an .xhtml file, use it
RewriteCond {% page SELF -fullpath %}%{ENV:NEWCHECK}.xhtml -f
RewriteRule ^(.*)/$     $1.xhtml [L]

#If there is a .php file, use it
RewriteCond {% page SELF -fullpath %}%{ENV:NEWCHECK}.php -f
RewriteRule ^(.*)/$     $1.php [L]

#If there is a .txt file, use it
RewriteCond {% page SELF -fullpath %}%{ENV:NEWCHECK}.txt -f
RewriteCond %{THE_REQUEST}      ({% page SELF -relative %}[^\ ]+)/
RewriteRule ^(.*)/$     %1.txt [L,NE]
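
The lookup order in the last four rules (.html, then .xhtml, then .php, then .txt) can be modelled outside Apache, which may help when checking edge cases against a copy of the site. This is a sketch only; `resolve` and `LOOKUP_ORDER` are hypothetical names, and it assumes an existing directory is left to the server's normal index handling, as in the first rule.

```python
from pathlib import Path
from typing import Optional

# Lookup order taken from the rules above: .html, .xhtml, .php, .txt.
LOOKUP_ORDER = (".html", ".xhtml", ".php", ".txt")


def resolve(doc_root: Path, canonical: str) -> Optional[Path]:
    """Return the path that should serve a slash-terminated URL, or None."""
    target = doc_root / canonical.strip("/")
    # An existing directory wins, mirroring the "-d" rule at the top.
    if target.is_dir():
        return target
    # Otherwise try each extension in order and stop at the first file found.
    for ext in LOOKUP_ORDER:
        candidate = target.with_name(target.name + ext)
        if candidate.is_file():
            return candidate
    return None
```

Calling `resolve(Path("/var/www/example"), "/recipes/frigidaire/")` (the document root here is made up) would return the first matching file, or `None` where the server should answer with a 404.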

There are a lot of similar questions with solutions out there, all using rewrite rules as I am. Is there an easier way of doing this in Apache without using rewrites? This is a long and tortuous set of rewrites, and I keep thinking that there ought to be a simple .htaccess setting for this. It seems to be a common enough request.

I worry that this kind of rewriting on every request is unnecessarily taxing on the server. I also know that it is very complex, and there remain edge cases that will fail to produce the results I want.

But I can’t find anything like

Options +FolderizeAllURLs +Redirect

I have seen suggestions to use MultiViews.

Options +MultiViews

But this does not appear to handle redirecting to the canonical URL. It also appears to allow for all sorts of weird URLs. For example, recipes/News/2023.php will be displayed with a request for:

  • example.com/recipes/News/2023
  • example.com/recipes/News/2023/index.php
  • example.com/recipes/News/2023/four/score/and/seven/years/ago

This allows a never-ending expansion of erroneous URLs pointing to this one page as errors accumulate, which is pretty much the exact opposite of what I’m looking for:

  1. one canonical URL will work;
  2. common ways of referencing a URL will redirect to that one canonical URL;
  3. actual incorrect URLs will fail in the normal manner (usually a 404).
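
The three outcomes above can be written down as a small decision function, sketched here in Python purely to pin down the expected behaviour. The `CANONICAL_URLS` set, the `ALIAS_SUFFIX` pattern, and the `respond` name are all assumptions for this example; the pattern covers only a few extensions.

```python
import re
from typing import Optional, Tuple

# Hypothetical site map: the canonical URLs that actually exist.
CANONICAL_URLS = {"/recipes/", "/recipes/frigidaire/", "/recipes/News/2023/"}

# Assumed alias endings: optional "/index", optional known extension,
# optional trailing slash, anchored at the end of the path.
ALIAS_SUFFIX = re.compile(r"(?:/index)?(?:\.(?:html|xhtml|php|txt))?/?$")


def respond(path: str) -> Tuple[int, Optional[str]]:
    """Return (status, redirect target) for a request path."""
    # 1. The canonical URL itself is served.
    if path in CANONICAL_URLS:
        return 200, None
    # 2. A common alias of an existing page redirects to its canonical URL.
    candidate = ALIAS_SUFFIX.sub("", path, count=1) + "/"
    if candidate in CANONICAL_URLS:
        return 301, candidate
    # 3. Everything else fails in the normal manner.
    return 404, None
```

Under these assumptions, `/recipes/News/2023` and `/recipes/News/2023/index.php` both redirect to `/recipes/News/2023/`, while `/recipes/News/2023/four/score/and/seven/years/ago` gets a 404 instead of displaying the page.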

Is there a more reliable and/or less server-intensive way of doing this?

Jerry Stratton
