
I have already rewritten all the query strings I need into SEO-friendly URLs, like

RewriteRule ^item_([0-9]+)/$ database.php?type=product&id=$1 [L]
RewriteRule ^post_([0-9]+)/$ articles.php?id=$1 [L]
... and so on

but I would like to strip any other query strings like item_123/?foo=bar or database.php?foo=bar or post_123/?type=product&id=321 for both SEO and security reasons.

The apparently obvious solution of placing

RewriteCond %{QUERY_STRING} (.+)
RewriteRule (.*) http://www.example.com/$1? [R=301,L]

at the end of .htaccess, to deal with everything that has not been dealt with before and stopped by [L] flags, actually breaks the original RewriteRule and redirects item_123/ to an empty database.php with no parameters.

Is it possible to remove all query strings except for those already mod_rewritten earlier without explicitly writing down exceptions for all pairs of %{REQUEST_URI}s and %{QUERY_STRING}s?

Edit:

Solution A

# You do not need this whole block if you're running Apache v2.3.9+
RequestHeader set SOME-FANCY-NAME-FOR-THE-HEADER-AS-DESCRIBED-IN-THE-ABOVE-LINK 1 env=END

RewriteCond %{HTTP:SOME-FANCY-NAME-FOR-THE-HEADER-AS-DESCRIBED-IN-THE-ABOVE-LINK} =1 [NV]
RewriteRule .* - [L]

As the [END] flag works only on Apache v2.3.9+, I used a workaround which would emulate this behaviour.

# Replace [L,E=END:1] with [END] if running Apache v2.3.9+
RewriteCond %{THE_REQUEST} ^GET\ [^?]+$
RewriteRule ^item_([0-9]+)/$ database.php?type=product&id=$1 [L,E=END:1]

Simply rejecting any ? in THE_REQUEST in the first place makes duplicate pages of the item_123/?foo=bar pattern return not found (404). The [L,E=END:1] flag tells mod_rewrite to stop the current iteration and reiterate; the next iteration will trigger RewriteRule .* - [L] and keep the request from reaching the potential loop we have afterwards. The [END] flag, if supported, would stop it straight away.

RewriteCond %{QUERY_STRING} type=product
RewriteCond %{QUERY_STRING} id=([0-9]+)
RewriteRule ^database\.php$ http://www.example.com/item_%1/? [R=301,L]

This will also redirect (301) potentially compromised duplicate pages of the database.php?type=product&foo=bar&id=123 pattern to the correct URL, regardless of gibberish parameters in the query. Once the request reaches the correct URL, it stops there without causing a loop and a 500 error.

# If page is accessible without parameters

RewriteCond %{THE_REQUEST} ^GET\ [^?]+$
RewriteRule ^catalog/$ database.php [L,E=END:1]

RewriteCond %{THE_REQUEST} ^GET\ [^?]+\?
RewriteRule ^database\.php$ http://www.example.com/catalog/? [R=301,L]

If the page is accessible without parameters like ?type and &type above but accessed as database.php?foo=bar or database.php?, it will be redirected (301) to catalog/ without the query string. Again, a page of catalog/?foo=bar pattern will not be found (404).

# If page is not accessible without parameters

RewriteCond %{THE_REQUEST} ^GET\ [^?]+\?
RewriteRule ^database(\.php|/)?$ database.php [L,E=END:1]

If the page is not accessible without parameters, we can force stop rewriting (to avoid unnecessary redirects later on if e.g. we have anyotherfile.php rewritten to anyotherfile/) and make the page send a 404 header itself once it knows that no valid parameters have been passed.

Solution A+B

The code from the accepted solution is correct by itself, while my version extends rewriting to match many other malformed patterns.

Adding the code from the accepted solution after all of the above code will capture the (previously not found) links of the item_123/?foo=bar and catalog/?foo=bar patterns and redirect them (301) to the correct URLs item_123/ and catalog/ without the query strings. This makes sense, as users will get where they want to go even if they follow a link modified by an RSS aggregator or the like. Changing %{QUERY_STRING} (.+) to %{THE_REQUEST} ^GET\ [^?]+\? along with using %{THE_REQUEST} ^GET\ [^?]+$ instead of %{QUERY_STRING} ^$ in the above code will also remove trailing question marks (item_123/?), which would otherwise be overlooked and counted as duplicate pages if addressed.

RewriteCond %{THE_REQUEST} ^GET\ [^?]+\?
RewriteRule (.*) http://www.example.com/$1? [R=301,L]
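For reference, on Apache v2.3.9+ the item_ part of Solution A+B should collapse to roughly the sketch below, with [END] replacing the RequestHeader workaround as noted in the comments above. The request lines in the comments are made-up examples, and I have not run this exact combination.

```apache
RewriteEngine On

# THE_REQUEST is the request line as the client sent it, for example:
#   GET /item_123/ HTTP/1.1          -> matches ^GET\ [^?]+$   (no "?")
#   GET /item_123/?foo=bar HTTP/1.1  -> matches ^GET\ [^?]+\?  (has "?")
#   GET /item_123/? HTTP/1.1         -> matches ^GET\ [^?]+\?  (bare "?")

# Clean pretty URL: rewrite internally and stop all further rewriting.
RewriteCond %{THE_REQUEST} ^GET\ [^?]+$
RewriteRule ^item_([0-9]+)/$ database.php?type=product&id=$1 [END]

# Direct hit on the script with a usable id: 301 to the pretty URL.
RewriteCond %{QUERY_STRING} type=product
RewriteCond %{QUERY_STRING} id=([0-9]+)
RewriteRule ^database\.php$ http://www.example.com/item_%1/? [R=301,L]

# Anything else that was requested with a "?": 301 to the same path
# without the query string.
RewriteCond %{THE_REQUEST} ^GET\ [^?]+\?
RewriteRule (.*) http://www.example.com/$1? [R=301,L]
```

Note that every THE_REQUEST pattern here starts with ^GET, so requests using other methods such as HEAD or POST are not matched by these conditions.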
Community
  • Take care with wildcard rewrites (the last rule you have in question), and take note of the following: [Mod_Rewrite unexpected behavior L flag](http://stackoverflow.com/a/12106419/367456) – hakre Aug 29 '13 at 20:30

3 Answers


The L flag does not stop rewriting. If a rule changed the URL (which yours did), the result is re-injected and the rule set runs again. So after every internal redirect (rewrite), the condition on the very last rule is satisfied and that last rule fires:

RewriteCond %{QUERY_STRING} (.+)
RewriteRule (.*) http://www.example.com/$1? [R=301,L]

As this rule cuts away the query string (the substitution ends with ? and there is no QSA flag), you end up with the PHP script without parameters:

rewrite #1/1: item_5/ -> database.php?type=product&id=5
              L triggered, because URL changed, re-inject:
rewrite #1/2: database.php?type=product&id=5 -> http://www.example.com/database.php?
              R triggered, exiting

rewrite #2/1: http://www.example.com/database.php? -
              no rule matches, use as-is
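A trace like the one above can be reproduced from mod_rewrite's own log, which is often the quickest way to watch the re-injection happen. A minimal sketch, assuming you can edit the server or virtual host configuration (these directives are not allowed in .htaccess); the log path is a placeholder:

```apache
# Apache 2.2: dedicated rewrite log (hypothetical path).
RewriteLog "/var/log/apache2/rewrite.log"
RewriteLogLevel 3

# Apache 2.4 replaced the two directives above with per-module log
# levels; the trace then goes to the regular error log:
# LogLevel alert rewrite:trace3
```

Raise the level only while debugging, since verbose rewrite logging slows the server down.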

Instead you need to place a condition at the end to not redirect on .php files:

RewriteCond %{QUERY_STRING} (.+)
RewriteCond %{REQUEST_URI} !^/[a-z]+\.php$    
RewriteRule (.*) http://www.example.com/$1? [R=301,L]

or, if you've got a more modern Apache version (2.3.9 or later), just use the END flag:

RewriteRule ^item_([0-9]+)/$ database.php?type=product&id=$1 [END]
RewriteRule ^post_([0-9]+)/$ articles.php?id=$1 [END]
... and so on
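For the first option, a complete .htaccess might look like the sketch below. It only combines the two rules from the question with the guarded redirect; the [a-z]+\.php pattern assumes, as the rules in the question suggest, that all target scripts are lowercase .php files in the document root.

```apache
RewriteEngine On

# Pretty URLs from the question, rewritten internally.
RewriteRule ^item_([0-9]+)/$ database.php?type=product&id=$1 [L]
RewriteRule ^post_([0-9]+)/$ articles.php?id=$1 [L]

# Strip query strings everywhere except on /something.php, which is
# where the rules above land when the rule set runs again.
RewriteCond %{QUERY_STRING} (.+)
RewriteCond %{REQUEST_URI} !^/[a-z]+\.php$
RewriteRule (.*) http://www.example.com/$1? [R=301,L]
```

With this ordering, item_5/?foo=bar is matched by the first rule before the redirect is reached, so as far as I can tell the stray foo=bar is dropped by the internal rewrite rather than by a 301, and a direct request for database.php?foo=bar is left alone by the exception.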
hakre
  • A preceding slash was required in REQUEST_URI for the rule to work - `RewriteCond %{REQUEST_URI} !^/[a-z]+\.php$` Apart from that, it explains it perfectly. Had to go with the first option plus some voodoo since the [END] flag, which is exactly what I need, was not supported. – obento not ubuntu Aug 30 '13 at 09:24
  • Ah yes, the request URI is normally preceded by that, a little oversight. Corrected the answer now. – hakre Aug 30 '13 at 10:47

I don't know if this helps or not, but the way I handle things is to send requests for files that don't exist to a specific PHP file (rewrite.php):

RewriteCond %{SCRIPT_FILENAME} !-d
RewriteCond %{SCRIPT_FILENAME} !-f
RewriteRule ^.*$ ./rewrite.php

This lets me handle pretty much every case I have come across easily.

hendr1x

You can avoid this by using:

RewriteRule ^item_([0-9]+)/.*$ abc.php?type=product&id=$1 [L]

I added .* to match anything after the slash, but it is still a valid pattern for your redirect.

  • This a) ignores the query and b) allows duplicate pages like `item_123/abcd` and `item_123/?foo=bar` instead of throwing 404 not found or 301 redirecting to `item_123/`. – obento not ubuntu Aug 30 '13 at 09:56
  • are you sure? It redirects all valid `item_([0-9]+)/` and ignores everything after / –  Aug 30 '13 at 11:09
  • It does not redirect (`[R=301,L]` does), it rewrites (or, ok, internally redirects) the path. So it displays the same content for `item_123/`, `item_123/abcd` and `item_123/?foo=bar`, which is not the desired behaviour. In other words, I want to _get rid_ of the .* part in the URL, not _ignore_ it. – obento not ubuntu Aug 30 '13 at 11:48