
I have a single-page, JS-based website. It has different "subsites" and "categories", but since everything is controlled via JS, and since there are about 30 subsites and 50 categories, I never created a "normal" subsite structure like

www.example.com/subsite/category

Instead I have only the main site

www.example.com

and that's all, everything else is controlled via JS.

But I want to rank better on Google, and for that I need to create subpages as well. I want to keep the JS-based behavior, and that part is already able to handle the different URLs (www.example.com/subsite/category) correctly: it checks the URL, extracts the subsite and the category, and passes them to the right JS as parameters. So my single-page site acts like a multi-page site, and that works fine.

At this point my .htaccess rewrites all requests for non-existent files and directories to the home page, keeping the URL itself unchanged, so the JS can use it properly.

RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule . index.html [L]

But I also want to handle non-existent subsites and categories: they should return a 404. And I need to handle this via .htaccess.

So I thought maybe there is a way to define the existing subsites and categories as variables in .htaccess, so that all of their combinations are accepted and everything else returns a 404.

For example (in JS, since I don't know how to handle this in .htaccess):

var subsitesArray = ["foosite", "barsite"];
var categoriesArray = ["foocategory", "barcategory"];

So the valid URLs would be:

www.example.com/foosite/foocategory
www.example.com/foosite/barcategory
www.example.com/barsite/foocategory
www.example.com/barsite/barcategory

All other URLs would be invalid and should return a 404.

If I had to list all URLs manually, that would be 30*50 URLs... That's way too many.

Is it possible to solve it somehow in .htaccess?

UPDATE#3 Please update the code to support the following points:

  • The /site1 ... /site30 and /category1 ... /category50 subsites exist on the server (there is an index.html in each of these directories), so the .htaccess rules should not rewrite them to index.html, but let the "physical" files be served.
  • So only the /site1/category1 ... /site1/category50 ... /site30/category1 ... /site30/category50 variants should be rewritten to index.html.
  • URLs with repeated / characters, such as www.example.com/////site1///category2, are still accepted, but they should not be.
  • A URL that ends with a / character is not accepted, but it should be: www.example.com/site1/category is accepted, whereas www.example.com/site1/category/ (ending with /) is not.

Can you please update the code? These would be the final modifications, and it would work perfectly.

Thank you in advance.

lpasztor

1 Answer


This would seem to be a continuation of your earlier question. (Previously, "site" and "category" were separated by a hyphen in the URL, and both could seemingly contain a hyphen themselves, at least in your example, which would have made checking them independently in .htaccess pretty much impossible.) The URL now has the form:

www.example.com/<site>/<category>

Following this URL pattern you could validate the <site> and <category> separately, so you would end up with 30 + 50 (80) directives, as opposed to 30 * 50 (1500) directives if you were to do this one-by-one.

For example, you could do something like the following in .htaccess (this replaces your existing rule):

DirectoryIndex index.html

RewriteEngine On

# If the request is not of the form "/site" or "/site/category" then stop here
RewriteRule !^[^/.]+(/[^/.]+)?$ - [L]

# Validate "site" (first path segment)
RewriteCond $1 !=site1
RewriteCond $1 !=site2
RewriteCond $1 !=site3
# etc.
RewriteCond $1 !=site30
RewriteRule ^([^/.]+) - [R=404]

# Validate "category" (second path segment)
RewriteCond $1 !=category1
RewriteCond $1 !=category2
RewriteCond $1 !=category3
# etc.
RewriteCond $1 !=category50
RewriteRule ^[^/.]+/([^/.]+)$ - [R=404]

# Front-controller
RewriteRule . index.html [L]
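
The `# etc.` lines above stand in for the remaining conditions. Rather than typing all 80 by hand, the directives could be generated from two arrays, much like the ones in the question. The following is only a minimal sketch for Node.js: the function names `buildValidationBlock` and `buildHtaccess` and the three-element arrays are made up for illustration, and it assumes the site and category names contain no spaces or regex-special characters.

```javascript
// Hypothetical example data; replace with the real 30 sites and 50 categories.
const sites = ["site1", "site2", "site3"];
const categories = ["category1", "category2", "category3"];

// Build one "RewriteCond $1 !=value" line per permitted value,
// followed by the RewriteRule that triggers the 404.
function buildValidationBlock(label, values, rule) {
  const conds = values.map((value) => `RewriteCond $1 !=${value}`);
  return [`# Validate "${label}"`, ...conds, rule].join("\n");
}

// Assemble the same directives as in the answer, in the same order.
function buildHtaccess(siteNames, categoryNames) {
  return [
    "DirectoryIndex index.html",
    "",
    "RewriteEngine On",
    "",
    '# If the request is not of the form "/site" or "/site/category" then stop here',
    "RewriteRule !^[^/.]+(/[^/.]+)?$ - [L]",
    "",
    buildValidationBlock("site", siteNames, "RewriteRule ^([^/.]+) - [R=404]"),
    "",
    buildValidationBlock("category", categoryNames, "RewriteRule ^[^/.]+/([^/.]+)$ - [R=404]"),
    "",
    "# Front-controller",
    "RewriteRule . index.html [L]",
  ].join("\n");
}

console.log(buildHtaccess(sites, categories));
```

The output can be pasted into .htaccess. Keeping the lists in one place also means the same arrays could be reused by the JavaScript that handles the routes.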

UPDATE: I removed the filesystem check entirely in favour of a rule that checks the format of the URL-path at the top of the file (ie. RewriteRule !^[^/.]+(/[^/.]+)?$ - [L]). If the requested URL-path does not (as denoted by the ! prefix) match a URL of the form /<value> or /<value>/<value> then the remaining directives are skipped entirely and the request is not rewritten to index.html.

!=site1 - The ! prefix negates the expression, so the condition is successful when the expression does not match. The = prefix-operator makes this a lexicographical string comparison (exact match), rather than a regex. The conditions (RewriteCond directives) are implicitly AND'd, so the rule is triggered only when every condition is successful, ie. when $1 equals none of the listed values.

$1 in each condition contains the value of the captured group from the pattern of the RewriteRule that follows it. In the first rule this contains the value of the first path-segment (the "site") and in the second rule this contains the value of the second path-segment (the "category").

If the "site" does not match in the first rule then a 404 is immediately triggered. If the "category" does not match in the second rule (only processed if the "site" is valid) then a 404 is immediately triggered.

RewriteRule ^([^/.]+) - [R=404]

This rule captures the first path segment in the URL-path and stores this in the $1 backreference. So, given a URL of the form /site/category (or just /site) then this captures site. This is then used in the preceding RewriteCond directives to validate that $1 contains one of the expected values. If all the preceding conditions are successful (ie. the first path segment does not match one of the permitted "sites") then a 404 is triggered. Note that I've restricted the path segment so it can no longer contain dots (this is the same pattern as used in the first rule).

RewriteRule ^[^/.]+/([^/.]+)$ - [R=404]

This is similar to above, except that it captures the second path segment (ie. the "category") in the $1 backreference. The preceding conditions then validate this. This is only processed if the preceding rule is not successful (ie. the first path segment matches a valid "site").
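
To make the order of evaluation easier to follow, here is a rough JavaScript model of the four rules. It is only a sketch of the logic, not how Apache works internally: the function name `decide`, the return labels and the short example lists are invented for illustration, and the argument is assumed to be the URL-path as a RewriteRule in .htaccess sees it (without the leading slash).

```javascript
// Hypothetical permitted values; the real lists have 30 and 50 entries.
const SITES = new Set(["site1", "site2", "site3"]);
const CATEGORIES = new Set(["category1", "category2", "category3"]);

// urlPath has no leading slash, e.g. "site1/category2".
function decide(urlPath) {
  // Rule 1: not of the form "site" or "site/category", so stop here
  // and let Apache handle the request normally.
  const format = /^[^/.]+(\/[^/.]+)?$/;
  if (!format.test(urlPath)) return "pass-through";

  // Rule 2: the first path segment must be a permitted site.
  const site = urlPath.match(/^([^/.]+)/)[1];
  if (!SITES.has(site)) return "404";

  // Rule 3: the second path segment, if present, must be a permitted category.
  const second = urlPath.match(/^[^/.]+\/([^/.]+)$/);
  if (second && !CATEGORIES.has(second[1])) return "404";

  // Rule 4: front-controller.
  return "index.html";
}

console.log(decide("site1/category2")); // index.html
console.log(decide("site1/456"));       // 404
console.log(decide("css/style.css"));   // pass-through
```

Note that "pass-through" only means the rules do nothing. Apache then serves the file if it exists, or responds with its own 404 if it does not.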

If you define the custom error document as your index.html front-controller (in which you re-check the requested URL in JavaScript), then you can customise the 404 response using JavaScript, as with all your other URLs (this should go at the top of the file):

ErrorDocument 404 /index.html
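
The re-check in JavaScript could look something like the following. This is a sketch under assumptions, not your actual code: `parseRoute` and the example lists are hypothetical, and I am assuming the browser reports the path unchanged (including any repeated slashes), so the sketch rejects `//` explicitly and tolerates a single trailing slash.

```javascript
// Hypothetical permitted values; the real lists have 30 and 50 entries.
const SITES = new Set(["site1", "site2", "site3"]);
const CATEGORIES = new Set(["category1", "category2", "category3"]);

// Returns { site, category } for a valid route, or null for "not found".
// pathname starts with a slash, e.g. "/site1/category2".
function parseRoute(pathname) {
  // Repeated slashes are never valid.
  if (pathname.includes("//")) return null;

  // Drop one optional trailing slash, then split into path segments.
  const segments = pathname.replace(/\/$/, "").split("/").slice(1);

  // "/" is the home page.
  if (segments.length === 0) return { site: null, category: null };
  if (segments.length > 2) return null;

  const [site, category = null] = segments;
  if (!SITES.has(site)) return null;
  if (category !== null && !CATEGORIES.has(category)) return null;
  return { site, category };
}
```

In the page you would call it as `parseRoute(window.location.pathname)` and show your "not found" content when it returns `null`. The HTTP status itself is still decided by the server, so this only controls what the visitor sees.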

UPDATE / ERROR CORRECTION:

RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d [OR]
RewriteRule ^ - [L]

There should be no OR flag on the 2nd condition! I've corrected this in the code above.

The OR on the 2nd condition would have prevented the rest of the code from doing anything! So any site/category URL would have resulted in a 404.

MrWhite
  • Dear @MrWhite, thank you - again. Yes, I thought it's a better idea to have "only" 30+50 (80) URLs than 30*50 (1500), that's why I decided to use the /<site>/<category> URL structure. I would have a wish: Can you please modify the code to support cases where only the /<site> or the /<category> is present? Both are valid as well. So www.example.com/<site> and www.example.com/<category> are also valid. Should I edit the original question with this modification request? Thank you in advance. – lpasztor Mar 27 '22 at 17:07
  • @lpasztor "to have "only" 30+50 (80) URLs than 30*50 (1500)" - That's 80 "directives" (or checks), you would still have 1500 possible URLs. Allowing both `/<site>` _and_ `/<category>` as well does complicate it somewhat (in order to avoid repetition). Do you have access to the main server config (or `<VirtualHost>` container)? Or are you limited to `.htaccess`? – MrWhite Mar 27 '22 at 18:28
  • Only `.htaccess`, plus all the possibilities of cPanel. And you are right, I expressed myself wrong, it's really still 1500 URLs. And I'm not sure if Google would like it at all. So maybe I should consider using either only `/<site>` or `/<category>`. And since people search mostly for categories (rather than sites), maybe that's better. So maybe in this case I should use the `# Validate "site" (first path segment)` part of the code, but use `category` instead of `site`. I think I will update the question with this part as well. Thank you. – lpasztor Mar 27 '22 at 19:36
  • Unfortunately it's not working. :-/ I created a .htaccess file where only your code is present (copy-paste), but instead of opening the index.html and letting the URL untouched, it gives me error saying `The requested URL /site1/category1 was not found on this server`. Can you please maybe check your code again? Thank you. – lpasztor Apr 10 '22 at 22:01
  • @lpasztor Sorry, there was a typo/error... there should be no `OR` flag on the 2nd _condition_ near the top of the file! I've updated my answer. (Have you set `ErrorDocument 404 /index.html`? That should still have allowed your code to handle the request, except that the HTTP status would have been a 404.) – MrWhite Apr 10 '22 at 22:38
  • Unfortunately still not OK. Now it accepts all weird URLs like `www.example.com/123` or `www.example.com/////wrong-site`. Interestingly it won't accept `www.example.com/123/456`, says 404. So when there are 2 path segments in the URL, it accepts only the right ones like `www.example.com/site1/category2`. So the code has issues only with the case when only the first path segment is given. In that case it should accept only first path segment "sites" like `www.example.com/site1`, right? Thank you. – lpasztor Apr 11 '22 at 06:07
  • @lpasztor I've updated the code in my answer. It should now only process URLs of the form `/site` or `/site/category`. If the URL does not match that format then it is not passed to `index.html`. Previously it would have only validated URLs of the form `/site/category`, but URLs of a different format would have erroneously dropped through to `index.html`. I've also limited `site` and `category` so they cannot contain dots (to avoid conflicts with actual files, without needing a filesystem check). – MrWhite Apr 12 '22 at 00:14
  • Sorry to bother you with it, but still not OK. Now it blocks all the resources, won't load any of my files (.css, .js, .jpg - nothing). And when I try to open the site with the index.html `www.example.com/index.html`, it says 404. Could you please check it? Thank you. – lpasztor Apr 12 '22 at 07:21
  • @lpasztor No worries and sorry I made another typo/error! I wrote the correct _comment_, but missed the `!` prefix on the expression! It should be negated. ie. `RewriteRule ^[^/.]+(/[^/.]+)?$ - [L]` should be `RewriteRule !^[^/.]+(/[^/.]+)?$ - [L]` - I've updated my answer and added more (possibly overlapping) explanation. However, I realise this still doesn't address your UPDATE#1 and #2 - I'll have a look at that later. – MrWhite Apr 12 '22 at 12:27
  • Now it looks much-much better, thank you. Please don't care about UPDATE#1 and #2 for now. So now the right site- or category names are opening only, that's good. But I can still see, that for example `www.example.com/////site1` is still accepted. `site1` is valid, but the `/////` should not be accepted. And regarding UPDATE#2: I would like to be able to open `www.example.com/category1` as well. (So from the second path segment.) But I can't add it simply to the first path segment, because in that case `www.example.com/category1/category1` would be also accepted. Which is not good. – lpasztor Apr 13 '22 at 15:52
  • To continue my previous comment: Also please note, I just realized, that when the good URL ends with `/`, it's not accepted, however that's a valid URL. So for example `www.example.com/site1` is accepted, but `www.example.com/site1/` is not, however it should, since it's a valid URL for the same page. Thank you. – lpasztor Apr 13 '22 at 19:16
  • Should I put all the open points together into an update to the original question? There would be still some fine tuning missing, also a finalized idea with some minor modifications regarding the original request, and I don't want to disturb you with my comments. Thank you. – lpasztor Apr 20 '22 at 11:29
  • Please check UPDATE #3. Thank you. – lpasztor Apr 20 '22 at 21:08
  • Wouldn't it be a lot simpler to have one rule: `RewriteRule ^/?(site1|site2|site3)/(category1|category2|category3)/?$ index.html [L]`? – Stephen Ostermiller Apr 21 '22 at 09:56
  • @StephenOstermiller Wow, nice and clean, thank you! However it still has one of the earlier issues: for some reason it accepts when multiple `/` characters are given. For example `www.example.com////site1/////////category1` is also accepted (redirects to index.html), however it should be 404. Can you please provide a fix for that? Thank you. – lpasztor Apr 23 '22 at 20:58
  • Merging into one rule is OK if you have just 3 sites and 3 categories, but in this case you have 30 sites and 50 categories - I would keep them as separate conditions (or perhaps merge into small groups - but readability may suffer. It's usually easier to scan/maintain a vertical list). The multiple slashes are an entirely separate issue. See [my answer](https://stackoverflow.com/a/69473909/369434) to the following question to resolve this. [Removing Double Slashes From URL By .htaccess does not work](https://stackoverflow.com/q/69469146/369434) – MrWhite Apr 23 '22 at 23:17
  • To add to what MrWhite just said, the multiple slashes are a quirk of Apache because it condenses multiple slashes to one before the rewrite rules are run. You can use a separate rule that has a condition that examines `%{THE_REQUEST}` which has the raw URL before Apache mucks with it. The rule to remove the multiple slashes should come *before* your other rules. (And despite MrWhite's "readability" concerns, I still prefer a single rule, even if it has fifty elements.) – Stephen Ostermiller Apr 23 '22 at 23:52
  • @MrWhite and @Stephen Ostermiller. As hopefully the last request in this topic: Can you please show me how would that 1 liner `RewriteRule ^/?(site1|site2|site3)/(category1|category2|category3)/?$ index.html [L]` look with separate conditions/groups? Just to see, which would be better for this 30+50-project. Thank you. – lpasztor Apr 24 '22 at 04:55
  • That is what MrWhite put in his answer. The separate conditions are like `RewriteCond $1 !=site1`. – Stephen Ostermiller Apr 24 '22 at 09:25
  • @StephenOstermiller Thank you. Actually I chose your solution (as a comment, I can't mark it as the accepted answer), because as I wrote in UPDATE#3, the solution from MrWhite needed one more change, and without that it's not working as it should. But MrWhite, thank you also for all of your help and patience with me. – lpasztor Apr 25 '22 at 21:39