Header 404 vs Header 400: url parsing error

Question

I'm writing my own little PHP framework. I want everything to be as semantically correct as possible, and I'm stuck.

I've got a URL parsing class. It parses the whole URL (scheme, subdomain, domain, resource and query). Next, the router class decides what to do with this URL. If a resource corresponds to the URL, it renders it; if not, it renders a 404; if the resource is forbidden, it renders a 403, and so on. Here is the problem:

Let's say that my site is at http://en.mysite.com, and that the pages asd and &*% do not exist. So I've got 2 URLs:

http://en.mysite.com/asd
http://en.mysite.com/&*%($^&#

Of course, neither page exists. But what should the headers look like? I would expect:

http://en.mysite.com/asd // header 404 Page not found
http://en.mysite.com/&*% // header 400 Bad request

However (based on our guru site):

http://stackoverflow.com/<<            // header 404
http://stackoverflow.com/&;:           // header 404
http://stackoverflow.com/&*%($%5E&#    // header 400 (which btw is not styled...)
https://www.google.com/%&*(#$*%&@^     // header 404...

What is the rule? Should every system decide which symbols are OK in a URL? In my case a URL should contain only [a-z0-9-_.#!]+. I'm using slashes to separate parameters, so I don't need ? = &. But what is the general rule? Is there a URL regex in the specification?

BTW: For those who will say "put 404 and go drink a beer": I probably will :).

But this problem is kind of serious for SEO, as a 400 is not treated the same as a 404 when it comes to search ranking. It would also be nice to style the 400 page my own way, and tell the visitor not "page not found" but "are you trying to inject something into my beautiful URL? It is a BAD REQUEST!"

It's up to you what your system decides to be "bad". There's nothing specified in RFC2616 http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html. But you can use RFC3986 to analyse whether the URL is formed correctly. – Inceddy, Jul 22 '15 at 14:10
Thanks for your reply and the direct w3.org link. According to it, everything really is left to the server... But this sentence is interesting: 400: "The client SHOULD NOT repeat the request without modifications". Does it mean that modern browsers cache this response, so that future requests are not even sent? A little off topic, but maybe 404 is better in that case... BR! – Jacek Kowalewski, Jul 22 '15 at 14:16

score 2 · Accepted Answer · edited May 23 '17 at 11:58

As far as I can tell from IETF RFC2616, 400 should be returned for requests that are malformed (i.e. do not conform to IETF RFC3986), whereas 404 should be returned for resources that do not exist (410 should be returned for resources that once existed but are now gone).

In the above examples, URLs with a %-sign not followed by two hexadecimal characters are definitely malformed (e.g. en.mysite.com/&*%($^&# and www.google.com/%&*(#$*%&@^). A second ? (question mark) in the query part is not malformed by itself, though: RFC3986 allows a literal ? inside the query.
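
To make the 400/404/410 split concrete, here is a minimal sketch of how a router could pick the status code. It is written in Python for brevity; the same two regular expressions work with PHP's preg_match. The names is_well_formed_path and choose_status, and the existing/removed sets, are made up for illustration. The allowed-character list follows the path rules of RFC3986, but how strict to be is your own policy decision.

```python
import re

# Characters RFC 3986 allows literally in a path (unreserved, sub-delims,
# ":", "@", "/"), plus "%", whose escapes are checked separately below.
_ALLOWED = re.compile(r"[A-Za-z0-9\-._~!$&'()*+,;=:@/%]*")

# A "%" that is not followed by two hex digits is a broken percent-escape.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def is_well_formed_path(path):
    """Return True if the path uses only allowed characters and valid escapes."""
    return (_ALLOWED.fullmatch(path) is not None
            and _BAD_ESCAPE.search(path) is None)


def choose_status(path, existing, removed=frozenset()):
    """Hypothetical router decision: 400 for a malformed path, otherwise
    200 / 410 / 404 depending on whether the resource is known."""
    if not is_well_formed_path(path):
        return 400  # Bad Request: the URL itself is malformed
    if path in existing:
        return 200  # resource exists
    if path in removed:
        return 410  # Gone: resource existed once but was removed
    return 404      # Not Found: well-formed URL, no such resource


pages = {"/home"}
print(choose_status("/asd", pages))      # well-formed, unknown page
print(choose_status("/&*%($^&", pages))  # "%(" is not a valid escape
```

Note that the part after # (the fragment) is normally not sent to the server at all, so the router never sees it.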

A regular expression for URLs can be found in response to the question: PHP validation/regex for URL.
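
As for a regex in the specification itself: Appendix B of RFC3986 gives one, but it only splits a URI reference into its components; it does not validate it. A small sketch of using it (again in Python; split_uri is a made-up helper name):

```python
import re

# Regular expression from RFC 3986, Appendix B. It splits a URI reference
# into components; it matches any string, so it does NOT validate.
URI_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Hypothetical helper: return the five RFC 3986 components of a URI."""
    m = URI_SPLIT.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


print(split_uri("http://en.mysite.com/asd"))
print(split_uri("http://en.mysite.com/&*%($^&#"))
```

The second call still "parses" fine, which is why you need a separate character and percent-escape check before deciding between 400 and 404.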

answered Jul 22 '15 at 14:44

joosts

Thanks for your answer. I think it dispels all my doubts. If you don't mind, I will wait a little before clicking accept; maybe someone is writing an "encyclopedic" answer right now :). +1 from me. – Jacek Kowalewski Jul 22 '15 at 14:57
