
I'm trying to write a PHP function to validate a URL which, depending on the user input, may or may not already be URL-encoded.

I know from this answer that spaces should be encoded as such:

You should have %20 before the ? and + after.
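
To make that rule concrete, here is a minimal sketch using PHP's built-in encoders. The sample string `my file+v2` is made up for illustration; the point is only that the path-style and query-style functions treat a space and a literal `+` differently.

```php
<?php
// Path segments: rawurlencode() follows RFC 3986, so a space becomes %20.
$segment = rawurlencode('my file+v2');   // "my%20file%2Bv2"

// Query values: urlencode() uses form encoding, so a space becomes +.
$value = urlencode('my file+v2');        // "my+file%2Bv2"

// Decoding mirrors this: only urldecode() turns + back into a space.
$asForm = urldecode('a+b');              // "a b"
$asRaw  = rawurldecode('a+b');           // "a+b"
```

In both encoders a literal `+` comes out as `%2B`, which is why a bare `+` in an already-encoded query string would normally mean a space.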

The core failure in my function is the use of this:

filter_var($url, FILTER_VALIDATE_URL) !== false

This returns true when + appears in either the path or the query string. The path is not a problem: I can split it from the query string and return false if it contains a + (requiring the user to decide on %20 or %2B).
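
A minimal sketch of that path check, assuming `parse_url` is an acceptable way to separate the path from the query string. The function name `hasUnambiguousPath` is hypothetical, not part of my actual code.

```php
<?php
// Hypothetical helper: true if the URL passes FILTER_VALIDATE_URL
// and its path contains no literal +.
function hasUnambiguousPath(string $url): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    $path = parse_url($url, PHP_URL_PATH);
    if ($path === false) {
        // parse_url() gives up on seriously malformed URLs.
        return false;
    }

    // parse_url() returns null when the URL has no path component.
    return $path === null || strpos($path, '+') === false;
}
```

This only covers the path; a + in the query string still passes through unexamined, which is the part I'm stuck on.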

But my question is what to do if I find + in the query string? How do I know if this is a proper use of an encoded space, or if it needs to be encoded as %2B?

Jeff Puckett
  • You won't BELIEVE how much time I lost on this (encoding and unencoding filenames with a + in them). Good for me, I was uploading files to AWS S3. So what I ended up doing is stripping all the `+` from the filenames before I upload them. There, problem solved. :) – Sergio Tulentsev Jul 01 '16 at 16:20
  • Based on the specification you stated, any `+` present in the query string prior to the encoding is already a %2B. So if you find a `+`, then it was a space. – pah Jul 01 '16 at 16:21
  • @threadp: if only every tool adhered to the spec. This is faaaaaar from truth. – Sergio Tulentsev Jul 01 '16 at 16:22
  • @threadp yes, I have no problem with what happens *before* the query string, it's what happens *after* that I'm stuck on. – Jeff Puckett Jul 01 '16 at 16:22
  • @SergioTulentsev That's the problem with standards... there are so many we can choose from :) – pah Jul 01 '16 at 16:22
  • @JeffPuckettII To be honest, I guess that you have to assume a standard (whatever that is) and stick to it. I don't really like making assumptions either, but... what happens if you try to interpret the word 'no' without deciding which language you are interpreting it in? It may mean a lot of different things in other languages. – pah Jul 01 '16 at 16:27
  • @SergioTulentsev to clarify further, I'm trying to write a validator function that will be used to handle both encoded and unencoded URLs. Because it's user input, I don't know whether it's encoded properly or not, hence the need for this function. – Jeff Puckett Jul 01 '16 at 16:31
  • Ah, just re-read your question. "How do I know if this is a proper use of an encoded space or if it needs to be encoded as %2B" - you don't. You could try to look for other percent-encoded entities around. If you find them, then this form is likely encoded and this is a space. But then again, how do you know that `%2B` is encoded `+` or should it be encoded as `%252B`? Good luck! :) – Sergio Tulentsev Jul 01 '16 at 16:32
  • You could try actually loading something by that url. If you get an HTTP 200, the url is fine. :) – Sergio Tulentsev Jul 01 '16 at 16:34
  • This is a similar problem to validating email address format. The standard is crazy complex, so most apps do not bother with implementing it in full. They just do the bare minimum (check that `@` is there, or something) and then send you an _actual email_. If you clicked the verification link, the email address must be valid. – Sergio Tulentsev Jul 01 '16 at 16:36
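
The heuristic suggested in the comments above (look for other percent-encoded sequences near the `+`) could be sketched roughly as follows. The function name `guessPlusMeaning` and its return labels are hypothetical, and as the comments point out, this is a guess and not a proof: a `%2B` could itself be unencoded input.

```php
<?php
// Hypothetical helper: guess what a + in a query string was meant to be.
// Returns 'none' (no + present), 'space' (looks already encoded),
// or 'unknown' (no evidence either way).
function guessPlusMeaning(string $query): string
{
    if (strpos($query, '+') === false) {
        return 'none';
    }

    // Any %XX escape suggests the string already went through an encoder,
    // in which case a literal plus would have become %2B.
    if (preg_match('/%[0-9A-Fa-f]{2}/', $query) === 1) {
        return 'space';
    }

    return 'unknown';
}
```

For example, `guessPlusMeaning('q=a+b%26c')` returns `'space'`, while `guessPlusMeaning('q=1+1')` returns `'unknown'` and would still need a decision from the user.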
