68

How can I check whether a string has already been URL-encoded?

For example, if I encode TEST==, I get TEST%3D%3D. If I encode that result again, I get TEST%253D%253D, so I need to know beforehand whether the string is already encoded.

I have encoded parameters saved, and I need to search for them. I don't know whether the input parameters will arrive encoded or not, so I need to determine whether to encode or decode them before searching.

Trick

9 Answers

55

Decode the string and compare it to the original. If they differ, the original was encoded; if they are identical, it was not. This still says nothing about whether the decoded version is itself encoded, so repeat the check on the result. A good task for recursion.

I hope one can't write a quine in urlencode, or this algorithm would get stuck.

Exception: when a string contains a "+" character, the URL decoder replaces it with a space even though the string is not URL-encoded.
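
A minimal sketch of the decode-and-compare loop in Java, assuming Java 10+ for the `Charset` overload of `URLDecoder.decode`. The class name `DecodeLoop`, the method names, and the cap of 10 rounds are illustrative choices, not part of the original answer. It treats a malformed escape as "not encoded" and escapes `+` before decoding, which assumes spaces were encoded as `%20` rather than `+`.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class DecodeLoop {
    // Hypothetical safety cap so a pathological input cannot loop forever.
    private static final int MAX_ROUNDS = 10;

    /** Decodes repeatedly until the string stops changing. */
    static String fullyDecode(String input) {
        String current = input;
        for (int round = 0; round < MAX_ROUNDS; round++) {
            String decoded;
            try {
                // Assumption: spaces were encoded as %20, so '+' is a literal plus.
                // Escaping it first stops URLDecoder from turning it into a space.
                decoded = URLDecoder.decode(current.replace("+", "%2B"), StandardCharsets.UTF_8);
            } catch (IllegalArgumentException e) {
                // Malformed escape such as "%s": treat the string as not encoded.
                return current;
            }
            if (decoded.equals(current)) {
                return current; // nothing left to decode
            }
            current = decoded;
        }
        return current;
    }

    /** True if at least one decoding round changed the input. */
    static boolean isEncoded(String input) {
        return !fullyDecode(input).equals(input);
    }
}
```

With these assumptions, `fullyDecode("TEST%253D%253D")` returns `TEST==`, while `a+b` and `100%s` come back unchanged. It still cannot tell a raw string that happens to contain a valid escape, such as `%20`, from an encoded one.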

Lonzak
SF.
  • You gave me the idea, how to do this. Now my SQL looks like `SELECT * FROM something WHERE param= " + param + " OR param = "+encode(param)` – Trick Feb 19 '10 at 11:30
  • How do you know that you don't need ``SELECT * FROM something WHERE param= " + param + " OR param = "+encode(param) + " OR param = "+encode(encode(param))``? That way lies infinite regress. – sverkerw Feb 19 '10 at 14:04
  • 1
    well, true except of a case where "good enough" is enough; if the 0.01% of users really want the program not to work, it won't work for them. Sometimes the extra, extreme clauses are just not worth the effort and the overhead. – SF. Feb 21 '10 at 12:19
  • 12
    This fails if your string contains windows variable names like `%DESCRIPTION%` which decodes to `ÞSCRIPTION%` or `%ABOUT%` which becomes `«OUT%`. – benrifkah Mar 22 '12 at 16:14
  • @benrifkah: true but then there is no way to tell them apart if the input is completely arbitrary. – SF. Mar 23 '12 at 11:44
  • @SF. Indeed. I posted the caveat to expose the issue so that people could act according to their needs. – benrifkah Mar 27 '12 at 19:50
  • 3
    @SF. : This will fail if the initial unencoded string contains a + character in the middle. The decoded string will contain a space character instead and it will not be equal. A better way would be to compare the lengths. If the original string is larger than the decoded string, then the original was encoded. – stan Sep 19 '13 at 14:23
  • @SF. : But my attempt above also doesn't say anything about whether the newly decoded version isn't encoded as well. – stan Sep 19 '13 at 14:24
  • 4
    It doesn't work if the raw string contains a plus sign. You decode it, compare to the original, and the strings are different. The + has been replaced with space. You end up not encoding it, even though you should. – ceiroa Mar 31 '14 at 17:35
  • You sir, saved me in a moment, when my brain has stopped and couldn't figure the following before reading your comment if( contactNumber != Uri.encode(contactNumber)){ contactNumber = Uri.encode(contactNumber); } – Kristian Ivanov Jun 28 '16 at 19:37
  • 5
    This is wrong. When a string contains a "+" character, the URL decoder replaces it with a space even though the string is not URL-encoded. See http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html – Prabhath Suminda Oct 05 '16 at 08:49
  • 1
    This doesn't logically work if using the java.net.URLDecoder.decode(String, String) implementation. Reason: if an unencoded string contains "%xy" where x is a non-hex letter, such as "s", trying to decode it throws IllegalArgumentException("URLDecoder: Illegal hex characters in escape (%) pattern - negative value"). – Darrin Jan 27 '20 at 23:08
  • Actually, the logic of attempting to "decode" something that you have not passed through some filter logic is always just a bad design. This bad design encourages more bad design, such as choosing to try and catch all "Throwable" exceptions and ignore them. Doing that adds time to process and then ignore exceptions, or worse, hide an exception that could have been useful in diagnosing a real problem. Just bad, very bad. ;-) – Darrin Jan 27 '20 at 23:21
  • Does not work if url contains any query parameters with encoded urls inside. Like log-in links with redirect info inside or query params – Aliaksei Bulhak Apr 24 '20 at 13:53
  • This fails on URLs containing non-BASIC_LATIN characters too. – Tooraj Jam Jan 09 '21 at 08:04
21

Use a regular expression to check whether your string contains illegal characters (i.e. characters that cannot appear in a URL-encoded string, such as whitespace).
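
One way to sketch such a check in Java is to accept only RFC 3986 unreserved characters and well-formed `%XX` escapes. The class and method names are hypothetical, and the pattern assumes strict component encoding with `%20` for spaces, so a `+` or any reserved character counts as "not encoded". A `false` result proves the string is not encoded; a `true` result only means it could be.

```java
import java.util.regex.Pattern;

class EncodedSyntax {
    // Unreserved characters (RFC 3986), or a percent sign followed by two hex digits.
    private static final Pattern ENCODED =
            Pattern.compile("(?:[A-Za-z0-9._~-]|%[0-9A-Fa-f]{2})*");

    /** True if the string could be the output of percent-encoding. */
    static boolean couldBeEncoded(String s) {
        return ENCODED.matcher(s).matches();
    }
}
```

Under this pattern `hello%20world` passes, while `hello world` and `interest20%growth` are rejected because of the space and the `%` that is not followed by two hex digits. A plain word like `hello` also passes, which is the ambiguity no syntactic check can remove.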

Roman
  • I did not do this, but this is the solution. – Trick Feb 19 '10 at 11:32
  • 13
    So how will you differentiate between `hello%20world` and `interest20%growth` ? The first is a valid urlencoded string, the other is a string that has to be escaped and does not produce a valid unescape. – SF. Feb 19 '10 at 12:38
  • 2
    Checking for illegal characters does not include the percent symbol because it is not illegal it just gets escaped. When you check for the percent symbol you *may* have a URI encoded string if it is followed by "25". This only works if you know that your input is either not encoded or encoded exactly 1 time *and* that the input does not naturally include sequences that URI encoding generates. – benrifkah Mar 27 '12 at 20:02
  • Unfortunately, this was NOT the solution. I'm passing a URL as the url encrypted string, so I did an REFind(':', str) and it returns 6 (https:) whether the string is encrypted or not. – Chris Geirman Sep 02 '14 at 04:26
  • 3
    If a string contains invalid chars, you can prove it is not encoded, but if it contains only valid chars and percent signs, that does not prove that it is encoded. That is not knowable. So this may be as good a check as one can realistically do. – Paul Kienitz Oct 01 '15 at 20:44
6

Try decoding the URL. If the resulting string is shorter than the original, the original URL was already encoded. Otherwise you can safely encode it: either it is not encoded, or encoding leaves it unchanged anyway, so encoding again will not produce a wrong URL. Below is sample pseudocode (inspired by Ruby):

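    require 'uri'

    # URI.escape and URI.unescape were removed in Ruby 3.0;
    # the RFC 2396 parser still provides equivalent methods.
    PARSER = URI::RFC2396_Parser.new

    # Returns the encoded URL for any given URL after determining
    # whether it is already encoded or not.
    def escape(url)
      unescaped_url = PARSER.unescape(url)
      if unescaped_url.length < url.length
        url                  # decoding shortened it, so it was already encoded
      else
        PARSER.escape(url)   # nothing to decode, so encode it now
      end
    end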
amit_saxena
  • this won't work if the url is encoded in the way that a `' '`(space) is replaced by a `'+'` because the length then stays the same – Florian K Jun 14 '18 at 11:48
  • It's probably better to encode spaces in your URLs only as %20. The pros are described here: https://stackoverflow.com/a/2678602/762747 If that's not a possibility, then maybe you can check for + signs after the ?, and if you find any, the URL is already encoded and you can return it as is. It's just an extra check on top of the above code, depending on your use case. – amit_saxena Jun 15 '18 at 12:26
3

You can't know for sure, unless your strings conform to a certain pattern or you keep track of your strings. As you noted yourself, a string that is already encoded can be encoded again, so you can't be 100% sure by looking at the string itself.
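
"Keeping track" could look something like the sketch below, which is only one possible design and not something the answer prescribes. The hypothetical `Param` class records the encoding state at the point where the string enters the program, so no guessing is needed later. It assumes Java 10+ and the form-style encoding of `URLEncoder`.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class Param {
    private final String raw; // always stored in decoded form

    private Param(String raw) {
        this.raw = raw;
    }

    /** Use when the caller knows the value is NOT encoded. */
    static Param fromRaw(String raw) {
        return new Param(raw);
    }

    /** Use when the caller knows the value IS encoded exactly once. */
    static Param fromEncoded(String encoded) {
        return new Param(URLDecoder.decode(encoded, StandardCharsets.UTF_8));
    }

    String raw() {
        return raw;
    }

    String encoded() {
        return URLEncoder.encode(raw, StandardCharsets.UTF_8);
    }
}
```

For the question's example, `Param.fromRaw("TEST==").encoded()` and `Param.fromEncoded("TEST%3D%3D").encoded()` both give `TEST%3D%3D`, so a search can always use the same form.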

flybywire
3

Check your URL for suspicious characters[1]. List of candidates:

WHITE_SPACE ,", < , > , { , } , | , \ , ^ , ~ , [ , ] , . and `

I use:

// Returns false as soon as the URL contains a character that must not
// appear unescaped; true only means that no unsafe character was found.
private static boolean isAlreadyEncoded(String passedUrl) {
        return !passedUrl.matches(".*[\\ \"\\<\\>\\{\\}|\\\\^~\\[\\]].*");
}

For the actual encoding I proceed with:

https://stackoverflow.com/a/49796882/1485527

Note: Even if your URL doesn't contain unsafe characters, you might still want to apply, e.g., Punycode encoding to the host name. So there is still plenty of room for additional checks.


[1] A list of candidates can be found in the section "unsafe" of the URL spec at Page 2. In my understanding '%' or '#' should be left out in the encoding check, since these characters can occur in encoded URLs as well.

jschnasse
3

Using Spring UriComponentsBuilder:

import java.net.URI;
import org.springframework.web.util.UriComponentsBuilder;

private URI getProperlyEncodedUri(String uriString) {
    try {
        return URI.create(uriString);
    } catch (IllegalArgumentException e) {
        return UriComponentsBuilder.fromUriString(uriString).build().toUri();
    }
}
subject47
0

If you want to be sure that a string is encoded correctly (if it needs to be encoded at all), just decode it and encode it once again.

metacode:

100%_correctly_encoded_string = encode(decode(input_string))

An already encoded string will remain untouched. An unencoded string will be encoded. A string with only URL-allowed characters will remain untouched too.
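
In Java the metacode might be sketched as follows, assuming Java 10+ and the form-style encoding of `URLEncoder`/`URLDecoder`. The class and method names are made up for illustration.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class Normalizer {
    /**
     * encode(decode(input)).
     * Throws IllegalArgumentException if the input contains a malformed escape.
     */
    static String normalize(String input) {
        String decoded = URLDecoder.decode(input, StandardCharsets.UTF_8);
        return URLEncoder.encode(decoded, StandardCharsets.UTF_8);
    }
}
```

Both `normalize("TEST==")` and `normalize("TEST%3D%3D")` return `TEST%3D%3D`. Two caveats apply with these classes: `normalize("a+b")` returns `a+b` because the plus is read as an encoded space, so a raw plus is never encoded to `%2B`, and an unencoded input such as `100%s` throws instead of being encoded.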

esergion
  • 2
    Nope. Test it with an unencoded string containing "%s" in it. Code designed like this will fail to execute because of the IllegalArgumentException caused by an invalid "%xy", where xy are supposed to be hex digits. Same problem as the accepted answer, and one that tempts additional poor design flaws, such as ignoring unknown exception types. – Darrin Jan 27 '20 at 23:31
0

According to the spec (https://www.rfc-editor.org/rfc/rfc3986), all URLs MUST start with a scheme followed by a colon (:).

Since colons are required as the delimiter between a scheme and the rest of the URI, any string that contains a colon is not encoded.

(This assumes you will not be given an incomplete URI with no scheme.)

So you can test whether the string contains a colon. If it does not, URL-decode it. If the decoded string contains a colon, the original string was URL-encoded. If it still does not, check whether the decoded string differs from its input: if so, URL-decode again and repeat; if not, it is not a valid URI.

You can make this loop simpler if you know what schemes you can expect.
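
The loop described above could be sketched in Java like this. The names `SchemeCheck` and `encodingDepth`, and the convention of returning -1 for "not a URI", are assumptions made for the example. It assumes Java 10+.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class SchemeCheck {
    /**
     * Counts how many decoding rounds are needed before the scheme colon appears.
     * Returns 0 for an unencoded URI and -1 if no colon ever shows up.
     */
    static int encodingDepth(String uri) {
        String current = uri;
        int depth = 0;
        while (!current.contains(":")) {
            String decoded;
            try {
                decoded = URLDecoder.decode(current, StandardCharsets.UTF_8);
            } catch (IllegalArgumentException e) {
                return -1; // malformed escape, cannot be decoded further
            }
            if (decoded.equals(current)) {
                return -1; // nothing changed and there is still no colon
            }
            current = decoded;
            depth++;
        }
        return depth;
    }
}
```

This only works when the encoder escapes the colon, as `URLEncoder` does. An encoder that leaves `:` untouched, such as JavaScript's `encodeURI`, produces strings that always report a depth of 0.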

Community
Luke Mlsna
0

Thanks to this answer, I wrote a JavaScript function that encodes the URL exactly once with encodeURI, so you can call it without needing to know whether the URL is already encoded.

ES6:

var getUrlEncoded = sURL => {
    if (decodeURI(sURL) === sURL) return encodeURI(sURL)
    return getUrlEncoded(decodeURI(sURL))
}

Pre ES6:

var getUrlEncoded = function(sURL) {
    if (decodeURI(sURL) === sURL) return encodeURI(sURL)
    return getUrlEncoded(decodeURI(sURL))
}

Here are some tests so you can see the URL is only encoded once:

getUrlEncoded("https://example.com/media/Screenshot27 UI Home.jpg")
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(encodeURI("https://example.com/media/Screenshot27 UI Home.jpg"))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(encodeURI(encodeURI("https://example.com/media/Screenshot27 UI Home.jpg")))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(decodeURI("https://example.com/media/Screenshot27 UI Home.jpg"))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
getUrlEncoded(decodeURI(decodeURI("https://example.com/media/Screenshot27 UI Home.jpg")))
//"https://example.com/media/Screenshot27%20UI%20Home.jpg"
Alberto