I have the following URL that I want to escape:

http://BUCKET_ENDPOINT/PATH_1/PATH_2/PATH_3/PATH_4/PATH_5/TEST NAME COULD BE WITH & AND OTHER SPECIAL CHARS.zip

I haven't found so far how to encode this string to match both storing in an HTML and encoded as a URL, e.g. '&' should be replaced with %26, a space should be replaced with %20, etc.

Java's URLEncoder will, for example, replace spaces with a '+' sign, which isn't what I'm looking for.

Ohad Benita

2 Answers

I haven't found so far how to encode this string to match both storing in an HTML and encoded as a URL

That's because there is no such encoding: those are two separate things.

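Printing in HTML should generally be done by replacing only &, ', ", < and > with &amp;, &apos;, &quot;, &lt; and &gt;. There are examples of doing that in "Recommended method for escaping HTML in Java"; the most trivial one, and the easiest to reason about, is the following. Note that & has to be replaced first, otherwise the ampersands of the entities that were just inserted would be escaped a second time.

public static String encodeToHTML(String str) {
    return str
        // '&' must come first so the entities added below are not re-escaped
        .replace("&",  "&amp;")
        .replace("'",  "&apos;")
        .replace("\"", "&quot;")
        .replace("<",  "&lt;")
        .replace(">",  "&gt;");
}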

Note that your page needs to declare a matching character set, and be aware that the requirements are a bit different if you print the URL in, for example, an attribute value.

Encoding as a URL allows a much shorter list of characters. From the URLEncoder documentation:

The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.

The special characters ".", "-", "*", and "_" remain the same.

The space character " " is converted into a plus sign "+".

All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte.

The recommended encoding scheme to use is UTF-8.

You'd get those with

String encoded = java.net.URLEncoder.encode(url, "UTF-8");

The above will give you HTML form encoding, which is close to what URL encoding does, with a few notable differences, the most relevant being + vs %20. To deal with that, you can do this on its output:

encoded = encoded.replace("+", "%20");
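
Put together, the two steps might look like the sketch below. The class and method names (SegmentEncoder, encodeSegment) are made up for illustration, and the sample file name is just a shortened version of the one in the question. Rewriting every '+' is safe here because URLEncoder has already turned any literal '+' in the input into %2B.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of any library.
class SegmentEncoder {

    // Form-encodes the text with URLEncoder, then rewrites '+' to %20.
    // At that point a '+' can only stand for a space, because a literal
    // '+' in the input has already been encoded as %2B.
    static String encodeSegment(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeSegment("TEST NAME & MORE.zip"));
        // prints TEST%20NAME%20%26%20MORE.zip
    }
}
```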

Note also that you don't want to apply URL encoding to the whole http://BUCKET_ENDPOINT/PATH_1/PATH_2/PATH_3/PATH_4/PATH_5/TEST NAME COULD BE WITH & AND OTHER SPECIAL CHARS.zip, but only to the last part of it, TEST NAME COULD BE WITH & AND OTHER SPECIAL CHARS.zip, and to the individual path segments if they are not fixed.
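
As a rough sketch of that per-segment approach (BucketUrlBuilder, buildUrl and encodeSegment are hypothetical names, and the base is assumed to be an already valid scheme and host without a trailing slash), each segment is encoded on its own and the '/' separators are added afterwards, so they never pass through the encoder:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of any library.
class BucketUrlBuilder {

    // Percent-encodes a single path segment (form encoding, then '+' -> %20).
    static String encodeSegment(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Appends each encoded segment to the base, separated by '/'.
    // The base itself is assumed to be valid already and is left untouched.
    static String buildUrl(String base, String... segments) {
        StringBuilder url = new StringBuilder(base);
        for (String segment : segments) {
            url.append('/').append(encodeSegment(segment));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://BUCKET_ENDPOINT",
                "PATH_1", "PATH_2", "TEST NAME & MORE.zip"));
        // prints http://BUCKET_ENDPOINT/PATH_1/PATH_2/TEST%20NAME%20%26%20MORE.zip
    }
}
```

Encoding the whole string in one go would also escape the ':' and '/' characters of the scheme and path, which is why the separators are kept out of the encoder.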

If you need to generate the URL and print it in HTML, first encode it as a URL, then apply HTML escaping.
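
A minimal sketch of that order of operations, assuming the link is printed as an anchor element: LinkRenderer, anchor, encodeSegment and escapeHtml are hypothetical names, and the base URL is assumed to be valid already and to end with a slash. The file name is URL-encoded before it goes into the href, and both the finished URL and the visible link text are HTML-escaped as the last step.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of any library.
class LinkRenderer {

    // Step 1: percent-encode the file name for use as a path segment.
    static String encodeSegment(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Step 2: escape for HTML. '&' goes first so that the entities
    // inserted by the later replacements are not escaped again.
    static String escapeHtml(String text) {
        return text
                .replace("&", "&amp;")
                .replace("'", "&apos;")
                .replace("\"", "&quot;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Builds the anchor: URL encoding first, HTML escaping last.
    static String anchor(String baseUrl, String fileName) {
        String url = baseUrl + encodeSegment(fileName);
        return "<a href=\"" + escapeHtml(url) + "\">" + escapeHtml(fileName) + "</a>";
    }

    public static void main(String[] args) {
        System.out.println(anchor("http://BUCKET_ENDPOINT/PATH_1/", "A & B.zip"));
        // prints <a href="http://BUCKET_ENDPOINT/PATH_1/A%20%26%20B.zip">A &amp; B.zip</a>
    }
}
```

In this example the encoded URL no longer contains any character that HTML escaping would change, but the escaping step is kept because a base URL with a query string could still contain a raw '&'.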

eis

Since I already know that the path part of the URL doesn't need special escaping, I decided to go with the solution proposed here and encode only the zip file name part, which answers the need in this case:

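 // Form-encode, then turn '+' into %20 and restore characters that may stay literal.
 // Backslashes are doubled because "\+" and "\%" are not valid Java string escapes.
 String urlEscaped = URLEncoder.encode(URL_TO_ESCAPE, "UTF-8")
            .replaceAll("\\+", "%20")
            .replaceAll("\\%21", "!")
            .replaceAll("\\%27", "'")
            .replaceAll("\\%28", "(")
            .replaceAll("\\%29", ")")
            .replaceAll("\\%7E", "~");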
Ohad Benita
  • "Since I already know that the path part of the URL doesn't need special escaping"... Well, why didn't you mention it in the question then? :) Also, there's really no need to use regular expressions here: you can just use replace() instead of replaceAll() and get rid of the regular expression escapes. – eis May 29 '18 at 06:40