17

I'm testing PHP urlencode() vs. Java java.net.URLEncoder.encode().

Java

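import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeTest {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Build a string of every character from U+0020 to U+00FF.
        StringBuilder all = new StringBuilder();
        for (int i = 32; i < 256; ++i) {
            all.append((char) i);
        }

        System.out.println("All characters:         -||" + all + "||-");
        System.out.println("Encoded characters:     -||" + URLEncoder.encode(all.toString(), "UTF-8") + "||-");
    }
}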

PHP

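// Build a byte string containing every byte from 0x20 to 0xFF.
$all = "";
for ($i = 32; $i < 256; ++$i) {
    $all .= chr($i);
}

echo $all . PHP_EOL;
// utf8_encode() converts ISO-8859-1 to UTF-8 so the bytes match Java's UTF-8 input
// (note: utf8_encode() is deprecated as of PHP 8.2).
echo urlencode(utf8_encode($all)) . PHP_EOL;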

Both functions seem to encode every character the same way, except for the asterisk: Java leaves it as * while PHP translates it to %2A. Which behaviour, if either, is supposed to be the 'right' one?

Note: I tried with rawurlencode(), too - no luck.

Francisco R
etienne
  • I've asked a [similar question](http://stackoverflow.com/questions/25085992/when-should-an-asterisk-be-encoded-in-an-http-url) to try to get a more comprehensive answer. – Riley Major Aug 01 '14 at 18:18

3 Answers

11

It is okay to have a * in a URL, and it is equally okay to have it in its encoded form, %2A.

RFC1738: Uniform Resource Locators (URL) states the following:

Reserved:

[...]

Usually a URL has the same interpretation when an octet is represented by a character and when it encoded. However, this is not true for reserved characters: encoding a character reserved for a particular scheme may change the semantics of a URL.

Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.

On the other hand, characters that are not required to be encoded (including alphanumerics) may be encoded within the scheme-specific part of a URL, as long as they are not being used for a reserved purpose.
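As a quick check of the point quoted above, the sketch below decodes a literal `*` and its percent-encoded form `%2A` with `java.net.URLDecoder` and compares the results. The class and method names are hypothetical, chosen only for illustration, and the `Charset` overload of `decode` assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative and not from the question.
public class AsteriskDecodeDemo {

    // Returns true when both inputs decode to the same string.
    static boolean decodeToSame(String a, String b) {
        return URLDecoder.decode(a, StandardCharsets.UTF_8)
                .equals(URLDecoder.decode(b, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Java's encoder emits "a*b"; PHP's urlencode() emits "a%2Ab".
        System.out.println(decodeToSame("a*b", "a%2Ab")); // prints "true"
    }
}
```

Since both forms decode to the same value, a receiver that decodes its input should not care which of the two encoders produced it.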

aioobe
  • +1 In fact, from the set `$-_.+!*'(),`, Java uses only `-_.*` in unencoded form: http://docs.oracle.com/javase/7/docs/api/java/net/URLEncoder.html – caw Feb 19 '14 at 05:08
  • And the only difference between Java and PHP seems to be the asterisk: PHP uses `%2A` while Java uses `*`. – caw Feb 19 '14 at 05:14
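
If the observation in the comment above holds, and the asterisk really is the only difference, then byte-identical output from both languages could be obtained by escaping `*` after the fact on the Java side. This is only a sketch under that assumption; `PhpStyleEncoder` is a hypothetical name, and the `Charset` overload of `URLEncoder.encode` assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; assumes '*' is the only character the two encoders treat differently.
public class PhpStyleEncoder {

    // Encodes with URLEncoder, then escapes '*' the way the question reports PHP's urlencode() does.
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8).replace("*", "%2A");
    }

    public static void main(String[] args) {
        System.out.println(encode("a*b c")); // prints "a%2Ab+c"
    }
}
```

The replacement is safe because URLEncoder never emits a literal `*` for any other reason than an asterisk in the input.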
8

Wikipedia suggests that * is a reserved character when it comes to URIs, and that it must be encoded if not used for the reserved purpose. According to RFC3986, pages 12-13:

URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

  reserved    = gen-delims / sub-delims

  gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

(The reason the URL RFC still allows the * character to go unencoded is that it doesn't have a reserved purpose in URLs, and as such doesn't have to be encoded. So whether you have to encode it or not depends on what sort of URI you're creating.)
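
For illustration, here is a minimal sketch of a conservative encoder in the spirit of RFC 3986: it leaves only the "unreserved" characters (letters, digits, `-`, `.`, `_`, `~`) as they are and percent-encodes every other UTF-8 byte, including `*` and the rest of the sub-delims. This is stricter than the RFC requires, since reserved characters only have to be encoded when they conflict with a delimiter, and the class name `Rfc3986Encoder` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: percent-encodes everything outside RFC 3986's "unreserved" set.
public class Rfc3986Encoder {

    private static final String UNRESERVED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (UNRESERVED.indexOf(v) >= 0) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("a*b c~")); // prints "a%2Ab%20c~"
    }
}
```

Note that, unlike URLEncoder and urlencode(), this sketch writes a space as `%20` rather than `+` and leaves `~` unencoded.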

Community
You
  • Could you please include the quote from the page that states that `*` should be encoded? – aioobe Jun 30 '11 at 10:54
  • @aioobe: Done. There seems to be a discrepancy between the URL and URI RFCs, where the URL RFC in effect overrides the URI RFC requirement to encode `*`. So the answer really depends on what kind of URI you're creating. – You Jun 30 '11 at 11:01
  • `urlencode` and `java.net.URLEncoder` indicate that he's after a URL though. – aioobe Jun 30 '11 at 11:03
  • RFC3986 explicitly states that it updates RFC1738, so I would think that any inconsistency would be resolved in favor of RFC3986. RFC3986 says that a URL is an example of a URI, and if URIs must have the asterisk encoded, then the URL should as well. But various online tools do it differently (see, e.g., http://meyerweb.com/eric/tools/dencoder/ and http://www.url-encode-decode.com/.) – Riley Major Aug 01 '14 at 16:20
2

The Javadoc for URLEncoder refers to the HTML specification:

This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format. For more information about HTML form encoding, consult the HTML specification.

HTML4 is quite unclear regarding this question and refers to RFC1738, which is quoted by aioobe:

Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').

However, HTML5 directly states that * should not be encoded:

  • If the character isn't in the range U+0020, U+002A, U+002D, U+002E, U+0030 to U+0039, U+0041 to U+005A, U+005F, U+0061 to U+007A
    Replace the character with a string formed as follows:
    ...
  • Otherwise
    Leave the character as is.
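
The rule quoted above can be sketched in Java roughly as follows. The sketch assumes UTF-8 as the character encoding and assumes that a space becomes `+`, as the HTML4 text quoted earlier states; the elided "..." step is taken to be the usual `%HH` escape. `FormEncoderSketch` is a hypothetical name, not part of any library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the HTML5 rule quoted above; not the spec's full algorithm.
public class FormEncoderSketch {

    // Bytes the quoted rule leaves untouched: '*', '-', '.', '_', digits and ASCII letters.
    static boolean isLeftAsIs(int v) {
        return v == 0x2A || v == 0x2D || v == 0x2E || v == 0x5F
                || (v >= 0x30 && v <= 0x39)
                || (v >= 0x41 && v <= 0x5A)
                || (v >= 0x61 && v <= 0x7A);
    }

    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v == 0x20) {
                out.append('+'); // assumption: space is written as '+'
            } else if (isLeftAsIs(v)) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("a b*c~")); // prints "a+b*c%7E"
    }
}
```

For this input the result matches what `java.net.URLEncoder.encode` produces, which is consistent with Java following the HTML form-encoding rules rather than the URI RFCs.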
axtavt