4

I just want to check my own sanity with this question. I have a filename with a + (plus) character in it, which is perfectly valid on some operating systems and filesystems (e.g. macOS and HFS+).

However, I am seeing an issue where I think that java.io.File#toURI() is not operating correctly.

For example:

new File("hello+world.txt").toURI().toString()

On my Mac machine returns:

file:/Users/aretter/code/rocksdb/hello+world.txt

However, IMHO that is not correct, because the + (plus) character from the filename has not been encoded in the URI. The URI does not represent the original filename at all: a + in a URI has a very different meaning from a + character in a filename.

So if we decode the URI, the plus will now be replaced with a (space) character, and we have lost information. e.g.:

URLDecoder.decode(new File("hello+world.txt").toURI().toURL().toString())

Which results in:

file:/Users/aretter/code/rocksdb/hello world.txt

What I would have expected instead would be something like:

new File("hello+world.txt").toURI().toString()

resulting in:

file:/Users/aretter/code/rocksdb/hello%2Bworld.txt

So that when it is later used and decoded the plus sign is preserved.

I am struggling to believe that such an obvious bug could be present in Java SE. Can someone point out where I am mistaken?

Also, if there is a workaround, I would like to hear about it. Keep in mind that I am not actually providing static strings as filenames to File, but rather reading a directory of files from disk, some of which may contain a + (plus) character.

adamretter
  • If I get your question correctly, you want `hello+world.txt` to be displayed as `hello%2Bworld.txt` – Ravi Oct 23 '17 at 18:36

5 Answers

3

Let me try to clarify:

  • The '+' (plus) character encodes a ' ' (space) in HTML form data (the application/x-www-form-urlencoded MIME format).
  • The sequence '%20' encodes a ' ' (space) in the URL/URI format.

In the path of a URL, the '+' (plus) character is treated as a normal character and is not percent-encoded (e.g. as %2B).

So when you call new File("hello+world.txt").toURI().toString(), no encoding is performed for the '+' character, simply because it is not required.

Now to URLDecoder: this is a utility class for HTML form decoding. It treats '+' as an encoded character and hence decodes it to a ' ' (space). In your example, it treats the URI's string value as an ordinary HTML form field value, not as a URI. This class should never be used to decode a full URI/URL, as it is not designed for that purpose.

From the Javadoc of URLDecoder#decode(String):

Decodes a x-www-form-urlencoded string. The platform's default encoding is used to determine what characters are represented by any consecutive sequences of the form "%xy".
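To make the difference concrete, here is a minimal sketch that decodes the same file URI in both ways. The class and helper names are made up for illustration, and it assumes UTF-8 is the charset you want for form decoding:

```java
import java.io.File;
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

class PlusInFileUri {

    // URI-style decoding: only %xy escapes are unescaped, '+' is left alone.
    static String uriDecodedPath(URI uri) {
        return uri.getPath();
    }

    // Form-style decoding: '+' becomes ' ' and %xy escapes are unescaped.
    static String formDecoded(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = new File("hello+world.txt").toURI();
        System.out.println(uri);                          // ends with /hello+world.txt
        System.out.println(uriDecodedPath(uri));          // '+' is preserved
        System.out.println(formDecoded(uri.toString()));  // '+' is turned into a space
    }
}
```

The absolute prefix of the printed paths depends on the working directory, so only the file name part is predictable.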

Hope it helps.

Update #1 based on comments:

As per RFC 3986 section 2.2, if data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

It is also important that different parts of a URI have different sets of reserved characters, depending on their context. For example, the / sign is reserved only in the path part of a URI, and the + sign is reserved in the query string part. So there is no need to escape / in the query part and, similarly, no need to escape + in the path part.

In your example, the URI producer File.toURI does not encode the + sign in the path part of the URI (since + is not considered a reserved character in the path part), so you see the + sign in the URI's string representation.

You may refer to the URI recommendation for more details.
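The per-component behaviour can be observed with the multi-argument constructor of java.net.URI, which quotes only the characters that are illegal in each component. The following is a rough sketch rather than anything definitive; the class name, the host and the paths are invented for the example:

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriComponentQuoting {

    // Builds a hierarchical URI from unencoded components. The constructor
    // percent-encodes characters that are not legal in the given component.
    static URI build(String scheme, String authority, String path, String query) {
        try {
            return new URI(scheme, authority, path, query, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical path containing both a space and a plus sign.
        URI fileUri = build("file", null, "/tmp/hello world+plus.txt", null);
        System.out.println(fileUri);               // file:/tmp/hello%20world+plus.txt
        System.out.println(fileUri.getRawPath());  // /tmp/hello%20world+plus.txt
        System.out.println(fileUri.getPath());     // /tmp/hello world+plus.txt

        // Same idea with a query component: the space is quoted, the '+' is not.
        URI httpUri = build("http", "example.com", "/a+b", "q=c+d e");
        System.out.println(httpUri.getRawQuery()); // q=c+d%20e
    }
}
```

Only the space is quoted as %20; the + passes through untouched, and getPath() returns the original name.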

Related answers:

  1. https://stackoverflow.com/a/1006074/1700467
  2. https://stackoverflow.com/a/2678602/1700467
  3. https://stackoverflow.com/a/4571518/1700467
skadya
  • My issue is with `new File("hello+world.txt").toURI()` not escaping the + in the URI, because a plus in a filename needs to be encoded when converting to a URI to preserve it. – adamretter Oct 23 '17 at 22:27
  • @adamretter But that is exactly his point, the `+` character **does not** have to be encoded. Read e.g. the [URI specification BNF](https://www.w3.org/Addressing/URL/5_URI_BNF.html). There the `+` is sort of special, but only because it has special meaning in `search` and is only valid in `path`. But that's it, the `+` here is part of `path`, it is valid and doesn't need to be encoded. (would you include e.g. `?` that would be different) – Cryptjar Oct 24 '17 at 00:31
  • I don't agree. RFC 3986 section 2.2 clearly shows that `+` is a reserved character in a URI: https://tools.ietf.org/html/rfc3986#section-2.2. In particular and I quote from the RFC: 1) "URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent.", and 2) "URI producing applications should percent-encode data octets that correspond to characters in the reserved set". In this instance I consider `File.toURI` to be a URI producing application. – adamretter Oct 24 '17 at 08:45
  • Thanks for your comprehensive answer. However, I am still not convinced that the `+` should not be encoded. Whilst a `+` is allowed in the `pchar` part of the URI defined in the ABNF for URI https://tools.ietf.org/html/rfc3986#appendix-A, so are percent encoded characters, the ABNF represents the encoded form. As the spec says in section 2.4: "Once produced, a URI is always in its percent-encoded form." i.e. a URI is always in the encoded form, and a `+` in a URI path represents a space character and not a `+` character if I am not mistaken. The ABNF, and section 2.2 and 2.4 seem to confirm? – adamretter Oct 27 '17 at 15:41
  • 1
    Ahh!!! Okay, so having re-read your updated answer and looked at a couple of times https://stackoverflow.com/a/4571518/1700467 I think I now understand. The problem here is really `URLDecoder` as it is designed for a very specific HTML use-case. Thanks for the great answer. – adamretter Oct 27 '17 at 15:57
1

I'm assuming you want to encode the + sign in your filename as %2B, so that you get a + sign back when you decode it.

If that is the case, then you need to use URLEncoder.encode:

System.out.println(URLEncoder.encode(new File("hello+world.txt").toURI().toString()));

It will encode all special characters, including the + sign. The output would be:

file%3A%2Fhome%2FT8hvs7%2Fhello%2Bworld.txt

Now, to decode it, use URLDecoder.decode:

System.out.println(URLDecoder.decode("file%3A%2Fhome%2FwQCXni%2Fhello%2Bworld.txt"));

It will display

file:/home/wQCXni/hello+world.txt
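The one-argument encode and decode overloads used above are deprecated because they depend on the platform's default charset. Below is a small sketch of the same round trip with an explicit charset, applied to a single file name rather than to a whole URI string. The class and method names are illustrative only, and the result is a form-encoded value, which is not necessarily what a URI consumer expects:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class FormCodecRoundTrip {

    // Form-encodes a value: ' ' becomes '+', and '+' becomes %2B.
    static String formEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Reverses formEncode: '+' becomes ' ', and %2B becomes '+'.
    static String formDecode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("hello+world.txt"));              // hello%2Bworld.txt
        System.out.println(formEncode("hello world.txt"));              // hello+world.txt
        System.out.println(formDecode(formEncode("hello+world.txt")));  // hello+world.txt
    }
}
```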
Ravi
0

This is not a bug; the URLDecoder documentation clearly says:

The plus sign "+" is converted into a space character " " .

You can do something like this: https://ideone.com/JHDkM4

import java.io.File;
import java.net.URLDecoder;

class Ideone
{
    public static void main (String[] args) throws Exception
    {
        File file = new File("hello+world.txt");
        // The URI keeps the '+' exactly as it appears in the file name.
        System.out.println(file.toURI().toString());
        // Form decoding turns the '+' into a space.
        System.out.println(URLDecoder.decode(file.toURI().toURL().toString(), "UTF-8"));
        // Workaround: percent-encode the '+' by hand.
        System.out.println(file.toURI().toString().replace("+", "%2B"));
    }
}
Kamil Witkowski
  • My issue is with `new File("hello+world.txt").toURI()` not escaping the + in the URI, because a plus in a filename needs to be encoded when converting to a URI to preserve it. – adamretter Oct 23 '17 at 22:27
0

If the URI represents a file, let the File class decode the URI.

Let's say we have a URI for a file, for example to get the file path of a jar file:

URI uri = MyClass.class.getProtectionDomain().getCodeSource().getLocation().toURI();

System.out.println(uri.toString());
=> BAD : will display the plus sign, but %20 for spaces

System.out.println(URLDecoder.decode(uri.toString(), StandardCharsets.UTF_8.toString()));
=> BAD : will display spaces instead of %20, but also instead of the plus sign

System.out.println(new File(uri).getAbsolutePath());
=> GOOD
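For the case in the question, where file names come from a directory listing, the same idea can be sketched as a round trip through File on both sides. The names below are hypothetical stand-ins for listed entries, and the class name is only illustrative:

```java
import java.io.File;
import java.net.URI;

class FileUriRoundTrip {

    // Converts a File to a URI and back, letting File do both conversions.
    static File roundTrip(File file) {
        URI uri = file.toURI();
        return new File(uri);
    }

    public static void main(String[] args) {
        // Hypothetical names; the files do not need to exist for this check.
        String[] names = {"hello+world.txt", "hello world.txt", "100%done.txt"};
        for (String name : names) {
            File original = new File(name).getAbsoluteFile();
            File restored = roundTrip(original);
            // Prints the URI form and the name recovered from it.
            System.out.println(original.toURI() + " -> " + restored.getName());
        }
    }
}
```

Because no form decoder is involved, the plus sign, the space and the percent sign should all come back unchanged.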

eisnard
-1

Try escaping the plus sign with a backslash (\), like so:

new File("hello\+world.txt").toURI().toString()
Dinh
  • That doesn't even compile! – adamretter Oct 09 '17 at 14:32
  • It does if you use it properly. It's your code, just with a \ inside the String; it does compile and work. – Dinh Oct 09 '17 at 18:21
  • 1
    import java.io.File; public class Test { public static void main(String args[]) { new File("hello\+world.txt").toURI().toString(); } } javac Test.java Test.java:5: error: illegal escape character – adamretter Oct 10 '17 at 19:28