I'm involved in writing a (Java/Groovy) browser-automation app with Selenium 2 and the Firefox driver.
Currently there is an issue with some URLs we find in the wild that are apparently using bad URI syntax (specifically curly braces ({}), |'s and ^'s).
String url = driver.getCurrentUrl(); // http://example.com/foo?key=val|with^bad{char}acters
When trying to construct a java.net.URI from the string returned by driver.getCurrentUrl(), a URISyntaxException is thrown.
new URI(url); // java.net.URISyntaxException: Illegal character in query at index ...
Encoding the whole url before constructing the URI will not work (as I understand it).
The whole url gets encoded, delimiters included, so it doesn't preserve any pieces that I can parse in any normal fashion. For example, with this URI-safe string, URI can't tell the difference between a & acting as the query-string parameter delimiter and a %26 (its encoded value) inside the content of a single query-string parameter.
String encoded = URLEncoder.encode(url, "UTF-8"); // http%3A%2F%2Fexample.com%2Ffoo%3Fkey%3Dval%7Cwith%5Ebad%7Bchar%7Dacters
URI uri = new URI(encoded);
URLEncodedUtils.parse(uri, "UTF-8"); // []
Currently the solution is to run the following (Groovy) code before constructing the URI:
// Percent-encode only the characters that java.net.URI rejects
["|", "^", "{", "}"].each {
    url = url.replace(it, URLEncoder.encode(it, "UTF-8"))
}
But this seems dirty and wrong.
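For comparison, the same workaround can be generalized so it doesn't depend on a hard-coded list of four characters. The following is only a minimal Java sketch, not a known best practice: the class and method names (LenientUri, escapeIllegal, toUri) are made up for illustration, and it assumes that any % already in the string starts a valid escape such as %26. It percent-encodes every byte outside the RFC 3986 reserved/unreserved sets and leaves delimiters like & and = alone, so the query string can still be parsed afterwards. It does not validate the overall structure, so a stray % or square brackets in the path could still make java.net.URI throw.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class LenientUri {
    // Characters RFC 3986 permits somewhere in a URI (unreserved + reserved),
    // plus '%' so that existing escapes such as %26 are left untouched.
    // Assumption: every '%' in the input already starts a valid escape.
    private static final String ALLOWED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
        + "-._~:/?#[]@!$&'()*+,;=%";

    // Percent-encode every UTF-8 byte that is not in ALLOWED.
    static String escapeIllegal(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (byte b : raw.getBytes(StandardCharsets.UTF_8)) {
            if (b >= 0 && ALLOWED.indexOf((char) b) >= 0) {
                out.append((char) b);                            // legal ASCII: keep as-is
            } else {
                out.append(String.format("%%%02X", b & 0xFF));   // e.g. '|' becomes %7C
            }
        }
        return out.toString();
    }

    // Hypothetical helper: escape first, then let java.net.URI parse the result.
    static URI toUri(String raw) {
        return URI.create(escapeIllegal(raw));
    }

    public static void main(String[] args) {
        String url = "http://example.com/foo?key=val|with^bad{char}acters";
        // prints http://example.com/foo?key=val%7Cwith%5Ebad%7Bchar%7Dacters
        System.out.println(escapeIllegal(url));
        // prints key=val|with^bad{char}acters
        System.out.println(toUri(url).getQuery());
    }
}
```

Because & and = are never touched, a value containing %26 stays distinct from a real parameter delimiter, which is the property that encoding the whole url loses.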
I guess my question is multi-part:
- Why does FirefoxDriver return a String rather than a URI?
- Why is this String malformed?
- What is best practice for dealing with this kind of thing?