I'm trying to clean up some auto-generated code where input URL fragments:
1. may include spaces, which need to be `%`-escaped (as `%20`, not `+`)
2. may include other URL-invalid characters, which also need to be `%`-escaped
3. may include path separators, which need to be left alone (`/`)
4. may include already-escaped components, which must not be doubly escaped
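For instance, a single fragment should come out roughly like this (illustrative values of my own, not from the real data):

```ruby
# Input fragment mixing a raw space, path separators, and an existing escape:
input  = 'some dir/already%20escaped/<odd> chars'

# What I want back: the space and the angle brackets get %-escaped, while the
# "/" separators and the existing "%20" are left untouched.
wanted = 'some%20dir/already%20escaped/%3Codd%3E%20chars'
```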
The existing code uses libcurl (via Typhoeus and Ethon), which, like command-line `curl`, seems to happily accept spaces in URLs.
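To illustrate the current behaviour described above (this assumes Typhoeus's top-level `Typhoeus.get` helper; the point is only that nothing on the Ruby side escapes or rejects the URL):

```ruby
require 'typhoeus'

# Nothing in this code path escapes the space; the URL is handed to libcurl
# as given, and we get a Typhoeus::Response back either way.
response = Typhoeus.get('http://example.org/ spaces /')
response.class # => Typhoeus::Response
```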
The existing code is all string-based and has a number of shenanigans involving removing extra slashes, adding missing slashes, etc. I'm trying to replace this with `URI.join()`, but that fails with `bad URI(is not URI?)` on the fragments with spaces.
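A minimal reproduction (the fragment here is made up):

```ruby
require 'uri'

URI.join('http://example.org/', 'some dir/file.txt')
# raises URI::InvalidURIError (bad URI(is not URI?))

# The same call is fine once the space is %-escaped:
URI.join('http://example.org/', 'some%20dir/file.txt')
# => #<URI::HTTP http://example.org/some%20dir/file.txt>
```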
The obvious solution is to use the (deprecated) `URI.escape`, which escapes spaces but leaves slashes alone:
```ruby
s = 'http://example.org/ spaces /<"punc^tu`ation">/non-ascïï 𝖈𝖍𝖆𝖗𝖘/＆ｃ．'
URI.escape(s)
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E"
```
This mostly works, except for case (4) above: previously escaped components get double-escaped.
```ruby
s1 = URI.escape(s)
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E"

URI.escape(s1)
# => "http://example.org/%2520spaces%2520/%253C%2522punc%255Etu%2560ation%2522%253E/non-asc%25C3%25AF%25C3%25AF%2520%25F0%259D%2596%2588%25F0%259D%2596%258D%25F0%259D%2596%2586%25F0%259D%2596%2597%25F0%259D%2596%2598/%25EF%25BC%2586%25EF%25BD%2583%25EF%25BC%258E"
```
The recommended alternatives to `URI.escape`, e.g. `CGI.escape` and `ERB::Util.url_encode`, are not suitable, as they mangle the slashes (among other problems):
```ruby
CGI.escape(s)
# => "http%3A%2F%2Fexample.org%2F+spaces+%2F%3C%22punc%5Etu%60ation%22%3E%2Fnon-asc%C3%AF%C3%AF+%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98%2F%EF%BC%86%EF%BD%83%EF%BC%8E"

ERB::Util.url_encode(s)
# => "http%3A%2F%2Fexample.org%2F%20spaces%20%2F%3C%22punc%5Etu%60ation%22%3E%2Fnon-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98%2F%EF%BC%86%EF%BD%83%EF%BC%8E"
```
Is there a clean, out-of-the-box way to preserve existing slashes, escapes, etc. and escape only invalid characters in a URI string?
So far the best I've been able to come up with is something like:
```ruby
require 'uri'

include URI::RFC2396_Parser::PATTERN

# Escape anything that is not "%", a reserved, or an unreserved character.
INVALID = Regexp.new("[^%#{RESERVED}#{UNRESERVED}]")

def escape_invalid(str)
  parser = URI::RFC2396_Parser.new
  parser.escape(str, INVALID)
end
```
This seems to work:
```ruby
s2 = escape_invalid(s)
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E"

s2 == escape_invalid(s2)
# => true
```
but I'm not confident in the regex concatenation (even if it is the way `URI::RFC2396_Parser` works internally), and I know it doesn't handle all cases (e.g., a `%` that isn't part of a valid hex escape should probably be escaped). I'd much rather find a standard library solution.
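That stray-`%` gap could probably be patched with something along these lines (a rough sketch on top of `escape_invalid` above; the lookahead and the `escape_invalid_strict` name are my own, not anything standard):

```ruby
# Sketch only: first turn any "%" that does NOT start a valid two-digit hex
# escape into "%25", then escape the remaining invalid characters as before.
def escape_invalid_strict(str)
  escape_invalid(str.gsub(/%(?![0-9A-Fa-f]{2})/, '%25'))
end

escape_invalid_strict('100% sure/already%20escaped')
# => "100%25%20sure/already%20escaped"
```

but that is just more hand-rolled escaping, which is exactly what I'm hoping to avoid.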