
I'm trying to clean up some auto-generated code where input URL fragments:

  1. may include spaces, which need to be %-escaped (as %20, not +)
  2. may include other URL-invalid characters, which also need to be %-escaped
  3. may include path separators, which need to be left alone (/)
  4. may include already-escaped components, which must not be double-escaped

The existing code uses libcurl (via Typhoeus and Ethon), which, like command-line curl, seems to happily accept spaces in URLs.

The existing code is all string-based and has a number of shenanigans involving removing extra slashes, adding missing slashes, etc. I'm trying to replace this with URI.join(), but this fails with bad URI(is not URI?) on the fragments with spaces.
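A minimal reproduction of that failure, assuming a made-up base URL and fragment (neither comes from the real generated code):

```ruby
require 'uri'

base = 'http://example.org/base/'

# A fragment containing a raw space is rejected when URI.join parses it.
begin
  URI.join(base, 'some dir/file.txt')
rescue URI::InvalidURIError => e
  join_error = e
end

# The same fragment with the space already escaped joins cleanly.
joined = URI.join(base, 'some%20dir/file.txt').to_s
```

So the fragments have to be escaped before they reach URI.join, which is what the rest of this question is about.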

The obvious solution is to use the (deprecated) URI.escape, which escapes spaces, but leaves slashes alone:

s = 'http://example.org/ spaces /<"punc^tu`ation">/non-ascïï 𝖈𝖍𝖆𝖗𝖘/＆ｃ．'
URI.escape(s)
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E" 

This mostly works, except for case (4) above: previously escaped components get double-escaped.

s1 = URI.escape(s)
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E"
URI.escape(s1)
# => "http://example.org/%2520spaces%2520/%253C%2522punc%255Etu%2560ation%2522%253E/non-asc%25C3%25AF%25C3%25AF%2520%25F0%259D%2596%2588%25F0%259D%2596%258D%25F0%259D%2596%2586%25F0%259D%2596%2597%25F0%259D%2596%2598/%25EF%25BC%2586%25EF%25BD%2583%25EF%25BC%258E" 

The recommended alternatives to URI.escape, e.g. CGI.escape and ERB::Util.url_encode, are not suitable as they mangle the slashes (among other problems):

CGI.escape(s)
# => "http%3A%2F%2Fexample.org%2F+spaces+%2F%3C%22punc%5Etu%60ation%22%3E%2Fnon-asc%C3%AF%C3%AF+%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98%2F%EF%BC%86%EF%BD%83%EF%BC%8E"
ERB::Util.url_encode(s)
# => "http%3A%2F%2Fexample.org%2F%20spaces%20%2F%3C%22punc%5Etu%60ation%22%3E%2Fnon-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98%2F%EF%BC%86%EF%BD%83%EF%BC%8E"
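One partial workaround is to apply ERB::Util.url_encode to each path segment rather than to the whole string, so the separators are never passed to the encoder. This is only a sketch: `escape_path_segments` is a hypothetical helper name, it is meant for a path and not for a full URL (it would still mangle the `:` after the scheme), and it does not address requirement 4.

```ruby
require 'erb'

# Hypothetical helper: escapes each path segment separately so that the
# '/' separators survive. Intended for a path only, not a full URL.
def escape_path_segments(path)
  # The -1 limit keeps trailing empty segments, so a trailing '/' is preserved.
  path.split('/', -1).map { |segment| ERB::Util.url_encode(segment) }.join('/')
end

escape_path_segments('/ spaces /a b')   # spaces become %20, slashes are kept
escape_path_segments('/%20spaces%20/')  # caveat: existing escapes are escaped again
```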

Is there a clean, out-of-the-box way to preserve existing slashes, escapes, etc. and escape only invalid characters in a URI string?

So far the best I've been able to come up with is something like:

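require 'uri'

# Pull RESERVED and UNRESERVED (RFC 2396 character-class fragments) into scope
include URI::RFC2396_Parser::PATTERN

# Match anything that is not '%', reserved, or unreserved
INVALID = Regexp.new("[^%#{RESERVED}#{UNRESERVED}]")

def escape_invalid(str)
  parser = URI::RFC2396_Parser.new
  parser.escape(str, INVALID)
end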

This seems to work:

s2 = escape_invalid(s)
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E"
s2 == escape_invalid(s2)
# => true 

but I'm not confident in the regex concatenation (even though that is how URI::RFC2396_Parser works internally), and I know it doesn't handle all cases: for example, a % that isn't part of a valid hex escape should probably be escaped. I'd much rather find a standard-library solution.
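For the stray-% case, a possible variant is a single regex with a negative lookahead, using only String#gsub. This is a sketch under assumptions, not a library-standard answer: `escape_invalid_strict` and `NEEDS_ESCAPE` are hypothetical names, and the "safe" set is taken from the RFC 3986 reserved and unreserved characters rather than the RFC 2396 sets used above, so `#` is left alone here.

```ruby
# Matches either a '%' that is not followed by two hex digits, or any
# character outside the RFC 3986 unreserved + reserved sets (assumption:
# that is what "valid" should mean for these fragments).
NEEDS_ESCAPE = %r{%(?![0-9A-Fa-f]{2})|[^A-Za-z0-9\-._~:/?\#\[\]@!$&'()*+,;=%]}

# Hypothetical helper: percent-encodes every byte of each offending character.
def escape_invalid_strict(str)
  str.gsub(NEEDS_ESCAPE) do |ch|
    ch.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

escaped = escape_invalid_strict('/ spaces /100%/already%20done')
escaped == escape_invalid_strict(escaped) # a second pass changes nothing
```

One limitation remains: a literal % that happens to be followed by two hex digits (say, the text "100%41") cannot be told apart from a real escape, so it is passed through unchanged.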

David Moles
  • https://stackoverflow.com/a/13059657/128421 might be helpful. – the Tin Man Feb 06 '20 at 00:24
  • Where did you get those strings? Maybe we can just assume that the strings containing % are already escaped, and escape only those that have no %. – Aetherus Feb 06 '20 at 01:17
  • So, taking your 1/2/3 points at the top, you just want to handle spaces, and leave everything else alone. BTW, URL encoding uses `+` for spaces, although `%20` works too. So... why not just `s.gsub(' ', '+')`? – Amadan Feb 06 '20 at 02:50
  • @Amadan the example I gave involves spaces, but I also need to handle `@`, non-ASCII characters, etc. I've updated accordingly. – David Moles Feb 07 '20 at 17:17
  • @theTinMan I read that. It was helpful in reinforcing my general instinct to handle URL paths and query parameters separately. – David Moles Feb 07 '20 at 17:20
  • @Aetherus That might not be a bad heuristic. – David Moles Feb 07 '20 at 17:20
