19

How do I encode or 'escape' the URL before I use OpenURI to open(url)?

We're using OpenURI to open a remote URL and return the XML:

getresult = open(url).read

The problem is that the URL contains user-input text, which may include spaces and other characters such as "+", "&", and "?", so we need to escape the URL safely. I saw lots of examples for Net::HTTP, but have not found any for OpenURI.

We also need to be able to un-escape a similar string we receive in a session variable, so we need the reciprocal function.

the Tin Man
jpw

4 Answers

35

Don't use URI.escape: it has been deprecated since Ruby 1.9 and was removed in Ruby 3.0.

Rails' Active Support adds Hash#to_query:

 {foo: 'asd asdf', bar: '"<#$dfs'}.to_query
 # => "bar=%22%3C%23%24dfs&foo=asd+asdf"

As you can see, it also sorts the query parameters, so the same hash always produces the same string, which is good for HTTP caching.
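Outside Rails, a similar result can be sketched with the standard library alone. This is only an approximation of Hash#to_query, not its implementation: the helper names sorted_query and parse_query are made up for illustration, and the sorting is done by hand. URI.decode_www_form provides the reverse direction.

```ruby
require 'uri'

# Hypothetical helper: build a query string with the keys in sorted order,
# so the same parameters always produce the same string.
def sorted_query(params)
  URI.encode_www_form(params.sort_by { |key, _value| key.to_s })
end

# Hypothetical helper: the inverse, returning a Hash with string keys.
def parse_query(query)
  URI.decode_www_form(query).to_h
end

query = sorted_query({ foo: 'asd asdf', bar: '"<#$dfs' })
# query is "bar=%22%3C%23%24dfs&foo=asd+asdf"

params = parse_query(query)
# params maps "bar" and "foo" back to the original, unescaped strings
```

Note that URI.encode_www_form encodes a space as +, just like the to_query output above.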

the Tin Man
Ernest
  • In Rails 4.2 I notice this runs the following code: "#{CGI.escape(key.to_param)}=#{CGI.escape(to_param.to_s)}" – Ed_ Mar 17 '15 at 22:35
  • 2
    @Ed_, thank you - I've pasted invalid link to Object#to_query, where it should be Hash#to_query. – Ernest Mar 17 '15 at 23:07
15

Ruby Standard Library to the rescue:

require 'uri'
user_text = URI.escape(user_text)
url = "http://example.com/#{user_text}"
result = open(url).read

See the docs for the URI::Escape module for more. It also has a method for the inverse operation (unescape).
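URI.escape is not available on Ruby 3.0 and later, so the snippet above only runs on older versions. A minimal sketch of the same idea for newer Rubies, assuming ERB::Util.url_encode from the standard library as a stand-in (it encodes a space as %20). The helper names escape_segment and unescape_segment and the sample input are made up for illustration:

```ruby
require 'erb'
require 'uri'

# Hypothetical helper: escape one user-supplied piece of a URL.
def escape_segment(text)
  ERB::Util.url_encode(text)
end

# Hypothetical helper: the inverse of escape_segment.
def unescape_segment(text)
  URI.decode_www_form_component(text)
end

user_text = 'rock & roll?'               # hypothetical user input
escaped   = escape_segment(user_text)    # "rock%20%26%20roll%3F"

# Escape only the user-supplied part, never the whole URL.
url = "http://example.com/#{escaped}"

original = unescape_segment(escaped)     # back to "rock & roll?"
```

The resulting url can then be handed to OpenURI as before.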

Jacob
  • also very helpful thank you. not sure whether I'll use uri or addressable. Thanks! – jpw Feb 11 '11 at 18:25
  • 2
    Oh, just saw that URI.encode takes full URL. No wonder it's giving problems. So... Don't use it ;) – Jacob Feb 11 '11 at 19:43
  • WTF? URI.encode has an absolutely impossible spec; there's no way to identify the unescaped portions of a string - this is just a security leak waiting to happen. – Eamon Nerbonne Jan 20 '14 at 16:21
8

The main point is that you have to escape the keys and values separately, before you compose the full URL.

Any method that takes the full URL and tries to escape it afterwards is broken, because it cannot tell whether a given & or = was meant as a separator or as part of a key or value.
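To make the ambiguity concrete, here is a small sketch using only the standard library. The helper names naive_query and escaped_query and the sample value are invented for this illustration:

```ruby
require 'uri'

# Hypothetical helper: interpolate the value without escaping it.
def naive_query(value)
  "q=#{value}"
end

# Hypothetical helper: escape the value before composing the query.
def escaped_query(value)
  "q=#{URI.encode_www_form_component(value)}"
end

value = 'fish&chips=tasty'   # hypothetical user input

# Unescaped, the "&" and "=" are read as separators, so the parser
# sees two parameters: q=fish and chips=tasty.
naive_pairs = URI.decode_www_form(naive_query(value))

# Escaped first, the value survives as the single parameter q.
safe_pairs = URI.decode_www_form(escaped_query(value))
```

Once the naive string has been built, no later escaping step can recover the information that the & belonged to the value.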

The CGI library seems to do a good job, except for the space character, which was traditionally encoded as + and nowadays should be encoded as %20. This is easy to fix.

Consider the following:

require 'cgi'

def encode_component(s)
  # The space-encoding is a problem:
  CGI.escape(s).gsub('+','%20')
end

def url_with_params(path, args = {})
  return path if args.empty?
  path + "?" + args.map do |k,v|
    "#{encode_component(k.to_s)}=#{encode_component(v.to_s)}" 
  end.join("&")
end

def params_from_url(url)
  path, query = url.split('?', 2)
  return [path, {}] unless query
  q = query.split('&').each_with_object({}) do |pair, memo|
    next if pair.empty?
    k, v = pair.split('=', 2)
    # A key without "=" has no value; treat it as an empty string.
    memo[CGI.unescape(k)] = CGI.unescape(v.to_s)
  end
  [path, q]
end

u = url_with_params( "http://example.com",
                            "x[1]"  => "& ?=/",
                            "2+2=4" => "true" )

# "http://example.com?x%5B1%5D=%26%20%3F%3D%2F&2%2B2%3D4=true"

params_from_url(u)
# ["http://example.com", {"x[1]"=>"& ?=/", "2+2=4"=>"true"}]
the Tin Man
Arsen7
  • 2
    don't use CGI.escape, it goes against the spec and converts spaces to + instead of %20 – bluesmoon May 08 '12 at 17:12
  • I do not understand. The `+` is perfectly good when we talk about escaping space in a URI, I believe. Why do you think it is not supposed to be used? – Arsen7 May 09 '12 at 10:06
  • 1
    Arsen7, the + is deprecated. It's what was used in the old CGI days before URL encoding was standardized. The only reason + still works is because of backwards compatibility. – bluesmoon May 16 '12 at 10:14
  • 1
    Well, you are perfectly right, but the problem is that there was no other reliable way to properly escape a URI component at the time. The `CGI::escape` does everything properly, except the `+`, and you probably can just do a `gsub` on the result. But if you are using **ruby 1.9+**, then it seems like the function `URI.encode_www_form_component` could be used instead. – Arsen7 Oct 23 '14 at 09:43
2

Ruby has the built-in URI library, and there is also the Addressable gem, in particular Addressable::URI.

I prefer Addressable::URI. It's very full-featured and handles the encoding for you when you use the query_values= method.

I've seen some discussions about URI going through some growing pains so I tend to leave it alone for handling encoding/escaping until these things get sorted out:

the Tin Man