I'm using open-uri to open URLs.

resp = open("http://sub_domain.domain.com")

If it contains underscore I get an error:

URI::InvalidURIError: the scheme http does not accept registry part: sub_domain.domain.com (or bad hostname?)

I understand that this is because, according to the RFC, URLs can contain only letters and numbers. Is there any workaround?

John Bachir
Arty

9 Answers

20

This looks like a bug in URI, and open-uri, HTTParty, and many other gems make use of URI.parse.

Here's a workaround:

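require 'net/http'
require 'open-uri'

# Try open-uri first; fall back to Net::HTTP when URI rejects the hostname.
# Note: on Ruby 3.0+ use URI.open(url), since Kernel#open no longer opens URLs.
def hopen(url)
  open(url)
rescue URI::InvalidURIError
  host = url.match(%r{.+://([^/]+)})[1]
  path = url.partition(host)[2]
  path = "/" if path.empty? # partition returns "", never nil
  Net::HTTP.get host, path  # returns the body as a String
end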

resp = hopen("http://dear_raed.blogspot.com/2009_01_01_archive.html")
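The fallback branch above boils down to splitting the URL into a host and a path without going through URI.parse. Here is a minimal sketch of just that step; `split_host_and_path` is a hypothetical helper name (not part of any library), and it assumes an absolute URL with no userinfo or port:

```ruby
# Hypothetical helper: extract host and path from an absolute URL
# without URI.parse, so underscores in the host are not a problem.
# Assumes there is no userinfo ("user@") or port (":8080") in the URL.
def split_host_and_path(url)
  match = url.match(%r{\A[a-z][a-z0-9+.-]*://([^/?#]+)(.*)\z}i)
  raise ArgumentError, "not an absolute URL: #{url}" unless match

  host = match[1]
  rest = match[2]
  path = rest.empty? ? "/" : rest
  path = "/#{path}" unless path.start_with?("/") # e.g. "?q=1" becomes "/?q=1"
  [host, path]
end

host, path = split_host_and_path("http://dear_raed.blogspot.com/2009_01_01_archive.html")
puts host
puts path
```

The resulting pair can then be handed to Net::HTTP.get(host, path), as in the workaround.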
stef
  • This is an ugly hack, but it works. The problem is that one of our partners forces us to use this domain name and we even have to add it to the hosts files on all servers, because it won't resolve... very nice! – Alex Kovshovik Apr 11 '13 at 18:08
18

URI has an old-fashioned idea of what a URL looks like.

Lately I'm using addressable to get around that:

require 'open-uri'
require 'addressable/uri'

class URI::Parser
  def split url
    a = Addressable::URI::parse url
    [a.scheme, a.userinfo, a.host, a.port, nil, a.path, nil, a.query, a.fragment]
  end
end

resp = open("http://sub_domain.domain.com") # Yay!

Don't forget to gem install addressable

pguardiario
14

This initializer in my Rails app seems to make URI.parse work, at least:

# config/initializers/uri_underscore.rb
class URI::Generic
  def initialize_with_registry_check(scheme,
                 userinfo, host, port, registry,
                 path, opaque,
                 query,
                 fragment,
                 parser = DEFAULT_PARSER,
                 arg_check = false)
    if %w(http https).include?(scheme) && host.nil? && registry =~ /_/
      initialize_without_registry_check(scheme, userinfo, registry, port, nil, path, opaque, query, fragment, parser, arg_check)
    else
      initialize_without_registry_check(scheme, userinfo, host, port, registry, path, opaque, query, fragment, parser, arg_check)
    end
  end
  alias_method_chain :initialize, :registry_check
end
cluesque
6

Here is a patch that solves the problem for a wide variety of situations (rest-client, open-uri, etc.) without using external gems or overriding parts of URI.parse:

module URI
  DEFAULT_PARSER = Parser.new(:HOSTNAME => "(?:(?:[a-zA-Z\\d](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.)*(?:[a-zA-Z](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.?")
end

Source: lib/uri/rfc2396_parser.rb#L86

Ruby-core has an open issue: https://bugs.ruby-lang.org/issues/8241
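If you would rather not redefine the constant, the same hostname pattern can probably be passed to a private parser instance instead. This is only a sketch: it assumes a Ruby version where `URI::RFC2396_Parser` exists (2.2 or later), and the names `UNDERSCORE_HOSTNAME` and `lenient_parser` are made up for illustration:

```ruby
require 'uri'

# Same hostname pattern as in the patch above: "_" is allowed inside a label.
UNDERSCORE_HOSTNAME =
  "(?:(?:[a-zA-Z\\d](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.)*" \
  "(?:[a-zA-Z](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.?"

# A separate parser object, so URI::DEFAULT_PARSER is left untouched
# and no "already initialized constant" warning is triggered.
lenient_parser = URI::RFC2396_Parser.new(:HOSTNAME => UNDERSCORE_HOSTNAME)

uri = lenient_parser.parse("http://sub_domain.domain.com/path")
puts uri.host
puts uri.path
```

The trade-off is that only code calling `lenient_parser.parse` directly benefits; gems that call URI.parse internally are not affected by it.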

Larry Kyrala
  • This one works great. But it gives a warning because the constant DEFAULT_PARSER already exists. To prevent this I used: `module URI; original_verbose, $VERBOSE = $VERBOSE, nil; DEFAULT_PARSER = Parser.new(:HOSTNAME => "(?:(?:[a-zA-Z\\d](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.)*(?:[a-zA-Z](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.?"); $VERBOSE = original_verbose; end` – fraank Mar 15 '17 at 09:04
  • This also solved my issue when using URI.extract and it was breaking on links with underscores in the subdomain. – Andrew Spode May 15 '20 at 11:55
3

An underscore cannot be contained in a domain name like that. That is part of the DNS standard. Did you mean to use a dash (-)?

Even if open-uri didn't throw an error, such a command would be pointless. Why? Because there is no way it can resolve such a domain name. At best you'd get an unknown host error. There is no way for you to register a domain name with an _ in it, and even if you run your own private DNS server, using a _ is against the specification. You could bend the rules and allow it (by modifying the DNS server software), but then your operating system's DNS resolver won't support it, and neither will your router's DNS software.

Solution: Don't try to use a _ in a DNS name. It won't work anywhere and it's against the specifications.

Earlz
  • no, I meant exactly underscores. As I mentioned, I understand that it is not allowed by standards, but there are URLs like that (for instance on livejournal.com) and I have to deal with them. – Arty Mar 06 '11 at 06:35
  • @Arty ahh, I hadn't realized such a big player as livejournal would allow such RFC breakage. Welp, I don't know then :P – Earlz Mar 06 '11 at 16:43
  • 1
    Per [RFC 3986 section 2.3](http://www.ietf.org/rfc/rfc3986.txt), underscore is not a reserved character. It is unreserved. – the Tin Man Mar 07 '11 at 03:05
  • It's not just livejournal that does it, Windows allows underscores in machine names, and that way you can end up with a broken hostname having an underscore. It's true as Tin Man says that underscores are allowed in the hostname part of a generic URI since they're unreserved characters, but that doesn't contradict what Earlz says, that you can't (successfully) use such hostnames with DNS. The fact that it's allowed in a URI doesn't imply it'll actually resolve, presumably this is justified since registered domains and resolvable hosts are not the only possible uses of URIs. – Steve Jessop Sep 19 '11 at 23:05
  • These URLs will resolve, for example http://test_underscore.stackednotion.com/ will take you to my website. That still doesn't get around that it isn't a valid URL, so shouldn't be used though :) – Luca Spiller Dec 06 '11 at 15:38
  • 15
    **Subdomains are allowed to have underscores.** http://stackoverflow.com/a/2183140/203130 – coreyward Feb 25 '12 at 20:54
  • Amazon S3 buckets are also allowed to have underscores, and do create domain names. This means even more domains you'd like to access using ruby, URI, and OpenURI – cluesque Feb 11 '13 at 15:46
  • I beg to differ. Underscores work in subdomains just fine, in all cases I have encountered, as demonstrated by a great many S3 buckets. The spec may not allow it, but that's a different matter. – superluminary Oct 22 '13 at 16:06
  • RFC 2181 clarified some parts of the DNS spec and says a label may be any valid binary string. It also says that applications can have more constraints. Looking through RFC 3986, I read it as saying the hostname portion of the URI is valid if you can pass it to the DNS system. – kbyrd Nov 26 '13 at 17:58
2

Here is another ugly hack, no gem needed:

require 'uri'

# Parse with a placeholder host, then swap the real host back in.
def parse(url = nil)
  URI.parse(url)
rescue URI::InvalidURIError
  host = url.match(%r{.+://([^/]+)})[1]
  uri = URI.parse(url.sub(host, 'dummy-host'))
  uri.instance_variable_set('@host', host)
  uri
end
paulguy
sheerun
2

I had this same error while trying to use gem update / gem install etc., so I used the IP address instead and it's fine now.

Julian Mann
0

I recommend using the Curb gem: https://github.com/taf2/curb which just wraps libcurl. Here is a simple example that will automatically follow redirects and print the response code and response body:

require 'curb'

# url is whatever address you want to fetch, underscores included
rsp = Curl::Easy.http_get(url) { |curl| curl.follow_location = true; curl.max_redirects = 10 }
puts rsp.response_code
puts rsp.body_str

I usually avoid the Ruby URI classes since they are too strict about the spec, and as you know, the web is the wild west :) Curl / curb handles every URL I throw at it like a champ.

TomDavies
0

For anyone stumbling upon this:

Ruby's URI.parse used to be based on RFC 2396 (published in August 1998); see https://bugs.ruby-lang.org/issues/8241

Starting with Ruby 2.2, however, URI was upgraded to RFC 3986, so if you're on a modern version, no monkey patches are necessary.
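A quick way to check this on your own Ruby is to run the same URL through both parsers. This is a rough sketch that assumes Ruby 2.2 or later, where `URI::RFC2396_Parser` is available as a separate class; the legacy result is rescued rather than assumed, since its exact behaviour may differ between versions:

```ruby
require 'uri'

url = "http://sub_domain.domain.com"

# URI.parse goes through the RFC 3986 parser on Ruby 2.2 and later.
modern_host = URI.parse(url).host

# The legacy RFC 2396 parser is kept around for comparison. It is expected
# to reject the underscore, so any URI error is caught and reported.
legacy_result =
  begin
    URI::RFC2396_Parser.new.parse(url).host
  rescue URI::Error => e
    "rejected (#{e.class})"
  end

puts "RFC 3986: #{modern_host}"
puts "RFC 2396: #{legacy_result}"
```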

Vasfed
  • You are not wrong that URI supports RFC3986 parsing, but the default parser as of Ruby 3.0 and all prior versions is still `RFC2396` as you can see here: https://github.com/ruby/ruby/blob/v3_0_3/lib/uri/common.rb#L17 – oliverguenther Dec 13 '21 at 14:55