Running this code with Mechanize 2.7.3 and Ruby 2.3.0dev:

require 'mechanize'
agent = Mechanize.new

agent.keep_alive = false
agent.open_timeout = 2
agent.read_timeout = 2
agent.ignore_bad_chunking = true
agent.gzip_enabled = false

url = 'http:%5C%5Cwww.scouts.org.uk'

agent.head(url)

Gives me this NoMethodError:

~/.rvm/gems/ruby-head/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:648:in `resolve': undefined method `length' for nil:NilClass (NoMethodError)
from ~/.rvm/gems/ruby-head/gems/mechanize-2.7.3/lib/mechanize/http/agent.rb:223:in `fetch'
from ~/.rvm/gems/ruby-head/gems/mechanize-2.7.3/lib/mechanize.rb:459:in `head'

Is this a bug in Mechanize, or am I doing something wrong? If so, how can it be fixed?

EDIT: The URL is obviously wrong, but I'm reading a lot of URLs from a file and some of them might be wrong.

EDIT2: Let's say I have a file like this: http://pastie.org/9934756. I need to get the head of all the URLs that are correct and ignore the others.

pinpox

3 Answers

1

Your URL is malformed. Try this instead: url = 'http://scouts.org.uk'

Oleksandr Holubenko
  • I know. But there are a lot of URLs and some of them may be wrong. Shouldn't the error be something like 404 Not Found? – pinpox Feb 10 '15 at 10:42
  • @user1759796 your mistake is the "%5C%5C", which makes the URL invalid. It must look like "http:// google.com/", "http:// scouts.org.uk" etc. (without the space) – Oleksandr Holubenko Feb 10 '15 at 10:48
  • See my edit. I know that the URL is wrong; I just need to deal with it correctly – pinpox Feb 10 '15 at 11:38
  • @user1759796 do they all look like the first one? – Oleksandr Holubenko Feb 10 '15 at 11:41
  • The URLs may be different, valid or invalid, e.g. 'http://something.com', "http://som%5c.com", "htt: asldkj.com", "ittp://something.com/" or whatever. I need to find the "good" URLs and then process them with Mechanize. If the URL returns 404 that's no problem, because I can rescue that. Sorry, Stack Overflow cuts off the "http://" part – pinpox Feb 10 '15 at 11:52
0

Your target site redirects and uses a meta refresh. Update your code to enable those options:

require 'mechanize'

agent = Mechanize.new
agent.keep_alive = false
agent.follow_meta_refresh = true
agent.redirect_ok = true
agent.open_timeout = 10
agent.read_timeout = 10
agent.ignore_bad_chunking = true
agent.gzip_enabled = false

url = 'http:%5C%5Cwww.scouts.org.uk'

begin
  page_head = agent.head(url)
rescue StandardError => exception
  # StandardError covers NoMethodError without also swallowing Interrupt or SystemExit
  puts "Caught exception: #{exception.message}"
end

Result:

=> #Caught exception: undefined method `length' for nil:NilClass
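
If the goal is to keep going past bad entries, the same rescue can be wrapped around each URL in the list. Here is a minimal sketch, assuming a hypothetical helper named `head_each` (not part of Mechanize) that takes the request as a block so it can be tried without a network connection. The `fake_head` lambda is only a stand-in for `agent.head`:

```ruby
# Hypothetical helper, not part of Mechanize: runs the given block once per
# URL and records failures instead of letting one bad URL stop the loop.
def head_each(urls)
  results = { ok: {}, failed: {} }
  urls.each do |url|
    begin
      results[:ok][url] = yield(url)
    rescue StandardError => e
      # NoMethodError is a StandardError, so the nil error lands here too.
      results[:failed][url] = "#{e.class}: #{e.message}"
    end
  end
  results
end

# Stand-in for agent.head so the sketch runs offline (assumption: the real
# call raises in a similar way for the malformed URL).
fake_head = lambda do |url|
  raise NoMethodError, "undefined method `length' for nil:NilClass" if url.include?('%5C')
  "headers for #{url}"
end

results = head_each(['http://example.com', 'http:%5C%5Cexample.com']) { |u| fake_head.call(u) }
puts results[:ok].keys.inspect
puts results[:failed].keys.inspect
```

With Mechanize, the block would be `{ |u| agent.head(u) }`, and the `failed` entries show which lines of the input file were skipped and why.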
JonB
  • That doesn't change anything. You used the correct URL (without the %5c). I need to get some sort of error that I can catch if this occurs, not a NoMethodError. The problem is I don't know if all URLs will have the correct format – pinpox Feb 10 '15 at 11:36
  • Updated the code to catch the exception. How you actually handle it is up to you, I just put a basic example. More about [Ruby Exceptions](http://ruby-doc.org/core-1.9.3/Exception.html) and [exception handling](http://rubylearning.com/satishtalim/ruby_exceptions.html). – JonB Feb 10 '15 at 12:36
  • You may also want to check out [this post](http://stackoverflow.com/questions/1805761/check-if-url-is-valid-ruby). – JonB Feb 10 '15 at 12:47
0

You can add this method to check whether a URL is valid:

require 'uri'

# Prints '+' for http(s) URLs, '-' for other schemes, 'false' when parsing fails.
def valid?(url)
  uri = URI.parse(url)
  if uri.is_a?(URI::HTTP)
    puts '+'
  else
    puts '-'
  end
rescue URI::InvalidURIError
  puts 'false'
end

['http://web.de',
 'http://web.de/',
 'http:%5c%5cweb.de',
 'http:web.de',
 'foo://web.de',
 'http://we b.de',
 'http://|web.de'].each { |url| valid?(url) }

Result:

+
+
+
+
-
false
false
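
To see why the malformed `http:` entries still pass this check, it may help to look at what `URI.parse` returns for them. Here is a small sketch (the helper name `uri_parts` is made up for illustration). The string parses as an opaque `URI::HTTP` with no host and no path, which would be consistent with Mechanize later calling `length` on `nil`. That link is an inference from the stack trace, not something confirmed against the Mechanize source here.

```ruby
require 'uri'

# Hypothetical helper: collects the pieces of a parsed URI that matter here.
def uri_parts(url)
  uri = URI.parse(url)
  { klass: uri.class, host: uri.host, path: uri.path, opaque: uri.opaque }
end

# Malformed input: parsed, but host and path are nil.
p uri_parts('http:%5C%5Cwww.scouts.org.uk')
# Well-formed input: host is set.
p uri_parts('http://www.scouts.org.uk')
```

So a class check alone is not enough; checking that `uri.host` is present rules out these entries as well.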

Oleksandr Holubenko
  • For the URL the OP provided, this returns `true`, but the URL is not valid. – JonB Feb 10 '15 at 12:54
  • Yes, it returns true for some URLs that will work in a browser, but Mechanize still won't load them. – JonB Feb 10 '15 at 13:03
  • It seems the OP really wants to catch the exception when he has a bad URL. The real solution is to combine techniques to prevent a bad URL from getting passed to Mechanize in the first place. A combination of this technique along with 'scrubbing' the backslashes with something like `decoded_url = URI.unescape(url).gsub('\\','/')` should be employed. – JonB Feb 10 '15 at 13:09
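
A rough sketch of that combination, using only the standard `uri` library. The helper name `clean_http_url` is hypothetical. Replacing `%5C` directly instead of calling `URI.unescape` is a substitution made here, because `URI.unescape` was removed in Ruby 3.0. The sketch assumes that a backslash, literal or percent-encoded, was meant to be a forward slash:

```ruby
require 'uri'

# Hypothetical helper: returns a normalized URL string, or nil when the input
# cannot be read as an absolute http(s) URL with a host.
def clean_http_url(raw)
  # Assumption: literal and percent-encoded backslashes were meant as slashes.
  candidate = raw.strip.gsub(/%5c/i, '/').tr('\\', '/')
  uri = URI.parse(candidate)
  return nil unless uri.is_a?(URI::HTTP) && uri.host && !uri.host.empty?

  uri.to_s
rescue URI::InvalidURIError
  nil
end

urls = ['http://something.com', 'http:%5C%5Cwww.scouts.org.uk',
        'htt: asldkj.com', 'ittp://something.com/', 'http:web.de']
good = urls.map { |u| clean_http_url(u) }.compact
puts good
```

Only the strings that survive would then be passed to `agent.head`, still inside a `rescue` for timeouts and HTTP errors.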