
I want to scrape links from a Google search query.

I can't save the results in an array (links):

error : test.rb:17:in `parse_result': undefined local variable or
method `links' for main:Object (NameError)

This is my code:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open('https://www.google.fr/search?q=estimation+immobilier'))

links = []

def parse_results(doc)
    doc.search('.g').map do |element|
      parse_block(element)
    end
end


def parse_block(element)
    tempo = element.search('.r').to_s
    links << tempo.scan(/<a href=\"\/url\?q=(.*)&amp;sa=U/)[0][0]
end

parse_results(doc)

puts links
the Tin Man
Gilbert Val

2 Answers


The problem is variable scope, and is very common.

I'd rewrite the code like this:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(URI.open('https://www.google.fr/search?q=estimation+immobilier'))

def parse_results(doc)
  _links = []
  doc.search('.g').each do |element|
    _links << parse_block(element)
  end
  _links
end

def parse_block(element)
  tempo = element.search('.r').to_s
  tempo.scan(/<a href=\"\/url\?q=(.*)&amp;sa=U/)[0][0]
end

links = parse_results(doc)

puts links

links could be defined as an instance, class, or global variable, but each of those is a code smell. You'd be circumventing scoping, which is really your friend: it keeps a variable visible only where it's needed and avoids cluttering the wider program with state.
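To make the scoping point concrete, here is a minimal sketch. The method names `broken_append` and `collect_links` and the sample strings are made up for illustration and are not part of the question's code. It shows that a `def` body cannot see a top-level local, and that passing data in and returning the result avoids the problem.

```ruby
links = [] # top-level local variable

# `def` opens a fresh scope, so the top-level `links` is not visible here.
def broken_append(value)
  links << value
end

begin
  broken_append('a')
rescue NameError => e
  error_name = e.name # the identifier Ruby could not resolve
end

# Hypothetical alternative: take the input as an argument and return a new array.
def collect_links(hrefs)
  hrefs.map { |href| href.strip }
end

links = collect_links([' /url?q=a ', '/url?q=b'])

puts error_name.inspect
puts links.inspect
```

The caller decides what to do with the returned array, so no variable has to be shared between scopes.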

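scan returns an array of matches, each of which is itself an array of captures, so parse_block() returns the first capture of the first match and parse_results() pushes that onto _links.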

map wasn't the right method for what you're doing; each is more appropriate since you're looping over the results of searching for class="g" in the HTML. Using map, you could write parse_results() like:

def parse_results(doc)
  doc.search('.g').map { |element| parse_block(element) }
end
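The difference between the two iterators can be seen in a small standalone sketch. The `words` array is an arbitrary stand-in for the parsed elements. `each` returns its receiver, so results have to be accumulated by hand, while `map` builds the result array from the block's return values.

```ruby
words = %w[alpha beta gamma]

# `each` returns the receiver; collecting results needs an explicit accumulator.
lengths_with_each = []
each_return = words.each { |word| lengths_with_each << word.length }

# `map` returns a new array made of each block's return value.
lengths_with_map = words.map { |word| word.length }

puts each_return.equal?(words) # true: `each` handed back the same array
puts lengths_with_each.inspect # [5, 4, 5]
puts lengths_with_map.inspect  # [5, 4, 5]
```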

parse_block() isn't written correctly, or at least it can be written a lot more idiomatically for Nokogiri. If you ever have to resort to using regex when using an XML or HTML parser, you know there's something that should be reconsidered. Looking at what's happening, here's what the code sees as it dives through parse_results() and parse_block():

doc.search('.g').first.search('.r').to_s
# => "<h3 class=\"r\"><a href=\"/url?q=http://www.meilleursagents.com/estimation-immobiliere/&amp;sa=U&amp;ei=z33SU5a-LMPaoASe94GYBw&amp;ved=0CBQQFjAA&amp;usg=AFQjCNH_Nfe9VGSqO8AU_mc3TL_ZsyNRFw\"><b>Estimation immobiliere</b> gratuite - MeilleursAgents.com</a></h3>\n"

You're trying to grab a parameter from the links, so use Nokogiri to do that cleanly, instead of trying to use a pattern and scan. I opened the page and parsed it as you did, then tried this:

doc.search('.g h3.r a').map(&:to_html)
# => ["<a href=\"/url?q=http://www.meilleursagents.com/estimation-immobiliere/&amp;sa=U&amp;ei=gYDSU-D3AsHXiwLj1oGwBg&amp;ved=0CBQQFjAA&amp;usg=AFQjCNG59EuN3nByaD1NEg7t3garmotJTg\"><b>Estimation immobiliere</b> gratuite - MeilleursAgents.com</a>",
#     "<a href=\"/url?q=http://www.drimki.fr/estimation-immobiliere-gratuite&amp;sa=U&amp;ei=gYDSU-D3AsHXiwLj1oGwBg&amp;ved=0CBoQFjAB&amp;usg=AFQjCNH4HpJ9WRzLhpSGVfRwcogxuJPZDA\"><b>Estimation immobili\u00E8re</b> gratuite (maison, appartement <b>...</b> - Drimki</a>",
#     "<a href=\"/url?q=http://www.pap.fr/evaluation/estimation-immobiliere&amp;sa=U&amp;ei=gYDSU-D3AsHXiwLj1oGwBg&amp;ved=0CCAQFjAC&amp;usg=AFQjCNGcAeTDeib6hBVcD931CGyDRoPx6A\"><b>Estimation immobili\u00E8re</b> avec Particulier \u00E0 Particulier | De <b>...</b> - P.a.p</a>",
#     "<a href=\"/url?q=http://www.lacoteimmo.com/&amp;sa=U&amp;ei=gYDSU-D3AsHXiwLj1oGwBg&amp;ved=0CCYQFjAD&amp;usg=AFQjCNEjsDh8wYnj9XQBuuotGWOtJKrBYQ\">LaCoteImmo - <b>Estimation immobili\u00E8re</b> et Prix immobilier</a>",
#     "<a href=\"/url?q=http://www.efficity.com/estimation-immobiliere/&amp;sa=U&amp;ei=gYDSU-D3AsHXiwLj1oGwBg&amp;ved=0CCwQFjAE&amp;usg=AFQjCNH_2RgWZ4VMeP29eKt1MAZTySOSZA\"><b>Estimation immobili\u00E8re</b> - Efficity</a>",
#     "<a href=\"/url?q=http://www.paruvendu.fr/pa/prix-immobilier-prix-m2-estimation-gratuite-bien-immobilier/&amp;sa=U&amp;ei=gYDSU-D3AsHXiwLj1oGwBg&amp;ved=0CDIQFjAF&amp;usg=AFQjCNFz0VTEKcTJrzIgT4nwOMnm85vX5g\"><b>Estimation</b> gratuite d'un bien <b>immobilier</b> - ParuVendu</a>",
#     "<a href=\"/url?q=http://www.meilleurtaux.com/services-immo/vendre-un-bien-immobilier/estimation-immobiliere.html&amp;sa=U&amp;ei=gYDSU-D3AsHXiwLj1oGwBg&amp;ved=0CDgQFjAG&amp;usg=AFQjCNGjSqsVQe0GB0uzTvewfR-FtfGUww\"><b>Estimer</b> la valeur de son bien <b>immobilier</b>- Meilleurtaux.com</a>",
#     "<a href=\"/url?q=http://prix-immobilier.latribune.fr/estimation-immobiliere/&amp;sa=U&amp;ei=gYDSU-D3AsHXiwLj1oGwBg&amp;ved=0CD4QFjAH&amp;usg=AFQjCNFelUOIWlvj09l5RIG0KF8CiY9kLw\"><b>Estimation immobiliere</b> gratuite avec MeilleursAgents.com</a>",
#     "<a href=\"/url?q=http://www.refleximmo.com/estimation-immobiliere-gratuite-appartement&amp;sa=U&amp;ei=gYDSU-D3AsHXiwLj1oGwBg&amp;ved=0CEQQFjAI&amp;usg=AFQjCNErmk1sUrmrAPU188KyyfYG_O0cMw\"><b>Estimation</b> gratuite de votre appartement en ligne - Refleximmo</a>",
#     "<a href=\"/url?q=http://www.capital.fr/immobilier/estimation-immobiliere&amp;sa=U&amp;ei=gYDSU-D3AsHXiwLj1oGwBg&amp;ved=0CEoQFjAJ&amp;usg=AFQjCNEVRbp_kOwOmT86TWHEvFbjm6W3nA\"><b>Estimation Immobili\u00E8re</b> - Immobilier - Capital.fr</a>"]

A slightly more specific CSS selector narrowed down the returned results significantly.

A bit of tweaking results in:

doc.search('.g h3.r a').map{ |a| a['href'] }
# => ["/url?q=http://www.meilleursagents.com/estimation-immobiliere/&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CBQQFjAA&usg=AFQjCNFNCH0iR3pr0fQX6wSjcj1_s3CsRg",
#     "/url?q=http://www.drimki.fr/estimation-immobiliere-gratuite&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CBoQFjAB&usg=AFQjCNGUbFcsWWQY-bc8Vu-d-GD9YFcbVg",
#     "/url?q=http://www.pap.fr/evaluation/estimation-immobiliere&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CCAQFjAC&usg=AFQjCNGztbZlDWWGS4kNPHzR06ayRdAQKg",
#     "/url?q=http://www.lacoteimmo.com/&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CCYQFjAD&usg=AFQjCNEZK_JVduJKJvFpDDXu4yIsTXGMFg",
#     "/url?q=http://www.efficity.com/estimation-immobiliere/&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CCwQFjAE&usg=AFQjCNHHc-GuJoHXTx3N3_Ex_fz1KUp1cg",
#     "/url?q=http://www.paruvendu.fr/pa/prix-immobilier-prix-m2-estimation-gratuite-bien-immobilier/&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CDIQFjAF&usg=AFQjCNGmwWmo19asoooWz6Lbh0YMOC8wlg",
#     "/url?q=http://www.meilleurtaux.com/services-immo/vendre-un-bien-immobilier/estimation-immobiliere.html&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CDgQFjAG&usg=AFQjCNFJ_fAsPBmZvVU60jRLh-yKzvuEiw",
#     "/url?q=http://prix-immobilier.latribune.fr/estimation-immobiliere/&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CD4QFjAH&usg=AFQjCNHHaVmKGg4jiaT-6AwZAfby2-H4sg",
#     "/url?q=http://www.refleximmo.com/estimation-immobiliere-gratuite-appartement&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CEQQFjAI&usg=AFQjCNGiBMMYrK-EO9wqIh82eW2uFT0n8w",
#     "/url?q=http://www.capital.fr/immobilier/estimation-immobiliere&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CEoQFjAJ&usg=AFQjCNEf8FQuKCYBMXBB5FA2dJ2gor4Wmg"]

At this point it's obvious we're looking at an array of relative URLs, each carrying the absolute URL we want in its q parameter. That parameter can be extracted using Ruby's built-in URI class:

require 'uri'
doc.search('.g h3.r a').map do |a|
  uri = URI.parse(a['href'])
  query_hash = Hash[URI.decode_www_form(uri.query)]
  query_hash['q']
end
# => ["http://www.meilleursagents.com/estimation-immobiliere/",
#     "http://www.drimki.fr/estimation-immobiliere-gratuite",
#     "http://www.pap.fr/evaluation/estimation-immobiliere",
#     ...
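The URI step can be tried without fetching anything. The sketch below uses only the standard library; the helper name `target_url` is hypothetical, and the sample href is a shortened version of one of the results shown earlier (the tracking parameters were trimmed for readability).

```ruby
require 'uri'

# Hypothetical helper: pull the `q` parameter out of a Google redirect href.
def target_url(href)
  uri = URI.parse(href)
  # decode_www_form returns [[key, value], ...]; to_h turns the pairs into a Hash.
  # `to_s` guards against hrefs that have no query string at all.
  URI.decode_www_form(uri.query.to_s).to_h['q']
end

sample_href = '/url?q=http://www.lacoteimmo.com/&sa=U&ved=0CCYQFjAD'

puts target_url(sample_href).inspect
puts target_url('/search').inspect
```

An href without a query string yields nil rather than raising, which makes it easy to filter out with compact.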

That should give you enough information to rewrite your code a bit more robustly. Regular expressions are not good tools for parsing HTML, and it's better to use well-tested, pre-built wheels whenever possible, like URI.

I say this approach is more robust because of this piece of code:

links << tempo.scan(/<a href=\"\/url\?q=(.*)&amp;sa=U/)[0][0]

That search is very prone to breaking. URL formats can change quickly, especially if a site, such as Google, suspects that people are scraping its pages and doesn't want that to happen. It could easily change the order of the parameters or the way the link is written in the page, since HTML allows very liberal formatting of the source and a browser will still render the same view to the user. Imagine the fun you'd have if Google chose to render a link like:

<a
 href="/url?amp;sa=U&q=...

The regex would break, causing your code to break, whereas using URI and Nokogiri to drill down would continue to work.
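A rough comparison of the two approaches on a reordered link is sketched below. The assumptions: the pattern is the question's regex adapted to a raw href (plain & instead of &amp;), the `reordered` string is an invented variation of a real result, and `via_regex` and `via_uri` are illustrative names.

```ruby
require 'uri'

# The question's pattern, adapted to match a raw href value.
PATTERN = /\/url\?q=(.*)&sa=U/

def via_regex(href)
  match = href.match(PATTERN)
  match && match[1]
end

def via_uri(href)
  URI.decode_www_form(URI.parse(href).query.to_s).to_h['q']
end

current   = '/url?q=http://www.pap.fr/evaluation/estimation-immobiliere&sa=U&ved=0CCAQFjAC'
# Hypothetical future format: same parameters, different order.
reordered = '/url?sa=U&ved=0CCAQFjAC&q=http://www.pap.fr/evaluation/estimation-immobiliere'

puts via_regex(current).inspect   # the target URL
puts via_regex(reordered).inspect # nil: the pattern no longer matches
puts via_uri(current).inspect     # the target URL
puts via_uri(reordered).inspect   # still the target URL
```

The URI version does not depend on parameter order, so only the regex version needs rewriting when the link layout shifts.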

Community
the Tin Man

It works if you make links an instance variable:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(URI.open('https://www.google.fr/search?q=estimation+immobilier'))

@links = []

def parse_results(doc)
    doc.search('.g').map do |element|
      parse_block(element)
    end
end


def parse_block(element)
    tempo = element.search('.r').to_s
    @links << tempo.scan(/<a href=\"\/url\?q=(.*)&amp;sa=U/)[0][0]
end

parse_results(doc)

puts @links
DiegoSalazar
  • Just to complement this answer: what you have to understand is that method definitions are not closures, which means they do not know about the local variables "around" them when they're created. You were trying to use `links` inside a method where it had not been defined, so it is undefined there and raises an error. – Olivier Lance Jul 25 '14 at 15:43
  • Thx a lot Diego! I understood your explanation about variables inside and outside a method – Gilbert Val Jul 25 '14 at 15:51
  • One small addition to @OlivierLance’s comment: the methods don't actually live in a “global” space; rather, they belong to the anonymous Object instance. That’s why the trick with `@links` works. – Aleksei Matiushkin Jul 25 '14 at 15:51
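The two points made in these comments can be checked with a short sketch. The names `add_with_ivar` and `add_with_lambda` are invented for illustration. A `def` body only reaches the instance variable of the top-level object, while a lambda, being a closure, keeps access to the surrounding local.

```ruby
links = []  # top-level local variable
@links = [] # instance variable of the top-level object (`main` in a script)

# `def` is not a closure: the local `links` is out of reach, `@links` is not.
def add_with_ivar(value)
  @links << value
end

# A lambda is a closure: it captures the local `links` where it is created.
add_with_lambda = ->(value) { links << value }

add_with_ivar('a')
add_with_lambda.call('b')

puts @links.inspect
puts links.inspect
```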