The problem is variable scope, and it's a very common one.
I'd rewrite the code like this:
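require 'open-uri'
require 'nokogiri'

# URI.open is the open-uri entry point; recent Ruby versions no longer
# accept a URL passed to the bare open call.
doc = Nokogiri::HTML(URI.open('https://www.google.fr/search?q=estimation+immobilier'))

# Collect one link per search-result block.
def parse_results(doc)
  _links = []
  doc.search('.g').each do |element|
    _links << parse_block(element)
  end
  _links
end

# Pull the target URL out of a single result block.
def parse_block(element)
  tempo = element.search('.r').to_s
  tempo.scan(/<a href=\"\/url\?q=(.*)&sa=U/)[0][0]
end

links = parse_results(doc)
puts links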
links could be defined as an instance, class or global variable, but all of those have code smell. You'd be trying to circumvent scoping, which is really your friend when it comes to avoiding wasting space on the variable stack.
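To make the scoping point concrete, here is a minimal sketch. The method names broken_collect and collect are made up for illustration and the URL is a placeholder. It shows that a local variable defined at the top level is not visible inside a def, and that returning the array from the method avoids the problem without reaching for a global.

```ruby
links = [] # a local variable at the top level of the script

# `def` opens a new scope, so the `links` above is not visible in here.
# Hypothetical method, written only to trigger the error.
def broken_collect(url)
  links << url
end

begin
  broken_collect('http://example.com/')
rescue NameError => e
  puts "#{e.class}: the top-level local is out of reach inside the method"
end

# Hypothetical fix: build the array inside the method and return it.
def collect(urls)
  found = []
  urls.each { |url| found << url }
  found
end

links = collect(['http://example.com/'])
puts links
```

The caller decides what to name the result, and nothing outside the method can accidentally change the array while it's being built.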
scan is going to return an array of results, so push its results to _links.
map wasn't the right method for what you're doing; each is more appropriate since you're looping over the results of searching for class="g" in the HTML. Using map, you could write parse_results() like:
def parse_results(doc)
  doc.search('.g').map { |element| parse_block(element) }
end
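As a quick check that the each-plus-accumulator form and the map form are interchangeable, here is a small sketch on plain strings. The shout method is a hypothetical stand-in for parse_block(); any per-element transformation would behave the same way.

```ruby
# Hypothetical stand-in for parse_block: transforms one element.
def shout(word)
  word.upcase
end

# Accumulator style: loop with each and push every result into an array.
def collect_with_each(words)
  results = []
  words.each { |word| results << shout(word) }
  results
end

# map style: the block's return values become the new array directly.
def collect_with_map(words)
  words.map { |word| shout(word) }
end

words = %w[maison appartement]
puts collect_with_each(words) == collect_with_map(words)
```

Both return a new array of the same length as the input; map just removes the bookkeeping.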
parse_block() isn't written correctly, or at least it can be written a lot more idiomatically for Nokogiri. If you ever have to resort to using regex when using an XML or HTML parser, you know there's something that should be reconsidered. Looking at what's happening, here's what the code sees as it dives through parse_results() and parse_block():
doc.search('.g').first.search('.r').to_s
# => "<h3 class=\"r\"><a href=\"/url?q=http://www.meilleursagents.com/estimation-immobiliere/&sa=U&ei=z33SU5a-LMPaoASe94GYBw&ved=0CBQQFjAA&usg=AFQjCNH_Nfe9VGSqO8AU_mc3TL_ZsyNRFw\"><b>Estimation immobiliere</b> gratuite - MeilleursAgents.com</a></h3>\n"
You're trying to grab a parameter from the links, so use Nokogiri to do that cleanly, instead of trying to use a pattern and scan. I opened the page and parsed it as you did, then tried this:
doc.search('.g h3.r a').map(&:to_html)
# => ["<a href=\"/url?q=http://www.meilleursagents.com/estimation-immobiliere/&sa=U&ei=gYDSU-D3AsHXiwLj1oGwBg&ved=0CBQQFjAA&usg=AFQjCNG59EuN3nByaD1NEg7t3garmotJTg\"><b>Estimation immobiliere</b> gratuite - MeilleursAgents.com</a>",
# "<a href=\"/url?q=http://www.drimki.fr/estimation-immobiliere-gratuite&sa=U&ei=gYDSU-D3AsHXiwLj1oGwBg&ved=0CBoQFjAB&usg=AFQjCNH4HpJ9WRzLhpSGVfRwcogxuJPZDA\"><b>Estimation immobili\u00E8re</b> gratuite (maison, appartement <b>...</b> - Drimki</a>",
# "<a href=\"/url?q=http://www.pap.fr/evaluation/estimation-immobiliere&sa=U&ei=gYDSU-D3AsHXiwLj1oGwBg&ved=0CCAQFjAC&usg=AFQjCNGcAeTDeib6hBVcD931CGyDRoPx6A\"><b>Estimation immobili\u00E8re</b> avec Particulier \u00E0 Particulier | De <b>...</b> - P.a.p</a>",
# "<a href=\"/url?q=http://www.lacoteimmo.com/&sa=U&ei=gYDSU-D3AsHXiwLj1oGwBg&ved=0CCYQFjAD&usg=AFQjCNEjsDh8wYnj9XQBuuotGWOtJKrBYQ\">LaCoteImmo - <b>Estimation immobili\u00E8re</b> et Prix immobilier</a>",
# "<a href=\"/url?q=http://www.efficity.com/estimation-immobiliere/&sa=U&ei=gYDSU-D3AsHXiwLj1oGwBg&ved=0CCwQFjAE&usg=AFQjCNH_2RgWZ4VMeP29eKt1MAZTySOSZA\"><b>Estimation immobili\u00E8re</b> - Efficity</a>",
# "<a href=\"/url?q=http://www.paruvendu.fr/pa/prix-immobilier-prix-m2-estimation-gratuite-bien-immobilier/&sa=U&ei=gYDSU-D3AsHXiwLj1oGwBg&ved=0CDIQFjAF&usg=AFQjCNFz0VTEKcTJrzIgT4nwOMnm85vX5g\"><b>Estimation</b> gratuite d'un bien <b>immobilier</b> - ParuVendu</a>",
# "<a href=\"/url?q=http://www.meilleurtaux.com/services-immo/vendre-un-bien-immobilier/estimation-immobiliere.html&sa=U&ei=gYDSU-D3AsHXiwLj1oGwBg&ved=0CDgQFjAG&usg=AFQjCNGjSqsVQe0GB0uzTvewfR-FtfGUww\"><b>Estimer</b> la valeur de son bien <b>immobilier</b>- Meilleurtaux.com</a>",
# "<a href=\"/url?q=http://prix-immobilier.latribune.fr/estimation-immobiliere/&sa=U&ei=gYDSU-D3AsHXiwLj1oGwBg&ved=0CD4QFjAH&usg=AFQjCNFelUOIWlvj09l5RIG0KF8CiY9kLw\"><b>Estimation immobiliere</b> gratuite avec MeilleursAgents.com</a>",
# "<a href=\"/url?q=http://www.refleximmo.com/estimation-immobiliere-gratuite-appartement&sa=U&ei=gYDSU-D3AsHXiwLj1oGwBg&ved=0CEQQFjAI&usg=AFQjCNErmk1sUrmrAPU188KyyfYG_O0cMw\"><b>Estimation</b> gratuite de votre appartement en ligne - Refleximmo</a>",
# "<a href=\"/url?q=http://www.capital.fr/immobilier/estimation-immobiliere&sa=U&ei=gYDSU-D3AsHXiwLj1oGwBg&ved=0CEoQFjAJ&usg=AFQjCNEVRbp_kOwOmT86TWHEvFbjm6W3nA\"><b>Estimation Immobili\u00E8re</b> - Immobilier - Capital.fr</a>"]
A slightly more specific CSS selector narrowed down the returned results significantly.
A bit of tweaking results in:
doc.search('.g h3.r a').map{ |a| a['href'] }
# => ["/url?q=http://www.meilleursagents.com/estimation-immobiliere/&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CBQQFjAA&usg=AFQjCNFNCH0iR3pr0fQX6wSjcj1_s3CsRg",
# "/url?q=http://www.drimki.fr/estimation-immobiliere-gratuite&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CBoQFjAB&usg=AFQjCNGUbFcsWWQY-bc8Vu-d-GD9YFcbVg",
# "/url?q=http://www.pap.fr/evaluation/estimation-immobiliere&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CCAQFjAC&usg=AFQjCNGztbZlDWWGS4kNPHzR06ayRdAQKg",
# "/url?q=http://www.lacoteimmo.com/&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CCYQFjAD&usg=AFQjCNEZK_JVduJKJvFpDDXu4yIsTXGMFg",
# "/url?q=http://www.efficity.com/estimation-immobiliere/&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CCwQFjAE&usg=AFQjCNHHc-GuJoHXTx3N3_Ex_fz1KUp1cg",
# "/url?q=http://www.paruvendu.fr/pa/prix-immobilier-prix-m2-estimation-gratuite-bien-immobilier/&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CDIQFjAF&usg=AFQjCNGmwWmo19asoooWz6Lbh0YMOC8wlg",
# "/url?q=http://www.meilleurtaux.com/services-immo/vendre-un-bien-immobilier/estimation-immobiliere.html&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CDgQFjAG&usg=AFQjCNFJ_fAsPBmZvVU60jRLh-yKzvuEiw",
# "/url?q=http://prix-immobilier.latribune.fr/estimation-immobiliere/&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CD4QFjAH&usg=AFQjCNHHaVmKGg4jiaT-6AwZAfby2-H4sg",
# "/url?q=http://www.refleximmo.com/estimation-immobiliere-gratuite-appartement&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CEQQFjAI&usg=AFQjCNGiBMMYrK-EO9wqIh82eW2uFT0n8w",
# "/url?q=http://www.capital.fr/immobilier/estimation-immobiliere&sa=U&ei=OoHSU7KaEszwoAS__YDoCA&ved=0CEoQFjAJ&usg=AFQjCNEf8FQuKCYBMXBB5FA2dJ2gor4Wmg"]
At this point it's obvious we're looking at an array of relative URLs that carry the real target in their q query parameter, which can be handled using Ruby's built-in URI class:
require 'uri'
doc.search('.g h3.r a').map { |a|
  uri = URI.parse(a['href'])
  # Turn "q=...&sa=U&..." into a hash keyed by parameter name.
  query_hash = Hash[URI.decode_www_form(uri.query)]
  query_hash['q']
}
# => [
#      "http://www.meilleursagents.com/estimation-immobiliere/",
#      "http://www.drimki.fr/estimation-immobiliere-gratuite",
#      "http://www.pap.fr/evaluation/estimation-immobiliere",
#      ...
That should give you enough information to rewrite your code a bit more robustly. Regular expressions are not good tools for parsing HTML, and it's better to use well-tested, pre-built wheels whenever possible, like URI.
I say this approach is more robust because of this piece of code:
links << tempo.scan(/<a href=\"\/url\?q=(.*)&sa=U/)[0][0]
That search is very prone to breaking. URL formats can change quickly, especially if a site, such as Google, suspects that people are scraping its pages and doesn't want that to happen. They could easily change the order of the parameters, or change the way the link is written in the page, since HTML allows very liberal formatting of the source and a browser will still render the same view to the user. Imagine the fun you'd have if Google chose to render a link like:
<a
href="/url?amp;sa=U&q=...
The regex would break, causing your code to break, whereas using URI and Nokogiri to drill down would continue to work.
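A small side-by-side sketch of that failure mode, using only the standard library. The two hrefs are shortened, made-up variations on the links above, and target_by_regex and target_by_uri are illustrative names, not part of your code.

```ruby
require 'uri'

# Pattern-based extraction, using the same regex as the original code.
def target_by_regex(html)
  match = html.match(/<a href="\/url\?q=(.*)&sa=U/)
  match && match[1]
end

# Parser-based extraction: read the q parameter wherever it appears.
def target_by_uri(href)
  query = URI.parse(href).query
  return nil if query.nil?
  URI.decode_www_form(query).to_h['q']
end

# Hypothetical hrefs: same parameters, different order.
ORIGINAL_HREF  = '/url?q=http://www.pap.fr/evaluation/estimation-immobiliere&sa=U&ved=0CCAQFjAC'
REORDERED_HREF = '/url?sa=U&ved=0CCAQFjAC&q=http://www.pap.fr/evaluation/estimation-immobiliere'

[ORIGINAL_HREF, REORDERED_HREF].each do |href|
  html = %(<a href="#{href}">P.a.p</a>)
  puts "regex: #{target_by_regex(html).inspect}"
  puts "uri:   #{target_by_uri(href).inspect}"
end
```

With the reordered href the regex finds nothing and returns nil, while the URI version still returns the target, because it doesn't care where q sits in the query string.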