Web Scraping with Nokogiri and Mechanize

Question

I am parsing prada.com and would like to scrape data in the div class "nextItem" and get its name and price. Here is my code:

require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri'
agent = Mechanize.new
page = agent.get('http://www.prada.com/en/US/e-store/department/woman/handbags.html?cmp=from_home')
fp = File.new('prada_prices','w')
html_doc = Nokogiri::HTML(page)
page = html_doc.xpath("//ol[@class='nextItem']")
page.each do {|i| fp.write(i.text + "\n")}
end

I get an error and no output. Here is what I think I am doing: instantiating a Mechanize object and calling it agent; creating a page variable and assigning it the URL provided; creating a variable that is a Nokogiri object with the Mechanize URL passed in; searching the URL for all class references titled nextItem; and printing all the data contained there.

Can someone show me where I might have gone wrong?

Prada seems to hide the name somehow... Do you know where in the HTML the name is stored? – davegson, Jan 27 '15 at 15:13
They also seem to load a lot of stuff via JS, so it may be very hard to scrape. I just tested my attempt, and it does not work... – davegson, Jan 27 '15 at 15:16

score 2 · Answer 1 · edited May 23 '17 at 12:16

Since Prada's website dynamically loads its content via JavaScript, it will be hard to scrape its content. See "Scraping dynamic content in a website" for more information.

Generally speaking, with Mechanize, after you get a page:

page = agent.get(page_url)

you can easily search items with CSS selectors and scrape for data:

next_items = page.search(".fooClass")

next_items.each do |item|
  price = item.search(".fooPrice").text
end

Then simply handle the strings or generate hashes as you desire.

answered Jan 27 '15 at 15:12

davegson

I tried your approach and still get no output for price. I think there is a problem with how I am understanding the DOM? TheChamp the HTML is stored in a deeply nested div class called nextItem, and each item contains an id identifying the item and text regarding the price:

require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.prada.com/en/US/e-store/department/woman/handbags.html?cmp=from_home')
item = page.search(".nextItem")
item.each do |item|
  price = item.search(".itemPrice").text
  puts price
end

– OrelFligelman Jan 27 '15 at 15:31
If you look at the source code of the site, you will see the `nextItem` divs are not there. They are loaded dynamically via JS. Check [this question](http://stackoverflow.com/questions/8323728/scraping-dynamic-content-in-a-website), as linked in the answer. – davegson Jan 27 '15 at 15:34

hahcho · Answer 2 · 2015-01-30T12:17:32.543

0

Here is what is wrong:

Check the block syntax again: use either {} or do/end, but not both at the same time.
Mechanize#get returns a Mechanize::Page, which acts like a Nokogiri document; at the least, it has search, xpath, and css. Use those instead of trying to coerce the page into a Nokogiri::HTML object.
There is no need to require 'open-uri' and require 'nokogiri' when you are not using them directly.
Finally, consider reviewing Ruby's basics before continuing with web scraping.
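
The block-syntax point can be illustrated with a short sketch. The price strings are made-up sample values; the only thing being shown is the two valid ways to delimit a block.

```ruby
# Hypothetical sample data standing in for scraped text.
prices = ['$100', '$250']

# Braces delimit the block: typically used for one-liners.
prices.each { |price| puts price }

# do/end delimits the block: typically used for multi-line bodies.
prices.each do |price|
  puts price
end

# Writing `prices.each do { |price| puts price } end` puts both delimiters
# on one block and does not parse, which is the error in the question.

# Braces also read well when the block's return value is used.
lines = prices.map { |price| "#{price}\n" }
```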

Here is the code with fixes:

require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.prada.com/en/US/e-store/department/woman/handbags.html?cmp=from_home')
fp = File.new('prada_prices','w')
page.search("//ol[@class='nextItem']").each do |i|
  fp.write(i.text + "\n")
end
fp.close
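
One common alternative to the explicit `File.new` and `fp.close` pair is the block form of `File.open`, which closes the file when the block exits, even if an exception is raised inside it. The sketch below assumes a hypothetical helper named `write_lines` and made-up sample strings; it writes to a temporary directory so it can be run without touching the working directory.

```ruby
require 'tmpdir'

# Hypothetical helper: writes each string on its own line.
# File.open with a block closes the file automatically when the block ends.
def write_lines(path, lines)
  File.open(path, 'w') do |fp|
    lines.each { |line| fp.puts(line) }
  end
end

# Sample usage with made-up data in a throwaway directory.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'prada_prices')
  write_lines(path, ['Tote $100', 'Clutch $250'])
  puts File.read(path)
end
```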

answered Jan 27 '15 at 15:10

hahcho

Use the block form of `File.open` instead of `File.new` and assigning to a variable. Also, mixing `do`/`end` and `{`/`}` is fine. Use `do`/`end` for multiple-line `each`-type blocks. Use `{`/`}` for single line and blocks like `map` that return values. – the Tin Man Jan 27 '15 at 19:55
Thanks for the edit @theTinMan, but I edited his code to make it work and gave him some general advice. I tried to change the code as little as possible to keep it understandable. I know the Ruby style guide, but I do not think anything from it will help with the answer. – hahcho Jan 28 '15 at 10:07
It isn't necessary to reuse their code just to preserve their familiarity with it. Poor programming practice being propagated is a disservice. Instead show the fix in the context of how the code *should* be written. The idea isn't to just make their code work, it's to show how they should make it work. It's the difference between giving someone a fish and teaching them how to fish. – the Tin Man Jan 29 '15 at 17:49
You are right about the poor programming practice, but in this code those are just cosmetic changes that do not really change it much. It is better to use `File#open`, but it is not much of a difference. I do not see the point in explaining my decision to use `do/end` instead of `{}`; it just adds noise to the answer. – hahcho Jan 30 '15 at 12:16
Jeez. It's almost like @hahcho is trying to meet the OP where he IS rather than where s/he "should" be. Almost like...some kind of...what's the word?....TEACHER! – Ramy Apr 23 '21 at 10:57
