I have built a web scraper that successfully pulls almost everything I need from the web page I'm looking at. The goal is to pull the URL of one particular image for each of the coffees listed at a given URL.

The rake task I have defined to complete the scraping is as follows:

mechanize = Mechanize.new
mechanize.get(url) do |page|
  page.links_with(:href => /products/).each do |link|
    coffee_page = link.click

    bean = Bean.new

    bean.acidity = coffee_page.css('[data-id="acidity"]').text.strip.gsub("acidity ","")
    bean.elevation = coffee_page.css('[data-id="elevation"]').text.strip.gsub("elevation ","")
    bean.roaster_id = "2"
    bean.harvest_season = coffee_page.css('[data-id="harvest"]').text.strip.gsub("harvest ","")
    bean.price = coffee_page.css('.price-wrap').text.gsub("$","")
    bean.roast_profile = coffee_page.css('[data-id="roast"]').text.strip.gsub("roast ","")
    bean.processing_type = coffee_page.css('[data-id="process"]').text.strip.gsub("process ","")
    bean.cultivar = coffee_page.css('[data-id="cultivar"]').text.strip.gsub("cultivar ","")
    bean.flavor_profiles = coffee_page.css('.price-wrap+ p').text.strip
    bean.country_of_origin = coffee_page.css('#pdp-order h1').text.strip
    bean.image_url = coffee_page.css('img data-featured-product-image').attr('src')

    if bean.country_of_origin == "Origin Set" || bean.country_of_origin == "Gift Card (online use only)"
      bean.destroy
    else
      ap bean
    end
  end
end

The information I need is all on the page. I'm looking for the image URL that appears in markup like the snippet below, and I need it for every individual coffee_page linked from the source page. The selector needs to be generic enough to match this image's source but nothing else. I've tried a number of different CSS selectors, but every one returns either nil or a blank string.

<img src="//cdn.shopify.com/s/files/1/2220/0129/products/ceremony-product-gummy-bears_480x480.jpg?v=1551455589" alt="Burundi Kiryama" data-product-featured-image style="display:none">

The coffee_page I'm on is here: https://shop.ceremonycoffee.com/products/burundi-kiryama

  • CSS does have substring matching, so you could use `img[src^='//cdn.shopify.com/s/files/']` (not sure if that is specific enough for your needs; you can scope to a parent if required). See https://stackoverflow.com/questions/8903313/using-regular-expression-in-css and https://www.w3.org/TR/selectors/#attribute-substrings – max pleaner Mar 19 '19 at 18:51
  • Let me know if my answer to your question is sufficient. If so please mark as correct. – NemyaNation Mar 28 '19 at 23:00
  • Please read "[ask]". When asking about a problem with your code we need the minimum data necessary to demonstrate the problem in the question itself. A link forces us to search through a page's HTML which wastes our time and discourages people from trying to help you. We need you to prepare the question so we can help you. In addition, now that the link is broken your question makes little sense. – the Tin Man May 23 '19 at 23:59

1 Answer

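The selector `img data-featured-product-image` has no square brackets, so it is read as a descendant selector: an element named data-featured-product-image nested inside an img. Nothing on the page matches that, which is why you get nil. (Note also that the attribute in your markup is spelled data-product-featured-image.) You need to change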

bean.image_url = coffee_page.css('img data-featured-product-image').attr('src')

to

bean.image_url = coffee_page.css('#mobile-only>img').attr('src')

If you can, always use nearby identifiers to locate the element you want to access.

NemyaNation