I am very new to Ruby and am practicing site scraping with Nokogiri. I would like to scrape the details of 'deals' from a random group-buying site. I have been able to scrape a site successfully, but I am having problems parsing the output. I tried the solutions suggested here and also tried using a regex. So far, I have failed.

I am trying to parse the following title/description from this page:

Frosty Frappes starting at P100 for P200 worth at Café Tavolo – up to 55% off

This is what I got:

FrostyFrappes starting at P100 for P200 worth at Caf Tavolo  up to 55% off

Here are the snippets in my code:

require 'open-uri'
require 'nokogiri'

html = open(url)
doc = Nokogiri::HTML(html.read)
doc.encoding = "utf-8"
title = doc.at_xpath('/html/body/div/div[9]/div[2]/div/div/div/h1/a')
puts title.content.to_s.strip.gsub(/[^0-9a-z%&!\n\/(). ]/i, '')

Please tell me if I have missed something. Thank you.

nmenego

1 Answer

Your XPath is too rigid and your regex is removing characters you want to keep. Here's how I would do it:

title = doc.at('div#contentDealTitle h1 a').text.strip.gsub(/\s+/,' ')

That says: take the text from the first a element nested inside an h1 inside div#contentDealTitle, strip it (remove leading and trailing whitespace), and replace every run of one or more whitespace characters with a single space.
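To see why the whitespace-only substitution keeps characters such as "é" and the en dash while the whitelist regex from the question drops them, here is a small sketch in plain Ruby. The `raw` string is a made-up stand-in for the scraped text, with extra whitespace added for illustration; it is not the actual page content.

```ruby
# Made-up stand-in for the scraped title, with stray whitespace added.
raw = "  Frosty  Frappes\n\tstarting at P100 for P200 worth at Café Tavolo – up to 55% off \n"

# The whitelist regex from the question deletes every character it does
# not list, which includes "é" and the en dash.
whitelisted = raw.strip.gsub(/[^0-9a-z%&!\n\/(). ]/i, '')

# The substitution from the answer only collapses runs of whitespace,
# so every other character survives.
normalized = raw.strip.gsub(/\s+/, ' ')

puts whitelisted
puts normalized
```

The second line printed should keep "Café" and the dash intact, since nothing but whitespace is ever replaced.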

pguardiario
  • Thanks a lot @pguardiario! I retained the original XPath I've been using and changed the regex using yours. I used `/\s+[^0-9a-z]/i,' '` and it worked! It seems like there was something wrong with how I constructed it in the first place. – nmenego Mar 13 '12 at 07:43
  • You're welcome. Consider changing your xpath so that minor layout changes won't break your script. Also I'm not sure why the [^0-9a-z] is necessary but consider the much simpler \W which is shorthand for any non-word char. – pguardiario Mar 13 '12 at 07:53
  • Many thanks (again)! I'll keep all those in mind. I simply used Firefox's Firebug to extract the XPaths (right click in Firebug inspector, copy XPath). Is there a smarter way to do so? – nmenego Mar 13 '12 at 08:14
  • Yes, when you're in the inspector panel just scan the hierarchy and try to construct a path that's based on CSS selectors rather than an absolute path. It's usually obvious right away. – pguardiario Mar 13 '12 at 08:40
  • Don't use Firebug's XPath! It leads to brittle scrapers. It treats all page elements as equally important and required, even ones you don't care about. @pguardiario's CSS is *much* better. You can also use the XPath `//div[@id='contentDealTitle']/h1/a` – Mark Thomas Mar 13 '12 at 09:04