Some spaces are removed when scraping a string using Nokogiri

Question

I am very new to Ruby and am practicing site scraping with Nokogiri. I would like to scrape the details of 'deals' from a random group-buying site. I can scrape the site successfully, but I am having problems parsing the output. I tried the solutions suggested here and also tried a regex. So far, I have failed.

I am trying to parse the following title/description from this page:

Frosty Frappes starting at P100 for P200 worth at Café Tavolo – up to 55% off

This is what I got:

FrostyFrappes starting at P100 for P200 worth at Caf Tavolo  up to 55% off

Here are the snippets in my code:

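require 'open-uri' # open-uri (not uri) is what allows opening a URL
require 'nokogiri'

# url is assumed to hold the address of the deal page
html = URI.open(url)
doc = Nokogiri::HTML(html.read)
doc.encoding = "utf-8"
title = doc.at_xpath('/html/body/div/div[9]/div[2]/div/div/div/h1/a')
# this regex deletes every character that is not in the list
puts title.content.to_s.strip.gsub(/[^0-9a-z%&!\n\/(). ]/i, '')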

Please do tell me if I missed something out. Thank you.

score 2 · Accepted Answer · answered Mar 13 '12 at 07:01

Your xpath is too rigid and your regex is removing chars you want to keep. Here's how I would do it:

title = doc.at('div#contentDealTitle h1 a').text.strip.gsub(/\s+/,' ')

That says: take the text from the first a tag nested inside an h1 inside div#contentDealTitle, strip it (remove leading and trailing whitespace), and replace every sequence of one or more whitespace characters with a single space.
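To see the difference between the two regexes without touching the site, here is a minimal sketch on a plain string. The raw text is made up: the tab between "Frosty" and "Frappes" and the surrounding whitespace are assumptions about what the page might contain, not something confirmed in the question.

```ruby
# Hypothetical raw node text; the tab and the padding are assumptions.
raw = "\n  Frosty\tFrappes starting at P100 for P200 worth at Caf\u00E9 Tavolo \u2013 up to 55% off \n"

# Allow-list regex from the question: deletes every character that is
# not listed, which includes the tab, the accented e and the en dash.
allow_listed = raw.strip.gsub(/[^0-9a-z%&!\n\/(). ]/i, '')

# Regex from the answer: only collapses runs of whitespace into one space.
normalized = raw.strip.gsub(/\s+/, ' ')

puts allow_listed
puts normalized
```

Under those assumptions the first line printed has the same gaps as the output in the question, while the second keeps the accent and the dash.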

pguardiario


Thanks a lot @pguardiario! I kept the original xpath I've been using and replaced my regex with yours. I used `/\s+[^0-9a-z]/i,' '` and it worked! It seems there was something wrong with how I constructed it in the first place. – nmenego Mar 13 '12 at 07:43
You're welcome. Consider changing your xpath so that minor layout changes won't break your script. Also I'm not sure why the [^0-9a-z] is necessary but consider the much simpler \W which is shorthand for any non-word char. – pguardiario Mar 13 '12 at 07:53
Many thanks (again)! I'll keep all those in mind. I simply used Firefox's Firebug to extract the XPaths (right click in Firebug inspector, copy xpath). Is there a smarter way to do so? – nmenego Mar 13 '12 at 08:14
Yes, when you're in the inspector panel just scan the hierarchy and try to construct a path that's based on css selectors rather than an absolute path. It's usually obvious right away. – pguardiario Mar 13 '12 at 08:40
Don't use Firebug's XPath! It leads to brittle scrapers. It treats all page elements as equally important and required, even ones you don't care about. @pguardiario's CSS is *much* better. You can also use the XPath `//div[@id='contentDealTitle']/h1/a` – Mark Thomas Mar 13 '12 at 09:04
