I have a small crawler/screen-scraping script that worked half a year ago but no longer does. I checked the HTML and CSS values that the regular expression targets against the page source, and they are still the same, so as far as I can tell it should still work. Any guesses?

require "open-uri"

# output file
f = open 'results.csv', 'w+'

# output string
results = ""

begin

  # crawl first 20 pages
  for i in (1..20)
    open("http://www.example-#{i}.com") {|url|

      # check each line using regular expression
      url.each_line { |line|
        if line =~ /class=\"L1g\" onclick=\"s_objectID=\'foobar\'\">([^<]+)<\/a><\/h3><\/li>/
          # if regular expression matches then add to results
          results += $1 + "\n"
        end
      }
    }
  end
ensure
  # write to and close file
  f.print results
  f.close
end
hebe
  • When you say it doesn't work, what happens? – mikej Oct 18 '10 at 06:57
  • +1 for breaking Stack Overflow's syntax highlighter. What exception message does it produce? Also, have you tried any of the debugging approaches mentioned in [How do I debug ruby scripts?](http://stackoverflow.com/questions/3955688/how-do-i-debug-ruby-scripts)? – Andrew Grimm Oct 18 '10 at 06:57
  • So the page is the same as always, and it has worked in the past. Did you upgrade Ruby? – Sirupsen Oct 18 '10 at 08:48
  • Have you tried turning it off and on again? – Lars Haugseth Oct 18 '10 at 09:45
  • Hey guys, thanks for the comments and sorry for the late answer. The script runs successfully, but it produces an empty CSV file. I have not tried debugging yet; since it didn't produce any error messages, I thought I could skip that. I visited the website and cross-checked the URL concerned, and it is still the same. @Lars what did you mean by turning it off and on? BTW: is there a badge for breaking the syntax highlighter? ;) – hebe Oct 20 '10 at 11:19

1 Answer

The target website appears to have changed the structure of its page, so your regex no longer matches.

This is a good example of why you should not scrape pages by matching content with a regex. Try reworking your script to use an HTML parser such as Nokogiri. This will not necessarily stop your script from breaking, but it should at least let it survive minor changes to the markup.

The reason it is not working can be seen in this Rubular link
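The same kind of check can be reproduced locally without Rubular. The sketch below runs the pattern from the question against two hypothetical lines: `OLD_LINE` has the shape the pattern expects, and `NEW_LINE` differs only by one extra attribute. Both lines and the `captured_title` helper are invented for illustration; what actually changed on the real page is not known here, so paste a line from the current page source in their place.

```ruby
# The pattern from the question, unchanged.
PATTERN = /class=\"L1g\" onclick=\"s_objectID=\'foobar\'\">([^<]+)<\/a><\/h3><\/li>/

# Hypothetical lines: the first has the shape the pattern expects, the second
# has one extra attribute between class and onclick.
OLD_LINE = %q(<li><h3><a class="L1g" onclick="s_objectID='foobar'">Some title</a></h3></li>)
NEW_LINE = %q(<li><h3><a class="L1g" rel="nofollow" onclick="s_objectID='foobar'">Some title</a></h3></li>)

# Returns the captured link text, or nil when the pattern does not match.
def captured_title(line)
  match = PATTERN.match(line)
  match && match[1]
end

p captured_title(OLD_LINE) # prints "Some title"
p captured_title(NEW_LINE) # prints nil
```

Because the script only appends to `results` when the pattern matches, a page where every line behaves like `NEW_LINE` would produce an empty CSV file without raising any error.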

Steve Weet
  • Obligatory HTML and regular expressions link: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Andrew Grimm Oct 18 '10 at 22:26