I am developing a crawler using the anemone gem (Ruby 1.8.7 and Rails 3.1.1). How can I skip URLs with extensions such as pdf, doc, zip, etc., so that they are not crawled or downloaded?

Bhushan Lodha

1 Answer

ext = %w(flv swf png jpg gif asx zip rar tar 7z gz jar js css dtd xsd ico raw mp3 mp4 wav wmv ape aac ac3 wma aiff mpg mpeg avi mov ogg mkv mka asx asf mp2 m1v m3u f4v pdf doc xls ppt pps bin exe rss xml)

Anemone.crawl(url) do |anemone|

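    # Group the alternation so that \. and $ apply to every extension,
    # not only to the first and last entries in the list.
    anemone.skip_links_like /\.(#{ext.join('|')})$/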

    ...

end
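
To see why the grouping parentheses matter, here is a small sketch that only uses plain Ruby regexes and does not need anemone at all. The shortened extension list, the constant names and the `skipped?` helper are made up for illustration. Without the group, `|` binds loosest, so the pattern reads as `\.pdf`, or `doc` anywhere, or `zip` at the end.

```ruby
# Hypothetical, shortened extension list used only for this illustration.
EXTENSIONS = %w(pdf doc zip)

# Ungrouped: equivalent to /\.pdf|doc|zip$/, so "doc" matches anywhere in the URL.
UNGROUPED = /\.#{EXTENSIONS.join('|')}$/

# Grouped: the escaped dot and the end anchor apply to every extension.
GROUPED = /\.(#{EXTENSIONS.join('|')})$/

# Hypothetical helper: true when the pattern matches the URL.
def skipped?(pattern, url)
  !(url =~ pattern).nil?
end

html_page = "http://example.org/docs/index.html"
pdf_file  = "http://example.org/files/report.pdf"

puts skipped?(UNGROUPED, html_page) # the HTML page is wrongly caught by "doc"
puts skipped?(GROUPED, html_page)   # the HTML page is left alone
puts skipped?(GROUPED, pdf_file)    # the PDF is skipped as intended
```

Note that `$` in Ruby matches at the end of a line; for single-line URLs this behaves like an end-of-string anchor.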
sunnyrjuneja
  • You should anchor your regexp to the end, otherwise a URL like `http://example.org/how-to-generate-pdf.html` would be skipped. Also the dot should be escaped. How about `ext = %w(pdf doc etc ...)` and `anemone.skip_links_like /\.(#{ext.join('|')})$/` – Fabio Dec 01 '11 at 22:14
  • Thanks Fabio, I'll make those changes now. – sunnyrjuneja Dec 01 '11 at 22:19
  • Fabio, if I may ask you a question. If you wanted to create a regex to skip a URL ending with digits say like http://www.somewebsite.com/this/a/test?page=21095925, how would you change the pattern? – sunnyrjuneja Dec 01 '11 at 22:49
  • It depends on your actual needs. For a URL that ends with digits you can use `/\d+$/`, but this is pretty general and could match a lot of things. You can restrict it by requiring a `?` in the input, as in `/\?.*\d+$/`; this is less general, and you can keep refining it to fit your full requirements. You can find all modifiers and patterns [here](http://www.tutorialspoint.com/ruby/ruby_regular_expressions.htm) and a good tester [here](http://rubular.com/) – Fabio Dec 01 '11 at 23:07
  • I actually tried that regex but it didn't work. Here is the exact URL I am trying to avoid: http://HIDDEN.com/about_us/index.php?page=press_and_news&subpage=20060117 – sunnyrjuneja Dec 01 '11 at 23:09
  • I have just posted my own question: http://stackoverflow.com/questions/8349599/skipping-web-pages-that-in-a-series-of-digits-using-regex-in-ruby-anemone-web-cr – sunnyrjuneja Dec 01 '11 at 23:11
  • Thanks guys. @Sunny - add `:skip_query_strings => true` to your anemone options and it will solve your problem. – Bhushan Lodha Dec 02 '11 at 00:05
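
A possible explanation for the pattern "not working" in the comments above, offered as an assumption rather than something confirmed in this thread: if the skip pattern is tested against only the path part of each link, the query string never reaches the regex. The sketch below uses just Ruby's standard `uri` library to show the difference; the constant and helper names are hypothetical.

```ruby
require 'uri'

# Pattern suggested in the comments: a "?" followed by anything, ending in digits.
QUERY_ENDING_IN_DIGITS = /\?.*\d+$/

# Hypothetical helper: match against the full URL string.
def matches_full_url?(url)
  !(url =~ QUERY_ENDING_IN_DIGITS).nil?
end

# Hypothetical helper: match against the path only (query string dropped).
# Assumption: this mimics a crawler that checks skip patterns on the path.
def matches_path_only?(url)
  !(URI.parse(url).path =~ QUERY_ENDING_IN_DIGITS).nil?
end

url = "http://HIDDEN.com/about_us/index.php?page=press_and_news&subpage=20060117"

puts URI.parse(url).path      # the path part, without the query string
puts matches_full_url?(url)   # the regex itself does match the full URL
puts matches_path_only?(url)  # but it cannot match once the query is gone
```

If that is what is happening, a regex on the query string cannot help, which would be consistent with the `:skip_query_strings => true` suggestion in the last comment.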