Ruby+Anemone Web Crawler: regex to match URLs ending in a series of digits

Question

Suppose I am trying to crawl a website and skip any page whose URL ends like so:

http://HIDDENWEBSITE.com/anonimize/index.php?page=press_and_news&subpage=20060117

I am currently using the Anemone gem in Ruby to build the crawler. I am using the skip_links_like method, but my pattern never seems to match. I am trying to make this as generic as possible, so that it doesn't depend on subpage but only on a trailing = followed by digits (e.g. =2105925).

I have tried /=\d+$/ and /\?.*\d+$/, but neither seems to work.

This is similar to "Skipping web-pages with extension pdf, zip from crawling in Anemone", but I can't make it work with digits instead of extensions.

Also, testing on http://regexpal.com/ with the pattern =\d+$ will successfully match http://misc.com/test/index.php?page=news&subpage=20060118

EDIT:

Here is the entirety of my code. I wonder if anyone can see exactly what's wrong.

require 'anemone'
...
Anemone.crawl(url, :depth_limit => 3, :obey_robots_txt => true) do |anemone|
  anemone.skip_links_like /\?.*\d+$/
  anemone.on_every_page do |page|
    pURL = page.url.to_s
    puts "Now checking: " + pURL
    bestGuess[pURL] = match_freq( manList, page.doc.inner_text )
    puts "Successfully checked"
  end
end

My output is something like this:

...
Now checking: http://MISC.com/about_us/index.php?page=press_and_news&subpage=20110711
Successfully checked
...

score 3 · Accepted Answer · answered Dec 02 '11 at 00:03

Anemone.crawl(url, :depth_limit => 3, :obey_robots_txt => true, :skip_query_strings => true) do |anemone|
  anemone.on_every_page do |page|
    pURL = page.url.to_s
    puts "Now checking: " + pURL
    bestGuess[pURL] = match_freq( manList, page.doc.inner_text )
    puts "Successfully checked"
  end
end

Bhushan Lodha

This worked perfectly, thanks! Although it is a little skip-heavy: some valid pages have query strings too. Should I rewrite the code in the class? – sunnyrjuneja Dec 02 '11 at 00:10
When I turn on :skip_query_strings, it skips both http://MISC.com/ANON/index.php?page=code_of_ethics and http://MISC.com/about/index.php?page=press_and_news&subpage=20110907. I want it to crawl the former but not the latter. I only want it to skip pages whose URLs end in digits. – sunnyrjuneja Dec 03 '11 at 06:14
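
One way to skip only the digit-ending URLs, sketched under assumptions: Anemone's focus_crawl is taken to accept a block that returns the links to follow for each page, and page.links is taken to yield URI objects (both worth checking against the installed gem version). DIGIT_TAIL and links_to_follow are hypothetical names introduced for illustration. The filter itself uses only the standard library, so it can be tried without running a crawl:

```ruby
require 'uri'

# Hypothetical pattern: an "=" followed only by digits at the very end of the URL.
DIGIT_TAIL = /=\d+\z/

# Hypothetical helper: drop every link whose full URL ends in "=<digits>".
def links_to_follow(links)
  links.reject { |link| link.to_s =~ DIGIT_TAIL }
end

# The two URLs from the comment above.
sample = [
  URI("http://MISC.com/ANON/index.php?page=code_of_ethics"),
  URI("http://MISC.com/about/index.php?page=press_and_news&subpage=20110907")
]

kept = links_to_follow(sample)
puts kept.map(&:to_s)

# Assumed wiring into the crawl (not executed here):
# Anemone.crawl(url, :depth_limit => 3, :obey_robots_txt => true) do |anemone|
#   anemone.focus_crawl { |page| links_to_follow(page.links) }
#   # on_every_page block as in the question
# end
```

Because the filter sees the whole URL, including the query string, a page such as ?page=code_of_ethics is kept while ?page=press_and_news&subpage=20110907 is dropped.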

score 2 · Answer 2 · answered Dec 01 '11 at 23:17

Actually, /\?.*\d+$/ does match that URL when it is tested as a plain string:

~> irb
> all systems are go wirble/hirb/ap/show <
ruby-1.9.2-p180 :001 > "http://hiddenwebsite.com/anonimize/index.php?page=press_and_news&subpage=20060117".match /\?.*\d+$/
 => #<MatchData "?page=press_and_news&subpage=20060117">
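
A possible explanation for why the same pattern still fails inside the crawler, assuming (from my reading of the Anemone source, which should be verified against the installed version) that skip_links_like tests each pattern against only the path part of a link rather than the whole URL: the path carries no query string, so nothing after the ? is ever seen. The sketch below uses only Ruby's standard uri library to show the difference; the variable names are illustrative.

```ruby
require 'uri'

url = URI("http://hiddenwebsite.com/anonimize/index.php?page=press_and_news&subpage=20060117")
pattern = /\?.*\d+$/

# Against the full URL string the pattern matches, as in the irb session.
full_match = !!(url.to_s =~ pattern)

# Against the path alone ("/anonimize/index.php") there is no "?" to match.
path_match = !!(url.path =~ pattern)

# The query string is a separate component and does end in "=<digits>".
query_match = !!(url.query =~ /=\d+$/)

puts "full URL: #{full_match}, path only: #{path_match}, query only: #{query_match}"
```

If that assumption holds, no regex passed to skip_links_like can see subpage=20060117, however it is written.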

Fabio

This must be an issue with my code otherwise. I can't seem to get it working. – sunnyrjuneja Dec 01 '11 at 23:19
