I am using Scrapy to crawl a website and extract data from it. Scrapy uses regex-based rules to decide whether a page should be parsed or a link followed.
I am implementing a resume feature for my spider so that it can continue crawling from the last visited page. To do this, I get the last followed link from a database when the spider is launched.
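For illustration, this is roughly how I retrieve it (the database path, table and column names below are just placeholders for my real schema):

```python
import re
import sqlite3

def get_last_page_number(db_path="crawl_state.db"):
    """Fetch the last followed URL from my database and pull out its page number.
    The table/column names are placeholders for my actual schema."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT url FROM visited ORDER BY id DESC LIMIT 1"
    ).fetchone()
    conn.close()
    if row is None:
        return 1  # nothing visited yet, start from the first page
    match = re.search(r"page(\d+)\.html", row[0])
    return int(match.group(1)) if match else 1
```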
My site URLs look like http://foobar.com/page1.html, so the rule's regex to follow every such link would usually be something like /page\d+\.html.
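To make the setup concrete, this is roughly what the spider looks like today (the spider name and parsing details are trimmed and only illustrative):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FoobarSpider(CrawlSpider):
    name = "foobar"
    allowed_domains = ["foobar.com"]
    start_urls = ["http://foobar.com/page1.html"]

    rules = (
        # Follow every numbered page and parse it.
        Rule(LinkExtractor(allow=r"/page\d+\.html"), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # ... extract items here ...
        pass
```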
But how can I write a regex so that it matches, for example, page 15 and above? Also, since I don't know the starting page number in advance, how could I generate this regex at runtime?
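The only idea I have come up with so far is to build the pattern digit by digit, something like the sketch below (min_page_regex is my own helper, not a Scrapy API), but I am not sure this is the right approach:

```python
def min_page_regex(n):
    """Build a regex fragment matching integers >= n (assuming no leading zeros)."""
    s = str(n)
    parts = [r"\d{%d,}" % (len(s) + 1)]  # any number with more digits is larger
    for i, ch in enumerate(s):
        d = int(ch)
        if i == len(s) - 1:
            # last digit: same prefix, digit >= d
            parts.append(s[:i] + "[%d-9]" % d)
        elif d < 9:
            # same length, same prefix, this digit strictly larger
            parts.append(s[:i] + "[%d-9]" % (d + 1) + r"\d{%d}" % (len(s) - i - 1))
    return "(?:" + "|".join(parts) + ")"

# e.g. min_page_regex(15) == r"(?:\d{3,}|[2-9]\d{1}|1[5-9])"
# which I would then plug into the rule as r"/page" + min_page_regex(15) + r"\.html"
```

Is there a cleaner way to express "this number or greater" in a regex, or a better place in Scrapy to handle this?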