
I am using Scrapy to crawl a website and extract data from it. Scrapy uses regex-based rules to check whether a page should be parsed, or a link followed.

I am implementing a resume feature for my spider, so that it can continue crawling from the last visited page. To do this, I fetch the last followed link from a database when the spider is launched.

My site's URLs look like http://foobar.com/page1.html, so the rule's regex to follow every such link would usually be something like /page\d+\.html.

But how can I write a regex that matches, for example, page 15 and above? Also, since I don't know the starting point in advance, how can I generate this regex at runtime?

mdeous
    See "[To use or not to use regular expressions?](http://stackoverflow.com/questions/4098086/to-use-or-not-to-use-regular-expressions/4098123#4098123)". –  Mar 05 '11 at 18:05
  • oh :( as i saw it was possible with Perl regexes, i thought there would be a way to achieve the same in python. – mdeous Mar 05 '11 at 18:18
  • 1
    @delnan - I agree that re's for this sound like a poor idea, but I think the OP is forced to use them by scrapy's API design, so not using them is not really an available option. – PaulMcG Mar 06 '11 at 00:00

4 Answers


Why not group the page number, then check if it is qualified:

>>> import re
>>> m = re.match(r"/page(\d+)\.html", "/page18.html")
>>> if m:
...     ID = int(m.groups()[0])
...
>>> ID > 15
True

Or more specifically what you requested:

>>> def genRegex(n):
...     return ''.join('[' + "0123456789"[int(d):] + ']' for d in str(n))
...
>>> genRegex(123)
'[123456789][23456789][3456789]'
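A quick sanity check of the generated pattern (a sketch; note, as the comments below point out, that it only matches numbers with the same digit count, and it also misses numbers like 20 whose later digits fall below the threshold's):

```python
import re

# genRegex as defined above: one character class per digit of n
def genRegex(n):
    return ''.join('[' + "0123456789"[int(d):] + ']' for d in str(n))

pattern = re.compile(genRegex(15))  # '[123456789][56789]'
print(bool(pattern.fullmatch("18")))   # True
print(bool(pattern.fullmatch("14")))   # False
print(bool(pattern.fullmatch("150")))  # False: three digits, pattern has two classes
print(bool(pattern.fullmatch("20")))   # False: 0 is below the second class [5-9]
```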
Kabie
  • `m.group(1)` is neater. Otherwise, exactly what I wanted to say with my comment -> +1 –  Mar 05 '11 at 18:10
  • i can't do this because the regex is not processed by my code, but by scrapy's rules engine, otherwise yes, it would have been the easiest solution. – mdeous Mar 05 '11 at 18:16
  • i'm sorry, it still doesn't fit my problem, for example genRegex(50) wouldn't match 150. But i start thinking regexes are not the solution for my problem, and i'll have to find something else to achieve what i want to. – mdeous Mar 05 '11 at 18:51
  • you should add a `[your stuff]|\d{'+str(len(str(d))+1)+',}`, for it to match numbers with more digits than the one given .. – Gabi Purcaru Mar 05 '11 at 19:53

Try this:

def digit_match_greater(n):
    digits = str(n)
    variations = []
    # Anything with more than len(digits) digits is a match:
    variations.append(r"\d{%d,}" % (len(digits)+1))
    # Now match numbers with len(digits) digits.
    # (Generate, e.g, for 15, "1[6-9]", "[2-9]\d")
    # 9s can be skipped -- e.g. for >19 we only need [2-9]\d.
    for i, d in enumerate(digits):
        if d != "9": 
            pattern = list(digits)
            pattern[i] = "[%d-9]" % (int(d) + 1)
            for j in range(i+1, len(digits)):
                pattern[j] = r"\d"
            variations.append("".join(pattern))
    return "(?:%s)" % "|".join("(?:%s)" % v for v in variations)

It turned out easier to make it match numbers greater than the parameter, so if you give it 15, it'll return a string for matching numbers 16 and greater, specifically...

(?:(?:\d{3,})|(?:[2-9]\d)|(?:1[6-9]))

You can then substitute this into your expression instead of \d+, like so:

exp = re.compile(r"page%s\.html" % digit_match_greater(last_page_visited))
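To see the helper in context, here is a quick check of the compiled expression (the function body is the one above; the page numbers are made up for illustration):

```python
import re

# digit_match_greater as defined above
def digit_match_greater(n):
    digits = str(n)
    variations = []
    variations.append(r"\d{%d,}" % (len(digits) + 1))
    for i, d in enumerate(digits):
        if d != "9":
            pattern = list(digits)
            pattern[i] = "[%d-9]" % (int(d) + 1)
            for j in range(i + 1, len(digits)):
                pattern[j] = r"\d"
            variations.append("".join(pattern))
    return "(?:%s)" % "|".join("(?:%s)" % v for v in variations)

exp = re.compile(r"page%s\.html" % digit_match_greater(15))
print(bool(exp.fullmatch("page16.html")))   # True
print(bool(exp.fullmatch("page20.html")))   # True
print(bool(exp.fullmatch("page150.html")))  # True
print(bool(exp.fullmatch("page15.html")))   # False: only pages after 15 match
```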
Martin Stone

Extending Kabie's answer a little:

def genregex(n):
    nstr = str(n)
    same_digit = ''.join('[' + "0123456789"[int(d):] + ']' for d in nstr)
    return r"\d{%d,}|%s" % (len(nstr) + 1, same_digit)
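For example, with the starting point 50 from the comment thread above, a quick sketch confirming that the extra `\d{%d,}` branch now catches numbers with more digits:

```python
import re

# genregex as defined above
def genregex(n):
    nstr = str(n)
    same_digit = ''.join('[' + "0123456789"[int(d):] + ']' for d in nstr)
    return r"\d{%d,}|%s" % (len(nstr) + 1, same_digit)

pattern = re.compile(r"(?:%s)" % genregex(50))  # '(?:\d{3,}|[56789][0123456789])'
print(bool(pattern.fullmatch("50")))   # True
print(bool(pattern.fullmatch("150")))  # True: caught by the \d{3,} branch
print(bool(pattern.fullmatch("49")))   # False
```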

It's easy to modify this to handle leading 0s if they occur on your website. But this seems like the wrong approach.

You have a few other options in scrapy. You're probably using SgmlLinkExtractor, in which case the easiest thing is to pass your own function as the process_value keyword argument to do your custom filtering.

You can customize CrawlSpider quite a lot, but if it doesn't fit your task, you should check out BaseSpider.
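A sketch of the process_value approach (assumptions: the extractor is SgmlLinkExtractor, and last_page holds the page number loaded from the database; the filter itself is plain Python, so it is shown standalone here):

```python
import re

last_page = 15  # hypothetical: last visited page number, loaded from the database

def process_value(url):
    # Return the URL to keep the link, or None to drop it.
    m = re.search(r"/page(\d+)\.html", url)
    if m and int(m.group(1)) > last_page:
        return url
    return None

# In the spider's rules, something like:
#   SgmlLinkExtractor(allow=r'/page\d+\.html', process_value=process_value)
print(process_value("http://foobar.com/page16.html"))  # kept
print(process_value("http://foobar.com/page3.html"))   # None: already visited
```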

Shane Evans
>>> import re
>>> import random
>>> n = random.randint(100, 1000000)
>>> n
435220
>>> len(str(n))
6
>>> '\d' * len(str(n))
'\\d\\d\\d\\d\\d\\d'
>>> reg = r'\d{%d}' % len(str(n))
>>> m = re.search(reg, str(n))
>>> m.group(0)
'435220'
dawg
  • Fail: matches numbers less than n. >>> re.search(reg,str(n-1)) --> <_sre.SRE_Match object at 0x00AA7250> – Martin Stone Mar 05 '11 at 18:58
  • My intent was not to match the exact number, but any number of the same number of digits. The OP said "any number" not the exact number. If you want the exact number: `re.search(str(n),str-match-against)` – dawg Mar 06 '11 at 19:38
  • The question seems pretty unambiguous, even in the title alone: "from 'n' to infinite". – Martin Stone Mar 07 '11 at 08:18