I am trying to crawl many URLs in the same domain. I have the list of URLs in a string, and I want to search that string with a regex to find the URLs, but re.match() always returns None. I tested my regex and it works. This is my code:

# -*- coding: UTF-8 -*-

import scrapy
import codecs 
import re

from scrapy.contrib.spiders import CrawlSpider, Rule

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from scrapy import Request

from scrapy.selector import HtmlXPathSelector

from hurriyet.items import HurriyetItem

class hurriyet_spider(CrawlSpider):
    name = 'hurriyet'
    allowed_domains = ['hurriyet.com.tr']
    start_urls = ['http://www.hurriyet.com.tr/gundem/']

    rules = (Rule(SgmlLinkExtractor(allow=('\/gundem(\/\S*)?.asp$')),'parse',follow=True),) 

    def parse(self, response):
        image = HurriyetItem()
        text =  response.xpath("//a/@href").extract()
        print text

        urls = ''.join(text)


        page_links = re.match("(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’]))", urls, re.M)

        image['title'] = response.xpath("//h1[@class = 'title selectionShareable'] | //h1[@itemprop = 'name']/text()").extract()
        image['body'] = response.xpath("//div[@class = 'detailSpot']").extract()
        image['body2'] = response.xpath("//div[@class = 'ctx_content'] ").extract()
        print page_links

        return image, text
bosnjak

1 Answer

There is no need to use the re module; Scrapy selectors have a built-in .re() method for regex filtering:

def parse(self, response):
    ...
    # .re() applies the pattern to the extracted hrefs and returns a list of matching strings
    page_links = response.xpath("//a/@href").re(r'your_regex_expression')
    ...

That said, I suggest trying this approach in the Scrapy shell first to make sure your regex actually works, because I wouldn't expect people to debug a mile-long regex. It's basically a write-only language :)
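If you do want to stay with the re module, here is a minimal sketch of three things worth checking, any of which could plausibly explain the None result in the question. The `hrefs` list and the short `url_pattern` are assumptions made up for illustration; they stand in for the extracted links and for the long regex from the question.

```python
import re

# Hypothetical hrefs standing in for response.xpath("//a/@href").extract()
hrefs = [
    "http://www.hurriyet.com.tr/gundem/",
    "/gundem/28874042.asp",
    "http://www.hurriyet.com.tr/gundem/28874043.asp",
]

# Simplified pattern for illustration only (not the regex from the question).
# The r prefix matters: in a normal string "\b" is a backspace character,
# while in a raw string it reaches the regex engine as a word boundary.
url_pattern = r"\bhttps?://\S+"
backspace_not_boundary = ("\b" == "\x08")

# re.match() only tries to match at the very start of the string;
# re.search() scans the whole string for the first match.
text = "see http://www.hurriyet.com.tr/gundem/ for news"
match_result = re.match(url_pattern, text)
search_result = re.search(url_pattern, text)

# ''.join() glues the hrefs together with no separator, so the URLs run
# into each other; joining with a newline keeps them apart.
glued = re.findall(url_pattern, "".join(hrefs))
separated = re.findall(url_pattern, "\n".join(hrefs))

print(match_result)            # no match at position 0
print(search_result.group())   # first URL found anywhere in the text
print(len(glued), len(separated))
```

re.findall() (or re.finditer()) is probably closer to what the question wants than re.match(), since it returns every match in the string rather than testing only the beginning.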

bosnjak
  • Hey, why `write-only`? :) Have a look at the question I just answered: http://stackoverflow.com/questions/29960796/regex-is-not-working-for-some-cases-php/29961067#29961060. Regexes are not so unreadable. – Wiktor Stribiżew Apr 30 '15 at 07:17
  • It's a joke about how ugly they are to read, even though they flow more easily when writing. – bosnjak Apr 30 '15 at 07:19
  • You jest. The regexp above is only _half_ a mile long, and sites like [regex101.com](http://regex101.com/) can turn it into beautiful abstract art. – lcd047 Apr 30 '15 at 08:01
  • @lcd047: I'm not one to compete in who-has-a-bigger-regex, so I didn't actually measure; it was more of a quick approximation :D – bosnjak Apr 30 '15 at 08:52