Scrapy: re.match not working when finding URLs in a string using regex

Question

I am trying to crawl many URLs in the same domain. I have the list of URLs in a string, and I want to search that string with a regex to find the URLs. But re.match() always returns None. I tested my regex and it works. This is my code:

# -*- coding: UTF-8 -*-

import scrapy
import codecs 
import re

from scrapy.contrib.spiders import CrawlSpider, Rule

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from scrapy import Request

from scrapy.selector import HtmlXPathSelector

from hurriyet.items import HurriyetItem

class hurriyet_spider(CrawlSpider):
    name = 'hurriyet'
    allowed_domains = ['hurriyet.com.tr']
    start_urls = ['http://www.hurriyet.com.tr/gundem/']

    rules = (Rule(SgmlLinkExtractor(allow=('\/gundem(\/\S*)?.asp$')),'parse',follow=True),) 

    def parse(self, response):
        image = HurriyetItem()
        text =  response.xpath("//a/@href").extract()
        print text

        urls = ''.join(text)


        page_links = re.match("(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’]))", urls, re.M)

        image['title'] = response.xpath("//h1[@class = 'title selectionShareable'] | //h1[@itemprop = 'name']/text()").extract()
        image['body'] = response.xpath("//div[@class = 'detailSpot']").extract()
        image['body2'] = response.xpath("//div[@class = 'ctx_content'] ").extract()
        print page_links

        return image, text

Use `re.findall`. `re.match` only matches at the beginning of a string. – Wiktor Stribiżew, Apr 30 '15 at 07:11
I tried it, and it still does not work: re.match() returns None and re.findall() returns []. – Kerim Caner Tümkaya, Apr 30 '15 at 07:15
That means your regex is at fault. Does the regex from this post help: http://stackoverflow.com/questions/1141848/regex-to-match-url ? Here is another one to check: https://mathiasbynens.be/demo/url-regex – Wiktor Stribiżew, Apr 30 '15 at 07:47
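
The difference between the two calls mentioned in the comments can be checked in isolation. This is a minimal sketch using only the standard `re` module; the sample string and the simplified pattern are made up for illustration and are not the asker's data or regex:

```python
import re

# Hypothetical sample text; note that it does not start with a URL.
sample = "see http://www.hurriyet.com.tr/gundem/ and http://www.hurriyet.com.tr/spor/"

# Deliberately simple pattern for illustration (not a complete URL regex).
url_pattern = re.compile(r"https?://[^\s]+")

# match() only tries to match at the very start of the string.
at_start = url_pattern.match(sample)

# findall() scans the whole string and returns every non-overlapping match.
everywhere = url_pattern.findall(sample)

print(at_start)
print(everywhere)
```

Here `match()` finds nothing because the string begins with "see", while `findall()` picks up both URLs.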

score 0 · Accepted Answer · answered Apr 30 '15 at 07:16

There is no need to use the re module. Scrapy selectors have a built-in feature for regex filtering:

def parse(self, response):
    ...
    page_links = response.xpath("//a/@href").re('your_regex_expression')
    ...

That said, I suggest you try this approach in the Scrapy shell first to make sure your regex actually works, because I wouldn't expect people to try to debug a mile-long regex - it's basically a write-only language :)
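
One thing worth checking while testing: in the question the pattern is written as a regular (non-raw) string literal, so Python turns `\b` into a backspace character before the regex engine ever sees it. That may be why even `re.findall` returned `[]`. Here is a small sketch of the effect, with a made-up sample and a much simpler pattern than the one in the question:

```python
import re

# Hypothetical sample input for illustration.
sample = "http://www.hurriyet.com.tr/gundem/"

# Non-raw literal: "\b" is the backspace character (0x08), not a word boundary.
not_raw = "\bhttp://\\S+"
# Raw literal: the backslash reaches the regex engine, so \b is a word boundary.
raw = r"\bhttp://\S+"

print(len("\b"), len(r"\b"))
print(re.findall(not_raw, sample))
print(re.findall(raw, sample))
```

The non-raw version looks for a literal backspace in front of the URL and finds nothing; the raw version matches. The same applies to a pattern passed to the selector's `.re()` method, since it is also just a Python string literal.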

answered Apr 30 '15 at 07:16 by bosnjak

Hey, why `write-only`? :) Have a look at the question I just answered: http://stackoverflow.com/questions/29960796/regex-is-not-working-for-some-cases-php/29961067#29961060. Regexes are not so unreadable. – Wiktor Stribiżew Apr 30 '15 at 07:17
It's a joke on how ugly they are for reading, but more fluent when writing. – bosnjak Apr 30 '15 at 07:19
You jest. The regexp above is only _half_ a mile long, and sites like [regex101.com](http://regex101.com/) can turn it into beautiful abstract art. – lcd047 Apr 30 '15 at 08:01
@lcd047: I'm not one to compete in who-has-a-bigger-regex, so I didn't actually measure; it was more of a quick approximation :D – bosnjak Apr 30 '15 at 08:52
