I want to use two rules in the spider and combine them with a logical OR (||).

The code is as follows:

for urlrule in urlrules:
    if urlrule['rule'] is not 'nan':
        allSpider.rules = [Rule(LinkExtractor(allow=(urlrule['rule'],), ), callback="parse_items", follow=True)]
    elif urlrule['restrictXP'] is not 'nan':
        allSpider.rules = [Rule(LinkExtractor(restrict_xpaths=urlrule['restrictXP']), callback='parse_items', follow=True)]
    else:
        print('Undefined Rule!')
        break

The value checked by `if urlrule['rule'] is not 'nan'` is read from a CSV file.
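
For reference, this is roughly the behaviour I am aiming for: keep every rule from the CSV active at the same time (CrawlSpider applies each Rule in its rules list, so multiple rules effectively act as an OR). A minimal sketch, assuming the same `urlrules` list and `allSpider` spider as above, and that the CSV cells are plain strings:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = []
for urlrule in urlrules:
    if urlrule['rule'] != 'nan':          # compare contents with !=, not identity with `is`
        rules.append(Rule(LinkExtractor(allow=(urlrule['rule'],)),
                          callback='parse_items', follow=True))
    elif urlrule['restrictXP'] != 'nan':
        rules.append(Rule(LinkExtractor(restrict_xpaths=urlrule['restrictXP']),
                          callback='parse_items', follow=True))
    else:
        print('Undefined Rule!')

allSpider.rules = rules                   # every collected rule stays active together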

But there is a problem with my current loop: only the first branch of the if is ever taken, and when I run it, it fails with the following:

Unhandled error in Deferred:
2018-09-30 13:18:58 [twisted] CRITICAL: Unhandled error in Deferred:

2018-09-30 13:18:58 [twisted] CRITICAL:
Traceback (most recent call last):
  File "/home/reyhaneh/.local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/home/reyhaneh/.local/lib/python2.7/site-packages/scrapy/crawler.py", line 98, in crawl
    six.reraise(*exc_info)
  File "/home/reyhaneh/.local/lib/python2.7/site-packages/scrapy/crawler.py", line 79, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/home/reyhaneh/.local/lib/python2.7/site-packages/scrapy/crawler.py", line 102, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/home/reyhaneh/.local/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 100, in from_crawler
    spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
  File "/home/reyhaneh/.local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 51, in from_crawler
    spider = cls(*args, **kwargs)
  File "/home/reyhaneh/PycharmProjects/total/total.py", line 25, in __init__
    allSpider.rules = [Rule(LinkExtractor(allow=(urlrule['rule'],), ), callback="parse_items", follow=True)]
  File "/home/reyhaneh/.local/lib/python2.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 116, in __init__
    canonicalize=canonicalize, deny_extensions=deny_extensions)
  File "/home/reyhaneh/.local/lib/python2.7/site-packages/scrapy/linkextractors/__init__.py", line 57, in __init__
    for x in arg_to_iter(allow)]
  File "/usr/lib/python2.7/re.py", line 194, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.7/re.py", line 247, in _compile
    raise TypeError, "first argument must be string or compiled pattern"
TypeError: first argument must be string or compiled pattern
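
The final TypeError means the `allow` argument of LinkExtractor was not a string or compiled regex. A small sketch of one way this can happen (assuming, for illustration, that the CSV reader returns a float NaN for an empty cell instead of the string 'nan' — that value is an assumption, not taken from my actual data):

import re

value = float('nan')     # e.g. what a reader like pandas gives back for an empty CSV cell
print(value != 'nan')    # True, so the allow= branch is still taken
re.compile(value)        # raises TypeError: first argument must be string or compiled pattern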

How can I fix it?

  • In Python, `eq1 or eq2` short-circuits: if `eq1` is true, `eq2` is never evaluated. – Ankur Jyoti Phukan Sep 30 '18 at 11:00
  • 1
    Your string comparison is maybe not what you want: You compare objects with the `is` operator and not the string contents. So instead of `urlrule['rule'] is not 'nan'` you would want `urlrule['rule'] != 'nan'`. See also https://stackoverflow.com/questions/1504717/why-does-comparing-strings-in-python-using-either-or-is-sometimes-produce – Merlin1896 Sep 30 '18 at 11:02
  • @Merlin1896 Thanks for this reminder but there is still a problem – Reyhaneh Khalili Sep 30 '18 at 11:55
  • LinkExtractor is expecting a regular expression (or list of) for `allow`. What is the value of `urlrule['rule']`? – jschnurr Sep 30 '18 at 12:36
  • @jschnurr (https://stackoverflow.com/users/4108524/jschnurr) `urlrule['rule']` comes from another function that reads the CSV file and fills in the rule value – Reyhaneh Khalili Sep 30 '18 at 12:56
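
A short demo of the `is` vs `==` point raised in the comments (illustrative only: the string is built at runtime, so it is a distinct object from the literal 'nan' even though the contents are equal):

value = ''.join(['n', 'a', 'n'])   # equal to 'nan', but a separate object
print(value == 'nan')              # True  -> compares contents
print(value is 'nan')              # False -> compares object identity
print(value is not 'nan')          # True  -> so the first if branch always runs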
