I am working on a web scraper, but I have stumbled across this weird behavior when using a string placeholder in a list comprehension (here is a snippet of my code from PyCharm):

# -*- coding: utf-8 -*-
from arms_transfers.items import ArmsTransferItem
import itertools
import pycountry
import scrapy
import urllib3


class UnrocaSpider(scrapy.Spider):
    name = 'unroca'
    allowed_domains = ['unroca.org']

    country_names = [country.official_name if hasattr(country, 'official_name')
                     else country.name for country in list(pycountry.countries)]
    country_names = [name.lower().replace(' ', '-') for name in country_names]

    base_url = 'https://www.unroca.org/{}/report/{}/'
    url_param_tuples = list(itertools.product(country_names, range(2010, 2017)))
    start_urls = [base_url.format(param_tuple[0], param_tuple[1]) for param_tuple in url_param_tuples]

Here is the error:

Traceback (most recent call last):
  File "anaconda3/envs/scraper/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/cmdline.py", line 148, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/crawler.py", line 243, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/crawler.py", line 134, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/crawler.py", line 330, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "anaconda3/envs/scraper/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "Programming/my_projects/web-scrapers/arms_transfers/arms_transfers/spiders/unroca.py", line 9, in <module>
    class UnrocaSpider(scrapy.Spider):
  File "Programming/my_projects/web-scrapers/arms_transfers/arms_transfers/spiders/unroca.py", line 19, in UnrocaSpider
    start_urls = [base_url.format(param_tuple[0], param_tuple[1]) for param_tuple in url_param_tuples]
  File "Programming/my_projects/web-scrapers/arms_transfers/arms_transfers/spiders/unroca.py", line 19, in <listcomp>
    start_urls = [base_url.format(param_tuple[0], param_tuple[1]) for param_tuple in url_param_tuples]
NameError: name 'base_url' is not defined

Weirdly though, when I run this in a Jupyter notebook:

import pycountry
import itertools

country_names = [country.official_name if hasattr(country, 'official_name')
                     else country.name for country in list(pycountry.countries)]
country_names = [name.lower().replace(' ', '-') for name in country_names]

base_url = 'https://www.unroca.org/{}/report/{}/'
url_param_tuples = list(itertools.product(country_names, range(2010, 2017)))
start_urls = [base_url.format(param_tuple[0], param_tuple[1]) for param_tuple in url_param_tuples]

it works exactly as I would have expected it to in the PyCharm project:

 ['https://www.unroca.org/aruba/report/2010/',
 'https://www.unroca.org/aruba/report/2011/',
 'https://www.unroca.org/aruba/report/2012/',
 'https://www.unroca.org/aruba/report/2013/',
 'https://www.unroca.org/aruba/report/2014/',
 'https://www.unroca.org/aruba/report/2015/',
 'https://www.unroca.org/aruba/report/2016/',
 'https://www.unroca.org/islamic-republic-of-afghanistan/report/2010/',
 'https://www.unroca.org/islamic-republic-of-afghanistan/report/2011/',
 'https://www.unroca.org/islamic-republic-of-afghanistan/report/2012/',
 'https://www.unroca.org/islamic-republic-of-afghanistan/report/2013/',...]

The PyCharm project and the Jupyter notebook are using the same conda environment and Python 3.6.3 interpreter. Can anyone offer insight into what could account for the behavior differences?

  • Is this a warning in the IDE or an actual error if you run the code from PyCharm? If it's an actual error, please copy and paste it here. Is there a chance that PyCharm is actually complaining about `pycountry` being missing rather than `base_url`, but the squiggly line (if that's what you're referring to) is in the wrong place? – Arthur Tacca Nov 29 '17 at 20:37
  • I have updated my question with the error I receive when running the spider from the command line. – Bryce Freshcorn Nov 29 '17 at 20:41
  • That error refers to "line 19", but the code in your question is not 19 lines long. That might sound pedantic but I'm 100% sure the error you're getting is partly because of code that you haven't told us about. Please edit to include a snippet of code and an error you get from running EXACTLY that code. – Arthur Tacca Nov 29 '17 at 20:45
  • BTW, from that error I am suspicious that you have put your code in a class but not in a method of a class, which does not do what you expect. (I'm not sure what you would expect but it doesn't do anything sensible.) But of course I'm just guessing because I can't see all your code. – Arthur Tacca Nov 29 '17 at 20:48
  • Just updated the code to include all lines up to line 19. The error listed is the same as running the command through `scrapy crawl`. – Bryce Freshcorn Nov 29 '17 at 20:48
  • After all that I actually don't have time to post a proper answer! I'm AFK in 10 seconds. But you don't normally write code directly in classes, it goes in methods of classes, and `base_url` should be referred to as e.g. `self.base_url` (what is `self? well it doesn't exist in your code because you're not using classes properly). – Arthur Tacca Nov 29 '17 at 20:57
  • Is there a better way to do this in `scrapy`? I am just trying to generate URLs for the spider to crawl. `start_urls` is a list of URLs that will be pulled by the spider for processing in a parsing function. That parsing function is the only function I need to work with in this class. – Bryce Freshcorn Nov 29 '17 at 21:03
  • List comprehensions at class scope basically just don't work. – user2357112 Nov 29 '17 at 21:15
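
As the last comment points out, in Python 3 the body of a comprehension runs in its own scope and therefore cannot see other class-level names (only the outermost iterable expression is evaluated at class scope). A minimal sketch using a hypothetical `Demo` class illustrates both the failure and a workaround:

```python
class Demo:
    base_url = 'https://www.unroca.org/{}/report/{}/'
    params = [('aruba', 2010), ('aruba', 2011)]

    # This raises NameError: the comprehension body cannot see `base_url`,
    # because the body executes in its own scope, not the class scope.
    # urls = [base_url.format(c, y) for c, y in params]

    # Workaround: the outermost iterable IS evaluated at class scope,
    # so `base_url` can be threaded through it via zip().
    urls = [u.format(c, y)
            for u, (c, y) in zip([base_url] * len(params), params)]

print(Demo.urls)
# → ['https://www.unroca.org/aruba/report/2010/',
#    'https://www.unroca.org/aruba/report/2011/']
```

This is why the same comprehension works at module scope in Jupyter: there, `base_url` is a global, and globals are visible inside comprehension bodies.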

1 Answer

To answer my own question: if you need to generate your own list of starting URLs for `scrapy.Spider` classes, you should override `scrapy.Spider.start_requests(self)`. In my case, this looks like:

import itertools

import pycountry
import scrapy


class UnrocaSpider(scrapy.Spider):
    name = 'unroca'
    allowed_domains = ['unroca.org']

    def start_requests(self):
        country_names = [country.official_name if hasattr(country, 'official_name')
                         else country.name for country in list(pycountry.countries)]
        country_names = [name.lower().replace(' ', '-') for name in country_names]

        base_url = 'https://www.unroca.org/{}/report/{}/'
        url_param_tuples = list(itertools.product(country_names, range(2010, 2017)))
        start_urls = [base_url.format(param_tuple[0], param_tuple[1]) for param_tuple in url_param_tuples]
        for url in start_urls:
            yield scrapy.Request(url, self.parse)
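
The URL generation can also be factored into a standalone generator, which keeps `start_requests()` short. A sketch (country list shortened for the example; in the spider the names would come from `pycountry` as above, and each URL would be wrapped in a `scrapy.Request`):

```python
import itertools

def generate_report_urls(countries, years):
    """Yield one UNROCA report URL per (country, year) pair."""
    base_url = 'https://www.unroca.org/{}/report/{}/'
    for country, year in itertools.product(countries, years):
        yield base_url.format(country, year)

urls = list(generate_report_urls(['aruba'], range(2010, 2012)))
# → ['https://www.unroca.org/aruba/report/2010/',
#    'https://www.unroca.org/aruba/report/2011/']
```

Because `base_url` lives in function scope here rather than class scope, the comprehension/generator sees it without any workaround.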