I am working on a web scraper, but I have stumbled across this weird behavior when using a string placeholder in a list comprehension (here is a snippet of my code from PyCharm):

# -*- coding: utf-8 -*-
from arms_transfers.items import ArmsTransferItem
import itertools
import pycountry
import scrapy
import urllib3


class UnrocaSpider(scrapy.Spider):
    name = 'unroca'
    allowed_domains = ['unroca.org']

    country_names = [country.official_name if hasattr(country, 'official_name')
                     else country.name for country in list(pycountry.countries)]
    country_names = [name.lower().replace(' ', '-') for name in country_names]

    base_url = 'https://www.unroca.org/{}/report/{}/'
    url_param_tuples = list(itertools.product(country_names, range(2010, 2017)))
    start_urls = [base_url.format(param_tuple[0], param_tuple[1]) for param_tuple in url_param_tuples]

Here is the error:

Traceback (most recent call last):
  File "anaconda3/envs/scraper/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/cmdline.py", line 148, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/crawler.py", line 243, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/crawler.py", line 134, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/crawler.py", line 330, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "anaconda3/envs/scraper/lib/python3.6/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "anaconda3/envs/scraper/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "Programming/my_projects/web-scrapers/arms_transfers/arms_transfers/spiders/unroca.py", line 9, in <module>
    class UnrocaSpider(scrapy.Spider):
  File "Programming/my_projects/web-scrapers/arms_transfers/arms_transfers/spiders/unroca.py", line 19, in UnrocaSpider
    start_urls = [base_url.format(param_tuple[0], param_tuple[1]) for param_tuple in url_param_tuples]
  File "Programming/my_projects/web-scrapers/arms_transfers/arms_transfers/spiders/unroca.py", line 19, in <listcomp>
    start_urls = [base_url.format(param_tuple[0], param_tuple[1]) for param_tuple in url_param_tuples]
NameError: name 'base_url' is not defined

Weirdly though, when I run this in a Jupyter notebook:

import pycountry
import itertools

country_names = [country.official_name if hasattr(country, 'official_name')
                     else country.name for country in list(pycountry.countries)]
country_names = [name.lower().replace(' ', '-') for name in country_names]

base_url = 'https://www.unroca.org/{}/report/{}/'
url_param_tuples = list(itertools.product(country_names, range(2010, 2017)))
start_urls = [base_url.format(param_tuple[0], param_tuple[1]) for param_tuple in url_param_tuples]

it works exactly as I would have expected it to in the PyCharm project:

 ['https://www.unroca.org/aruba/report/2010/',
 'https://www.unroca.org/aruba/report/2011/',
 'https://www.unroca.org/aruba/report/2012/',
 'https://www.unroca.org/aruba/report/2013/',
 'https://www.unroca.org/aruba/report/2014/',
 'https://www.unroca.org/aruba/report/2015/',
 'https://www.unroca.org/aruba/report/2016/',
 'https://www.unroca.org/islamic-republic-of-afghanistan/report/2010/',
 'https://www.unroca.org/islamic-republic-of-afghanistan/report/2011/',
 'https://www.unroca.org/islamic-republic-of-afghanistan/report/2012/',
 'https://www.unroca.org/islamic-republic-of-afghanistan/report/2013/',...]

The PyCharm project and the Jupyter notebook are using the same conda environment and Python 3.6.3 interpreter. Can anyone offer insight into what could account for the behavior differences?

  • Is this a warning in the IDE or an actual error if you run the code from PyCharm? If it's an actual error, please copy and paste it here. Is there a chance that PyCharm is actually complaining about `pycountry` being missing rather than `base_url`, but the squiggly line (if that's what you're referring to) is in the wrong place? – Arthur Tacca Nov 29 '17 at 20:37
  • I have updated my question with the error I receive when running the spider from the command line. – Bryce Freshcorn Nov 29 '17 at 20:41
  • That error refers to "line 19", but the code in your question is not 19 lines long. That might sound pedantic but I'm 100% sure the error you're getting is partly because of code that you haven't told us about. Please edit to include a snippet of code and an error you get from running EXACTLY that code. – Arthur Tacca Nov 29 '17 at 20:45
  • BTW, from that error I am suspicious that you have put your code in a class but not in a method of a class, which does not do what you expect. (I'm not sure what you would expect but it doesn't do anything sensible.) But of course I'm just guessing because I can't see all your code. – Arthur Tacca Nov 29 '17 at 20:48
  • Just updated the code to include all lines up to line 19. The error listed is the same as running the command through `scrapy crawl`. – Bryce Freshcorn Nov 29 '17 at 20:48
  • After all that I actually don't have time to post a proper answer! I'm AFK in 10 seconds. But you don't normally write code directly in classes, it goes in methods of classes, and `base_url` should be referred to as e.g. `self.base_url` (what is `self? well it doesn't exist in your code because you're not using classes properly). – Arthur Tacca Nov 29 '17 at 20:57
  • Is there a better way to do this in `scrapy`? I am just trying to generate URLs for the spider to crawl. `start_urls` is a list of URLs that will be pulled by the spider for processing in a parsing function. That parsing function is the only function I need to work with in this class. – Bryce Freshcorn Nov 29 '17 at 21:03
  • List comprehensions at class scope basically just don't work. – user2357112 Nov 29 '17 at 21:15
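
As the last comment points out, in Python 3 the body of a comprehension runs in its own scope and therefore cannot see other class-level names (only the outermost iterable expression is evaluated at class scope). A minimal sketch using a hypothetical `Demo` class illustrates both the failure and a workaround:

```python
class Demo:
    base_url = 'https://www.unroca.org/{}/report/{}/'
    params = [('aruba', 2010), ('aruba', 2011)]

    # This raises NameError: the comprehension body cannot see `base_url`,
    # because the body executes in its own scope, not the class scope.
    # urls = [base_url.format(c, y) for c, y in params]

    # Workaround: the outermost iterable IS evaluated at class scope,
    # so `base_url` can be threaded through it via zip().
    urls = [u.format(c, y)
            for u, (c, y) in zip([base_url] * len(params), params)]

print(Demo.urls)
# → ['https://www.unroca.org/aruba/report/2010/',
#    'https://www.unroca.org/aruba/report/2011/']
```

This is why the same comprehension works at module scope in Jupyter: there, `base_url` is a global, and globals are visible inside comprehension bodies.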

1 Answer

To answer my own question: if you need to generate your own list of starting URLs for `scrapy.Spider` classes, you should override `scrapy.Spider.start_requests(self)`. In my case, this looks like:

import itertools

import pycountry
import scrapy


class UnrocaSpider(scrapy.Spider):
    name = 'unroca'
    allowed_domains = ['unroca.org']

    def start_requests(self):
        country_names = [country.official_name if hasattr(country, 'official_name')
                         else country.name for country in list(pycountry.countries)]
        country_names = [name.lower().replace(' ', '-') for name in country_names]

        base_url = 'https://www.unroca.org/{}/report/{}/'
        url_param_tuples = list(itertools.product(country_names, range(2010, 2017)))
        start_urls = [base_url.format(param_tuple[0], param_tuple[1]) for param_tuple in url_param_tuples]
        for url in start_urls:
            yield scrapy.Request(url, self.parse)
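
The URL generation can also be factored into a standalone generator, which keeps `start_requests()` short. A sketch (country list shortened for the example; in the spider the names would come from `pycountry` as above, and each URL would be wrapped in a `scrapy.Request`):

```python
import itertools

def generate_report_urls(countries, years):
    """Yield one UNROCA report URL per (country, year) pair."""
    base_url = 'https://www.unroca.org/{}/report/{}/'
    for country, year in itertools.product(countries, years):
        yield base_url.format(country, year)

urls = list(generate_report_urls(['aruba'], range(2010, 2012)))
# → ['https://www.unroca.org/aruba/report/2010/',
#    'https://www.unroca.org/aruba/report/2011/']
```

Because `base_url` lives in function scope here rather than class scope, the comprehension/generator sees it without any workaround.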