I'm deploying a bunch of spiders on ScrapingHub. The spiders themselves work fine. I would like to change the feed output depending on whether a spider is running locally or on ScrapingHub: if it is running locally, write to a temp folder; if it is running on ScrapingHub, write to S3. The idea is to use an environment variable as a switch between the two. However, printing the environment variables that I've set through the ScrapingHub interface from the project's settings.py returns None. The code snippet below shows what I attempted.
Strangely enough, if I default the feed to S3 only (no switching based on environment variables), the S3 upload works, even though the S3 credentials are also loaded from environment variables and printing them likewise returns None. Changing the AWS keys, however, causes the upload to fail, so the values do reach Scrapy at some point, just apparently not when settings.py is first loaded. Setting the environment variables at the project level or at the spider level made no difference.
My question is: what is the correct way to use environment variables in a Scrapy project that is to be deployed on ScrapingHub?
# In settings.py
import os
from datetime import date

today = date.today()

# Get the AWS credentials from the environment
AWS_ACCESS_KEY_ID = os.environ.get('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY')

# Print the credentials to check whether they are visible at import time
print(AWS_ACCESS_KEY_ID)       # Outputs None
print(AWS_SECRET_ACCESS_KEY)   # Outputs None

# Write the feed to S3, partitioned by spider name and date
FEED_URI = 's3://my-bucket/%(name)s/{}.json'.format(today.strftime('%Y-%m-%d'))
FEED_FORMAT = 'json'
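For completeness, the switch I had in mind looks roughly like this. This is only a sketch: the environment-variable name ON_SCRAPINGHUB, the bucket name, and the local path are placeholders I chose for illustration.

# Sketch of the intended switch in settings.py. ON_SCRAPINGHUB and both
# output locations are placeholders; the point is just the branching.
import os
from datetime import date

today = date.today()

if os.environ.get('ON_SCRAPINGHUB'):
    # Running on ScrapingHub: write the feed to S3
    FEED_URI = 's3://my-bucket/%(name)s/{}.json'.format(today.strftime('%Y-%m-%d'))
else:
    # Running locally: write the feed to a temp folder
    FEED_URI = 'file:///tmp/scrapy-feeds/%(name)s/{}.json'.format(today.strftime('%Y-%m-%d'))

FEED_FORMAT = 'json'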
EDIT:
I've found a support ticket on ScrapingHub describing an identical issue. The problem seems to be the order in which the settings from the UI are applied and overwritten; there does not appear to be any documentation about this. In addition, the S3 problem goes away on the scrapy:1.4 stack, while the latest scrapy:1.6 stack makes it appear. There is still no satisfactory solution.
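To show how I've been checking where the values become visible (at import time via os.environ versus later, once the crawl is running), I've been using a throwaway spider along these lines. The spider name and URL are made up; this is purely a diagnostic, not a workaround.

# Hypothetical probe spider: compares what os.environ and the Scrapy
# settings object report once the spider is actually running.
import os
import scrapy

class ProbeSpider(scrapy.Spider):
    name = 'probe'
    start_urls = ['https://example.com']

    def start_requests(self):
        # Logged once the spider is running, not when settings.py is imported
        self.logger.info('os.environ AWS key: %s',
                         os.environ.get('AWS_ACCESS_KEY_ID'))
        self.logger.info('settings AWS key: %s',
                         self.settings.get('AWS_ACCESS_KEY_ID'))
        yield from super().start_requests()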