I'm deploying a bunch of spiders on ScrapingHub. The spiders themselves work fine. I would like to change the feed output depending on whether a spider is running locally or on ScrapingHub: if it is running locally, write to a temp folder; if it is running on ScrapingHub, write to S3. The idea is to use an environment variable as a switch between the two. However, printing the environment variables that I've set through the ScrapingHub interface from the project's settings.py returns None. The code snippet below shows what I attempted.
Strangely enough, if I default the feed to S3 only (no switching based on environment variables), the S3 upload works, even though the S3 credentials are also loaded from environment variables and printing them likewise returns None. Changing the AWS keys, however, causes the upload to fail, so the values do reach Scrapy at some point, just apparently not when settings.py is first loaded. Setting the environment variables at the project level or at the spider level made no difference.
My question is: what is the correct way to use environment variables in a Scrapy project that is to be deployed on ScrapingHub?
# In settings.py
import os
from datetime import date

today = date.today()

# Get the AWS credentials from the environment
AWS_ACCESS_KEY_ID = os.environ.get('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY')

# Print the credentials to check whether they are visible at import time
print(AWS_ACCESS_KEY_ID)       # Outputs None
print(AWS_SECRET_ACCESS_KEY)   # Outputs None

# Write the feed to S3, partitioned by spider name and date
FEED_URI = 's3://my-bucket/%(name)s/{}.json'.format(today.strftime('%Y-%m-%d'))
FEED_FORMAT = 'json'
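For completeness, the switch I had in mind looks roughly like this. This is only a sketch: the environment-variable name ON_SCRAPINGHUB, the bucket name, and the local path are placeholders I chose for illustration.

# Sketch of the intended switch in settings.py. ON_SCRAPINGHUB and both
# output locations are placeholders; the point is just the branching.
import os
from datetime import date

today = date.today()

if os.environ.get('ON_SCRAPINGHUB'):
    # Running on ScrapingHub: write the feed to S3
    FEED_URI = 's3://my-bucket/%(name)s/{}.json'.format(today.strftime('%Y-%m-%d'))
else:
    # Running locally: write the feed to a temp folder
    FEED_URI = 'file:///tmp/scrapy-feeds/%(name)s/{}.json'.format(today.strftime('%Y-%m-%d'))

FEED_FORMAT = 'json'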
EDIT:
I've found a support ticket on ScrapingHub describing an identical issue. The problem seems to be the order in which the settings from the UI are applied and overwritten; there does not appear to be any documentation about this. In addition, the S3 problem goes away on the scrapy:1.4 stack, while the latest scrapy:1.6 stack makes it appear. There is still no satisfactory solution.
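To show how I've been checking where the values become visible (at import time via os.environ versus later, once the crawl is running), I've been using a throwaway spider along these lines. The spider name and URL are made up; this is purely a diagnostic, not a workaround.

# Hypothetical probe spider: compares what os.environ and the Scrapy
# settings object report once the spider is actually running.
import os
import scrapy

class ProbeSpider(scrapy.Spider):
    name = 'probe'
    start_urls = ['https://example.com']

    def start_requests(self):
        # Logged once the spider is running, not when settings.py is imported
        self.logger.info('os.environ AWS key: %s',
                         os.environ.get('AWS_ACCESS_KEY_ID'))
        self.logger.info('settings AWS key: %s',
                         self.settings.get('AWS_ACCESS_KEY_ID'))
        yield from super().start_requests()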