I'm just starting to learn Scrapy and I have such a question. for my "spider" I have to take a list of urls (start_urls) from the google sheets table and I have this code:
import gspread
from oauth2client.service_account import ServiceAccountCredentials
scope = ['https://spreadsheets.google.com/feeds','https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('token.json', scope)
client = gspread.authorize(creds)
sheet = client.open('Sheet_1')
sheet_instance = sheet.get_worksheet(0)
records_data = sheet_instance.col_values(col=2)
for link in records_data:
print(link)
........
How do I configure the middleware so that when the spider (scrappy crawl my_spider
) is launched, links from this code are automatically substituted into start_urls? perhaps i need to create a class in middlewares.py?
I will be grateful for any help, with examples.
it is necessary that this rule applies to all new spiders, generating a list from a file in start_requests (for example start_urls = [l.strip() for an open string('urls.txt ').readline()]
) is not convenient...