Scrape links from a list in Scrapy OR create a loop?

Question

I want to scrape race results from this website: https://www.racingpost.com/results

I already have a crawler that scrapes and follows the links on the results page, but I cannot go further back than the six or seven days that are displayed on the site. The older results are available via the "resultsfinder", which sadly relies on JavaScript, as do other sources of older races such as the horses' form pages.

I already tried to learn how to scrape JavaScript-rendered pages to get the links, and while it is very interesting, I wonder whether there is an easier way, since the result page addresses are designed in a very convenient way:

It is simply https://www.racingpost.com/results/ followed by a date such as 1990-02-08 or 2021-02-11.

So I thought it might be easier to have the spider take its links from a loop or from a predefined list of links.

How could I design a loop in Scrapy that runs from 1990-01-01 up to today, or is it better to create a predefined list of links for this?

Toivo Mattila · Accepted Answer · 2021-02-16T12:51:35.800

Generate the dates in the spider and append them to the base URL; there is no need to create a predefined list of links.

from datetime import date, timedelta

# Initialize variables
start_date = date(1990, 1, 1)
end_date = date.today()
crawl_date = start_date
base_url = "https://www.racingpost.com/results/"
links = []
# Generate the links
while crawl_date <= end_date:
    links.append(base_url + str(crawl_date))
    crawl_date += timedelta(days=1)

Then loop through the generated list. Alternatively, skip the list and issue the request for each date directly inside the while loop.

Example results:

>>> links
[
    "https://www.racingpost.com/results/1990-01-01",
    "https://www.racingpost.com/results/1990-01-02",
    "https://www.racingpost.com/results/1990-01-03",
    "https://www.racingpost.com/results/1990-01-04",
    "https://www.racingpost.com/results/1990-01-05",
    ...
]
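
If holding roughly thirty years of URLs in a list feels wasteful, the same loop can be written as a generator that produces one URL at a time. The following is only a minimal sketch: the names `iter_result_urls` and `BASE_URL` are made up for illustration and are not part of Scrapy or of the code above.

```python
from datetime import date, timedelta
from typing import Iterator

# Assumed base URL, taken from the question
BASE_URL = "https://www.racingpost.com/results/"


def iter_result_urls(start: date, end: date) -> Iterator[str]:
    """Yield one results URL per day from start to end, inclusive."""
    current = start
    while current <= end:
        # isoformat() produces YYYY-MM-DD, the same text as str(current)
        yield BASE_URL + current.isoformat()
        current += timedelta(days=1)


# Small demonstration range; a real crawl would end at date.today()
urls = list(iter_result_urls(date(1990, 1, 1), date(1990, 1, 5)))
print(urls[0])
print(urls[-1])
print(len(urls))
```

In a Scrapy spider, one way to use this would be to iterate over `iter_result_urls(...)` inside the spider's `start_requests` method and yield `scrapy.Request(url, callback=self.parse)` for each URL, so that requests are created lazily instead of building the whole list up front.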
