
I am trying to make a tool that should get every link from a website. For example, I need to get all question pages from Stack Overflow. I tried using scrapy:

from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['https://stackoverflow.com/questions/']

    def parse(self, response):
        # extract and print every link found on the current page
        le = LinkExtractor()
        for link in le.extract_links(response):
            url_lnk = link.url
            print(url_lnk)

Here I get only the questions from the start page. What do I need to do to get all the 'question' links? Time doesn't matter, I just need to understand what to do.

UPD

The site which I want to observe is https://sevastopol.su/ - this is a local city news website.

The list of all news should be contained here: https://sevastopol.su/all-news

At the bottom of this page you can see page numbers, but if we go to the last page of news we will see that it has number 765 (right now, 19.06.2019), while the oldest item it shows is dated 19 June 2018. So the pagination only reaches back one year. But there are plenty of older news links that are still alive (probably going back to 2010) and can even be found through the site's search page. That is why I wanted to know whether there is access to some global link store of this site.

Pavel Pereverzev
  • Once you know how to dig out the earliest links manually, update your question to include those steps so that we can take care of them automatically using scrapy. – SIM Jun 19 '19 at 09:51
  • @SIM previously I could loop through them using page numbers. If I put in a number larger than the last page, it does not show earlier pages. – Pavel Pereverzev Jun 19 '19 at 09:59
  • If you check out this link `https://sevastopol.su/all-news?page=765` and this one `https://sevastopol.su/all-news?page=1000`, you will notice that they both contain the same thing. It turns out that any page number bigger than 765 is just a placeholder and will automatically take you to page 765 (a quick check of this is sketched after these comments). Hope this helps. – SIM Jun 19 '19 at 10:57
  • @SIM I actually mentioned it in an update of my question. The whole news list should contain more than 6000 pages of news. – Pavel Pereverzev Jun 19 '19 at 11:24
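A quick way to check that observation, assuming the requests library (whether the site actually redirects or simply serves the same page is not verified here):

import requests

# fetch the last known page and a page number far beyond it
a = requests.get("https://sevastopol.su/all-news", params={"page": 765})
b = requests.get("https://sevastopol.su/all-news", params={"page": 1000})

# if the observation holds, both requests end up on the same content
# (dynamic parts of the page may still make the texts differ slightly)
print(a.url)
print(b.url)
print(a.text == b.text)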

3 Answers


This is something you might want to do to get all the links to the different questions asked. However, I think your script might hit a 404 error somewhere during execution, as there are millions of links to parse.

Run the script just the way it is:

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ["https://stackoverflow.com/questions/"]

    def parse(self, response):
        # yield every question link found on the current listing page
        for link in response.css('.summary .question-hyperlink::attr(href)').getall():
            post_link = response.urljoin(link)
            yield {"link": post_link}

        # follow the "next" pagination link, if there is one
        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url, callback=self.parse)
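If the snippet above is saved as a standalone file (the filename here is just an example), it can be run with scrapy's runspider command and the scraped links written to a feed:

scrapy runspider stackoverflow_spider.py -o question_links.json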
SIM

You should write a regular expression (or a similar search function) that looks for <a> tags with a specific class (in the case of SO: class="question-hyperlink") and takes the href attribute from those elements. This will fetch all the links from the current page.

Then you can also search for the page links (at the bottom). Here you see that those links are /questions?sort=active&page=<pagenumber>, where you can replace <pagenumber> with the page you want to scrape (e.g. make a loop that starts at 1 and goes on until you get a 404 error).
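A rough sketch of that idea using requests and a regular expression (the question-hyperlink class comes from this answer; the 404 stopping condition and the markup details are assumptions and untested):

import re
import requests
from urllib.parse import urljoin

page = 1
while True:
    resp = requests.get(
        "https://stackoverflow.com/questions",
        params={"sort": "active", "page": page},
    )
    if resp.status_code == 404:  # stop once the server reports a missing page
        break
    # look at every <a> tag and keep those carrying the question-hyperlink class
    for tag in re.findall(r"<a [^>]*>", resp.text):
        if 'class="question-hyperlink"' in tag:
            href = re.search(r'href="([^"]+)"', tag)
            if href:
                print(urljoin("https://stackoverflow.com/", href.group(1)))
    page += 1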

wohe1
  • Actually I already used page numbers on the other site. I needed to get news and comments there, but the last page number contained only year-old news. However, there are still older news items that can be found there using search. – Pavel Pereverzev Jun 19 '19 at 09:18
  • then maybe this can help? https://stackoverflow.com/questions/29433422/how-to-get-a-list-of-questions-from-stackoverflow-api-based-on-search-query – wohe1 Jun 19 '19 at 09:21
  • It could help if the site that I wanted to observe had an API. – Pavel Pereverzev Jun 19 '19 at 09:49

Your spider, which now yields requests to crawl subsequent pages:

from scrapy.spiders import CrawlSpider
from scrapy import Request
from urllib.parse import urljoin

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['https://sevastopol.su/all-news']

    def parse(self, response):
        # This method is called for every successfully crawled page

        # get all pagination links using xpath
        for link in response.xpath("//li[contains(@class, 'pager-item')]/a/@href").getall():
            # build the absolute url 
            url = urljoin('https://sevastopol.su/', link)
            print(url)
            yield Request(url=url, callback=self.parse)  # <-- this makes your spider recursively crawl subsequent pages

Note that you don't have to worry about requesting the same url multiple times; duplicates are dropped by scrapy (default settings).

Next steps:
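One way those next steps might look, purely as a sketch: extract the news links themselves from each crawled page as well. The article xpath below is an assumption about the site's markup and would need to be adjusted:

from scrapy.spiders import CrawlSpider
from scrapy import Request
from urllib.parse import urljoin

class MyNewsSpider(CrawlSpider):
    name = 'mynewsspider'
    start_urls = ['https://sevastopol.su/all-news']

    def parse(self, response):
        # follow every pagination link, exactly as in the answer above
        for link in response.xpath("//li[contains(@class, 'pager-item')]/a/@href").getall():
            yield Request(url=urljoin('https://sevastopol.su/', link), callback=self.parse)

        # additionally yield each link found in the news listing of the current page;
        # this xpath is a placeholder and must be matched to the actual markup
        for article in response.xpath("//div[contains(@class, 'view-content')]//a/@href").getall():
            yield {"link": urljoin('https://sevastopol.su/', article)}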

Raphael