I found some answers on the topic of how to extract all available links from any website, and all of them were about the scrapy module. I also copied one of the code examples:

from scrapy import Spider
from scrapy.linkextractors import LinkExtractor

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://webpage.com']

    def parse(self, response):
        # extract every link from the downloaded page and print it
        le = LinkExtractor()
        for link in le.extract_links(response):
            print(link)

But I need to launch it and get a simple Python list of all the HTML pages, so I can pull some information from them using urllib2 and bs4. How do I launch this class correctly to get this list?

Pavel Pereverzev

1 Answer

Scrapy is a great tool to scrape websites, but it is more than just the snippet you posted. What you posted is a spider definition. If embedded in a Scrapy project, you can run this spider, e.g. in your terminal, with scrapy crawl myspider.
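
If you do not want a full project, here is a minimal sketch of running the same spider from a plain script with Scrapy's CrawlerProcess, collecting the URLs into the ordinary Python list you asked about (the list name collected_links is my own, not part of Scrapy):

from scrapy import Spider
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor

collected_links = []  # plain Python list, filled in during the crawl

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://webpage.com']

    def parse(self, response):
        # append each extracted URL instead of printing it
        for link in LinkExtractor().extract_links(response):
            collected_links.append(link.url)

process = CrawlerProcess(settings={'LOG_LEVEL': 'WARNING'})
process.crawl(MySpider)
process.start()  # blocks until the crawl is finished

print(collected_links)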

Then your spider will visit http://webpage.com and extract all of its links; to follow them recursively, you would also have to yield new requests for them (see the comments below). Each URL will be printed out, but that's all. In order to store those links you can create so-called items, which can then be exported by a defined item pipeline. The whole thing is too complex to post in a single answer. The bottom line is: yes, Scrapy is a strong tool you can use for link extraction, and the best place to start is the Scrapy tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html
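
As a lighter-weight alternative to a full item pipeline (my suggestion, not the only way), you can yield plain dicts from parse and let Scrapy's built-in feed exports write them to a file via scrapy crawl myspider -o links.json:

from scrapy import Spider
from scrapy.linkextractors import LinkExtractor

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://webpage.com']

    def parse(self, response):
        # each yielded dict becomes one record in the exported feed
        for link in LinkExtractor().extract_links(response):
            yield {'url': link.url}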

Luckily, the Scrapy documentation is great :)

Raphael
  • Thanks for the information. I ran it using Anaconda, but in general I see that it crawls only within the main page of a website. To be clear: for example, I want to get a list of ALL question pages on Stack Overflow. Leaving aside how long that would take, is it possible in `scrapy`? – Pavel Pereverzev Jun 19 '19 at 08:29
  • Absolutely yes. That's the common use case. Take a look at the section **Following links** on the tutorial page. The crucial point is to create new requests from the links extracted within your parse method; see the sketch after these comments. – Raphael Jun 19 '19 at 14:12
  • Well, I already made something, but it is still not what I want. I opened a new question where I described my issue in detail: https://stackoverflow.com/questions/56663789/how-to-get-all-pages-from-the-whole-website-using-python – Pavel Pereverzev Jun 19 '19 at 14:16
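
A minimal sketch of that crucial point (the spider name and allowed_domains value are placeholders of mine): yield a new request for every extracted link, reusing parse as the callback, so the crawl spreads across the whole site. Scrapy's built-in duplicate filter keeps it from revisiting pages.

from scrapy import Spider
from scrapy.linkextractors import LinkExtractor

class CrawlAllSpider(Spider):
    name = 'crawlall'
    allowed_domains = ['webpage.com']  # stay on one site
    start_urls = ['http://webpage.com']

    def parse(self, response):
        for link in LinkExtractor().extract_links(response):
            # record the link...
            yield {'url': link.url}
            # ...and follow it, parsing the target page the same way
            yield response.follow(link.url, callback=self.parse)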