
I'd like to scrape a movie forum, which has a structure like

Page 1
 Thread 1 in Page 1
 Thread 2 in Page 1
 ...
Page 2
 Thread 1 in Page 2
 Thread 2 in Page 2
 ...

The pages and threads have very different HTML, so I have written separate XPath expressions to extract the information I need from pages and from threads.

In the parse() method of my spider, I used an example from the documentation to go through each page:

page_links = ['page_1', 'page_2', ...]

for page_link in page_links:
    if page_link is not None:
        page_link = response.urljoin(page_link)
        yield scrapy.Request(page_link, callback=self.parse)

This way I can get the URL of every thread on every page.

I suppose the next thing I should do is get the response for each thread and run a function to parse those responses. But since I'm new to OOP, I'm quite confused about how to do that.

I have a list thread_links that stores the URLs of threads, and I'm trying to do something like:

thread_links = ['thread_1', 'thread_2', ...]

for thread_link in thread_links:
    yield scrapy.Request(thread_link)

but how can I pass these responses to a function like parse_thread(self, response)?


Update: Here is my code:

# -*- coding: utf-8 -*-
import scrapy


class ShtSpider(scrapy.Spider):
    name = 'sht'
    allowed_domains = ['AAABBBCCC.com']
    start_urls = [
        'https://AAABBBCCC/forum-103-1.html',
                  ]
    thread_links = []

    def parse(self, response):
        temp = response.selector.xpath("//div[@class='pg']/a[@class='last']/@href").get()
        total_num_pages = int(temp.split('.')[0].split('-')[-1])

        for page_i in range(total_num_pages):
            page_link = temp.split('.')[0].rsplit('-', 1)[0] + '-' + str(page_i) + '.html'

            if page_link is not None:
                page_link = response.urljoin(page_link)
                print(page_link)
                yield scrapy.Request(page_link, callback=self.parse)

            self.thread_links.extend(response.selector.
                                     xpath("//tbody[contains(@id,'normalthread')]//td[@class='icn']//a/@href").getall())
            for thread_link in self.thread_links:
                thread_link = response.urljoin(thread_link)
                print(thread_link)
                yield scrapy.Request(url=thread_link, callback=self.parse_thread)

    def parse_thread(self, response):
        def extract_thread_data(xpath_expression):
            return response.selector.xpath(xpath_expression).getall()

        yield {
            'movie_number_and_title': extract_thread_data("//span[@id='thread_subject']/text()"),
            'movie_pics_links': extract_thread_data("//td[@class='t_f']//img/@file"),
            'magnet_link': extract_thread_data("//div[@class='blockcode']/div//li/text()"),
            'torrent_link': extract_thread_data("//p[@class='attnm']/a/@href"),
            'torrent_name': extract_thread_data("//p[@class='attnm']/a/text()"),
        }

I'm using print() to check page_link and thread_link, and they seem to be working well: URLs to all pages and threads show up correctly. But the program stopped after crawling only one page. Here is the information from the console:

2020-07-18 10:54:30 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2020-07-18 10:54:30 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-18 10:54:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 690,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 17304,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 7, 18, 2, 54, 30, 777513),
 'log_count/DEBUG': 3,
 'log_count/INFO': 10,
 'memusage/max': 53985280,
 'memusage/startup': 48422912,
 'offsite/domains': 1,
 'offsite/filtered': 919087,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2020, 7, 18, 2, 52, 54, 509604)}
2020-07-18 10:54:30 [scrapy.core.engine] INFO: Spider closed (finished)


Update: Here is the example from the documentation:

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)

If I understand it correctly, it will log that it has visited the URL http://www.example.com/some_page.html.

Here is my spider. I have just created a project named SMZDM and generated the spider with scrapy genspider smzdm https://www.smzdm.com:

# -*- coding: utf-8 -*-
import scrapy


class SmzdmSpider(scrapy.Spider):
    name = 'smzdm'
    allowed_domains = ['https://www.smzdm.com']
    start_urls = ['https://www.smzdm.com/fenlei/diannaozhengji/']

    def parse(self, response):
        return scrapy.Request("https://www.smzdm.com/fenlei/diannaozhengji/",
                              callback=self.parse_page)

    def parse_page(self, response):
        self.logger.info("Visited %s", response.url)
        print(f'Crawled {response.url}')

I have hardcoded https://www.smzdm.com/fenlei/diannaozhengji/ in the parse method and just want to get it working.

But when I run it using scrapy crawl smzdm, nothing shows up in the terminal. It seems the parse_page method is never executed.

(Crawler) zheng@Macbook_Pro spiders % scrapy crawl smzdm
2020-07-29 17:10:03 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: SMZDM)
2020-07-29 17:10:03 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.6 (default, Jan  8 2020, 13:42:34) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Darwin-19.6.0-x86_64-i386-64bit
2020-07-29 17:10:03 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'SMZDM', 'NEWSPIDER_MODULE': 'SMZDM.spiders', 'SPIDER_MODULES': ['SMZDM.spiders']}
2020-07-29 17:10:03 [scrapy.extensions.telnet] INFO: Telnet Password: e3b9631aa810732d
2020-07-29 17:10:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-07-29 17:10:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-29 17:10:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-29 17:10:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-29 17:10:03 [scrapy.core.engine] INFO: Spider opened
2020-07-29 17:10:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-29 17:10:03 [py.warnings] WARNING: /Applications/anaconda3/envs/Crawler/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py:61: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://www.smzdm.com in allowed_domains.
  warnings.warn(message, URLWarning)

2020-07-29 17:10:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-29 17:10:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.smzdm.com/fenlei/diannaozhengji/> (referer: None)
2020-07-29 17:10:03 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.smzdm.com': <GET https://www.smzdm.com/fenlei/diannaozhengji/>
2020-07-29 17:10:03 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-29 17:10:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 321,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 40270,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 7, 29, 9, 10, 3, 836061),
 'log_count/DEBUG': 2,
 'log_count/INFO': 9,
 'log_count/WARNING': 1,
 'memusage/max': 48422912,
 'memusage/startup': 48422912,
 'offsite/domains': 1,
 'offsite/filtered': 1,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 7, 29, 9, 10, 3, 381441)}
2020-07-29 17:10:03 [scrapy.core.engine] INFO: Spider closed (finished)

Zheng

2 Answers


A full code example would help me point you towards the best way to achieve what you want, but it sounds like you're on the right track.

I think you just need to add a callback to a parse_thread function.

Coding Example

thread_links = ['thread1', 'thread2']

def parse(self, response):
    # thread_links is the class variable defined above, so access it via self
    for thread_link in self.thread_links:
        yield scrapy.Request(url=thread_link, callback=self.parse_thread)

def parse_thread(self, response):
    print(response.text)

Explanation

Here we're taking the links from the thread_links list. NOTE you have to use self.thread_links; that's because you're defining the thread_links list OUTSIDE the function. It's what's called a class variable, and it needs to be accessed inside the function as self.VARIABLE.

We then add a callback to parse_thread; again, note how we're using self.parse_thread here. Scrapy makes the request and delivers the response to the parse_thread function. Here I've just printed that response out.

Updated Code

Since you've provided some code, here's where I think you may be going wrong, assuming you've checked that the page and thread links are printing out fine.

def parse_thread(self, response):
    def extract_thread_data(xpath_expression):
        return response.selector.xpath(xpath_expression).getall()

Change this to

def parse_thread(self, response):
    # call response.xpath(...) directly instead of going through a nested helper
    yield {'movie_number_and_title': response.xpath("//span[@id='thread_subject']/text()").getall()}

I'm not sure because I can't test the code, but a nested function is probably going to cause Scrapy a bit of trouble.
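
For reference, a version of parse_thread with the nested helper removed, reusing the XPath expressions from your question, might look like this (I can't run it against the site, so treat it as a sketch):

def parse_thread(self, response):
    # each field calls response.xpath(...) directly and collects all matches
    yield {
        'movie_number_and_title': response.xpath("//span[@id='thread_subject']/text()").getall(),
        'movie_pics_links': response.xpath("//td[@class='t_f']//img/@file").getall(),
        'magnet_link': response.xpath("//div[@class='blockcode']/div//li/text()").getall(),
        'torrent_link': response.xpath("//p[@class='attnm']/a/@href").getall(),
        'torrent_name': response.xpath("//p[@class='attnm']/a/text()").getall(),
    }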

Tips/Clarifications and Suggestions to Code

  1. No need to write response.selector.xpath(...); response.xpath(...) is fine.
  2. When you set a callback from your parse callback to parse_thread, the response is the result of the Scrapy request you made there. No need to wrap this in a nested function.
  3. Make sure you understand the distinction between yield and return. Generally speaking, if there's lots of data, a yield statement is much more memory-efficient than building and returning a full list. See here for an extensive write up, and the short sketch below.
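
As a rough illustration (a toy sketch, not tied to your spider; the //tr selector is just a placeholder): a return builds the full list in memory before anything is handed back, while a yield hands items over one at a time, which is how Scrapy consumes callback results.

def parse_with_return(self, response):
    # collects every item in memory first, then hands the whole list back
    items = []
    for row in response.xpath("//tr"):
        items.append({'text': row.xpath("string(.)").get()})
    return items

def parse_with_yield(self, response):
    # hands each item to Scrapy as soon as it is built
    for row in response.xpath("//tr"):
        yield {'text': row.xpath("string(.)").get()}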

Update 2

You need to include these in your settings.py

settings.py

USER_AGENT = 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36'


ROBOTSTXT_OBEY = False

Explanation

The site is quite anti-scraping: its robots.txt says it doesn't want you to scrape these pages. To work around this we set ROBOTSTXT_OBEY = False.

In addition, you haven't defined a user agent for the HTTP requests Scrapy sends; it could be almost any browser user agent, and I've given an example of one that worked for me. Without it, the site detects that the request isn't coming from a browser, and Scrapy doesn't get a usable response for the URL.
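
If you prefer not to touch the project-wide settings.py, the same overrides can also be set per spider through the custom_settings class attribute. A sketch using the spider from your question (the user-agent string is just the example above):

class ShtSpider(scrapy.Spider):
    name = 'sht'
    # per-spider overrides applied on top of settings.py
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'USER_AGENT': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
    }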

AaronS
  • Thanks for your answer. I've followed your answer and read more documentation, but something is still not right. I'll post my code in my question. :) – Zheng Jul 18 '20 at 03:02
  • I've updated my response now you've provided some code. – AaronS Jul 18 '20 at 03:23
  • Hi, after reading more scrapy documentation, I think I can understand your code now. But still I'm not able to get the code working. I grabbed a simple example from the documentation.`https://docs.scrapy.org/en/latest/topics/request-response.html`, and I'll post the simple code in my questions. – Zheng Jul 29 '20 at 08:59
  • I've provided a second update, you need to change the settings in your settings.py to the above. Could you confirm and tick my answer as the accepted one. Also if there are further questions on this issue, could you please make a new question rather than use this thread. Thanks. – AaronS Jul 29 '20 at 11:12

I have finished my program now and I'd like to summarize two useful tips:

1. Try to comment out allowed_domains when debugging;

2. I'm not sure why, but using scrapy.Request has been problematic for me; when following links, just use response.follow (see the sketch below).
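
For example, a minimal version of the thread loop using response.follow, which resolves relative URLs itself so urljoin isn't needed (a sketch based on the selectors from my question):

def parse(self, response):
    # response.follow handles relative hrefs, so no response.urljoin call is needed
    thread_hrefs = response.xpath("//tbody[contains(@id,'normalthread')]//td[@class='icn']//a/@href").getall()
    for href in thread_hrefs:
        yield response.follow(href, callback=self.parse_thread)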

Zheng