I'd like to scrape a movie forum, which has a structure like:
Page 1
    Thread 1 in Page 1
    Thread 2 in Page 1
    ...
Page 2
    Thread 1 in Page 2
    Thread 2 in Page 2
    ...
The pages and threads have very different HTML, so I have written separate XPath expressions to extract the information I need from pages and from threads.
In the parse() method of my spider, I used an example from the documentation to go through each page:
page_links = ['page_1', 'page_2', ...]
for page_link in page_links:
    if page_link is not None:
        page_link = response.urljoin(page_link)
        yield scrapy.Request(page_link, callback=self.parse)
So I can get the URL of every thread on every page. I suppose the next step is to get the response of each thread and run a function that parses those responses, but since I'm new to OOP, I'm quite confused about how to do that.
I have a list thread_links that stores the URLs of the threads, and I'm trying to do something like:
thread_links = ['thread_1', 'thread_2', ...]
for thread_link in thread_links:
    yield scrapy.Request(thread_link)
but how can I pass these responses to a function like parse_thread(self, response)?
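From what I can tell from the documentation, the callback argument of scrapy.Request is the mechanism for this. Here is a minimal sketch of what I think it means (the spider name, domain, and XPath below are placeholders, not my real ones):

import scrapy

class ForumSpider(scrapy.Spider):
    name = 'forum'
    start_urls = ['https://example.com/page_1']

    def parse(self, response):
        # collect the thread URLs on this page, then request each one,
        # telling Scrapy to hand the downloaded response to parse_thread
        for thread_link in response.xpath("//a/@href").getall():
            yield scrapy.Request(response.urljoin(thread_link),
                                 callback=self.parse_thread)

    def parse_thread(self, response):
        # called once per thread, with that thread's page as `response`
        self.logger.info("Visited %s", response.url)

Is this the right way to wire in a second parsing method?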
Update: Here is my code:
# -*- coding: utf-8 -*-
import scrapy


class ShtSpider(scrapy.Spider):
    name = 'sht'
    allowed_domains = ['AAABBBCCC.com']
    start_urls = [
        'https://AAABBBCCC/forum-103-1.html',
    ]
    thread_links = []

    def parse(self, response):
        # the 'last page' link, e.g. 'forum-103-50.html', encodes the total page count
        temp = response.selector.xpath("//div[@class='pg']/a[@class='last']/@href").get()
        total_num_pages = int(temp.split('.')[0].split('-')[-1])
        # NOTE: range() starts at 0, so this also builds a '...-0.html' link
        for page_i in range(total_num_pages):
            page_link = temp.split('.')[0].rsplit('-', 1)[0] + '-' + str(page_i) + '.html'
            if page_link is not None:
                page_link = response.urljoin(page_link)
                print(page_link)
                yield scrapy.Request(page_link, callback=self.parse)
        # collect the thread links on the current page
        self.thread_links.extend(response.selector.xpath(
            "//tbody[contains(@id,'normalthread')]//td[@class='icn']//a/@href").getall())
        for thread_link in self.thread_links:
            thread_link = response.urljoin(thread_link)
            print(thread_link)
            yield scrapy.Request(url=thread_link, callback=self.parse_thread)

    def parse_thread(self, response):
        def extract_thread_data(xpath_expression):
            return response.selector.xpath(xpath_expression).getall()

        yield {
            'movie_number_and_title': extract_thread_data("//span[@id='thread_subject']/text()"),
            'movie_pics_links': extract_thread_data("//td[@class='t_f']//img/@file"),
            'magnet_link': extract_thread_data("//div[@class='blockcode']/div//li/text()"),
            'torrent_link': extract_thread_data("//p[@class='attnm']/a/@href"),
            'torrent_name': extract_thread_data("//p[@class='attnm']/a/text()"),
        }
I'm using print() to check page_link and thread_link, and they seem to work: the URLs of all pages and threads show up correctly. But the program stopped after crawling only one page. Here is the information from the console:
2020-07-18 10:54:30 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2020-07-18 10:54:30 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-18 10:54:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 690,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 17304,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 18, 2, 54, 30, 777513),
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'memusage/max': 53985280,
'memusage/startup': 48422912,
'offsite/domains': 1,
'offsite/filtered': 919087,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 7, 18, 2, 52, 54, 509604)}
2020-07-18 10:54:30 [scrapy.core.engine] INFO: Spider closed (finished)
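If I read the stats right, 'offsite/filtered': 919087 means the OffsiteMiddleware dropped almost every request I yielded. My understanding (which may be wrong) is that the host of each requested URL must fall under one of the entries in allowed_domains. A small snippet mirroring that check, using the anonymized values from my code above:

from urllib.parse import urlparse

allowed = 'AAABBBCCC.com'
host = urlparse('https://AAABBBCCC/forum-103-2.html').hostname
# my understanding of the offsite check: exact match or a subdomain of it
print(host == allowed or host.endswith('.' + allowed))  # False -> filtered as offsite

Could that mismatch between allowed_domains and the request hosts be why everything after the first page was filtered?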
Update: Here is the example from the documentation:
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
If I understand it correctly, it will log that it has visited the URL http://www.example.com/some_page.html.
Here is my spider. I have just created a project named SMZDM and generated a spider with scrapy genspider smzdm https://www.smzdm.com:
# -*- coding: utf-8 -*-
import scrapy


class SmzdmSpider(scrapy.Spider):
    name = 'smzdm'
    allowed_domains = ['https://www.smzdm.com']
    start_urls = ['https://www.smzdm.com/fenlei/diannaozhengji/']

    def parse(self, response):
        return scrapy.Request("https://www.smzdm.com/fenlei/diannaozhengji/",
                              callback=self.parse_page)

    def parse_page(self, response):
        self.logger.info("Visited %s", response.url)
        print(f'Crawled {response.url}')
I have hardcoded https://www.smzdm.com/fenlei/diannaozhengji/ in the parse method just to get something working. But when I run scrapy crawl smzdm, nothing shows up in the terminal; it seems the parse_page method is never executed.
(Crawler) zheng@Macbook_Pro spiders % scrapy crawl smzdm
2020-07-29 17:10:03 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: SMZDM)
2020-07-29 17:10:03 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.6 (default, Jan 8 2020, 13:42:34) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Darwin-19.6.0-x86_64-i386-64bit
2020-07-29 17:10:03 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'SMZDM', 'NEWSPIDER_MODULE': 'SMZDM.spiders', 'SPIDER_MODULES': ['SMZDM.spiders']}
2020-07-29 17:10:03 [scrapy.extensions.telnet] INFO: Telnet Password: e3b9631aa810732d
2020-07-29 17:10:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-07-29 17:10:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-29 17:10:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-29 17:10:03 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-29 17:10:03 [scrapy.core.engine] INFO: Spider opened
2020-07-29 17:10:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-29 17:10:03 [py.warnings] WARNING: /Applications/anaconda3/envs/Crawler/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py:61: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://www.smzdm.com in allowed_domains.
warnings.warn(message, URLWarning)
2020-07-29 17:10:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-29 17:10:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.smzdm.com/fenlei/diannaozhengji/> (referer: None)
2020-07-29 17:10:03 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.smzdm.com': <GET https://www.smzdm.com/fenlei/diannaozhengji/>
2020-07-29 17:10:03 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-29 17:10:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 321,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 40270,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 29, 9, 10, 3, 836061),
'log_count/DEBUG': 2,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'memusage/max': 48422912,
'memusage/startup': 48422912,
'offsite/domains': 1,
'offsite/filtered': 1,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 7, 29, 9, 10, 3, 381441)}
2020-07-29 17:10:03 [scrapy.core.engine] INFO: Spider closed (finished)
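Reading the URLWarning above ("allowed_domains accepts only domains, not URLs") together with the "Filtered offsite request" line, my guess is that allowed_domains should hold a bare domain, and that the second request would then need dont_filter=True, since it targets the same URL as start_urls and the duplicate filter would otherwise drop it. A sketch of that guess, untested:

# -*- coding: utf-8 -*-
import scrapy


class SmzdmSpider(scrapy.Spider):
    name = 'smzdm'
    allowed_domains = ['smzdm.com']  # bare domain, no scheme, per the warning
    start_urls = ['https://www.smzdm.com/fenlei/diannaozhengji/']

    def parse(self, response):
        # same URL as the start request, so skip the duplicate filter
        return scrapy.Request("https://www.smzdm.com/fenlei/diannaozhengji/",
                              callback=self.parse_page,
                              dont_filter=True)

    def parse_page(self, response):
        self.logger.info("Visited %s", response.url)

Does that look like the right reading of the log?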