
This spider is supposed to loop through https://lihkg.com/thread/`2479991 - i*10`/page/1, but for some reason it skips pages in the loop.

I looked through the item scraped in Scrapy Cloud, the items with the following urls were scraped:

...
Item 10: https://lihkg.com/thread/2479941/page/1
Item 11: https://lihkg.com/thread/2479981/page/1
Item 12: https://lihkg.com/thread/2479971/page/1
Item 13: https://lihkg.com/thread/2479931/page/1
Item 14: https://lihkg.com/thread/2479751/page/1
Item 15: https://lihkg.com/thread/2479991/page/1
Item 16: https://lihkg.com/thread/1504771/page/1
Item 17: https://lihkg.com/thread/1184871/page/1
Item 18: https://lihkg.com/thread/1115901/page/1
Item 19: https://lihkg.com/thread/1062181/page/1
Item 20: https://lihkg.com/thread/1015801/page/1
Item 21: https://lihkg.com/thread/955001/page/1
Item 22: https://lihkg.com/thread/955011/page/1
Item 23: https://lihkg.com/thread/955021/page/1
Item 24: https://lihkg.com/thread/955041/page/1
...

About a million thread IDs were skipped.
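For reference, the loop itself produces an unbroken every-10th sequence from 2479991 down to 955001, so the gaps in the scraped items must come from somewhere else (the dupe filter, DeltaFetch, or failed requests), not from the ID arithmetic. A quick sanity check, using the same `range(152500)` and start ID as the spider:

```python
# Reproduce the thread IDs the spider's loop generates.
ids = [2479991 - i * 10 for i in range(152500)]

print(len(ids))   # 152500 requests in total
print(ids[0])     # 2479991 -- the first ID requested
print(ids[-1])    # 955001  -- the last ID, matching Item 21 above
```

Every "missing" ID such as 1504771 or 1184871 is in this list, which confirms the requests were generated but never produced items.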

Here's the code:

from lihkg.items import LihkgItem
import scrapy
import time
from scrapy_splash import SplashRequest

class LihkgSpider13(scrapy.Spider):
    name = 'lihkg1-950000'
    http_user = '(my splash api key here)'
    allowed_domains = ['lihkg.com']
    start_urls = ['https://lihkg.com/']

    script1 = """
                function main(splash, args)
                splash.images_enabled = false
                assert (splash:go(args.url))
                assert (splash:wait(2))
                return {
                    html = splash: html(),
                    png = splash:png(),
                    har = splash:har(),
                }
                end
              """

    def parse(self, response):
        for i in range(152500):
            time.sleep(0)
            url = "https://lihkg.com/thread/" + str(2479991 - i*10) + "/page/1"
            yield SplashRequest(url=url, callback=self.parse_article, endpoint='execute',
                                args={
                                    'html': 1,
                                    'lua_source': self.script1,
                                    'wait': 2,
                                })

    def parse_article(self, response):
        item = LihkgItem()
        item['author'] = response.xpath('//*[@id="1"]/div/small/span[2]/a/text()').get()
        item['time'] = response.xpath('//*[@id="1"]/div/small/span[4]/@data-tip').get()
        item['texts'] = response.xpath('//*[@id="1"]/div/div[1]/div/text()').getall()
        item['images'] = response.xpath('//*[@id="1"]/div/div[1]/div/a/@href').getall()
        item['emoji'] = response.xpath('//*[@id="1"]/div/div[1]/div/img/@src').getall()
        item['title'] = response.xpath('//*[@id="app"]/nav/div[2]/div[1]/span/text()').get()
        item['likes'] = response.xpath('//*[@id="1"]/div/div[2]/div/div[1]/div/div[1]/label/text()').get()
        item['dislikes'] = response.xpath('//*[@id="1"]/div/div[2]/div/div[1]/div/div[2]/label/text()').get()
        item['category'] = response.xpath('//*[@id="app"]/nav/div[1]/div[2]/div/span/text()').get()
        item['url'] = response.url

        yield item

I have enabled Crawlera, DeltaFetch and DotScrapy Persistence in the project.
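Note that DeltaFetch persists request fingerprints across runs (via DotScrapy Persistence), so any thread page seen in an earlier run is silently skipped on later ones, which could account for the gaps. A sketch of `settings.py` additions to surface or reset that behaviour; `DUPEFILTER_DEBUG` and `LOG_LEVEL` are standard Scrapy settings, and `DELTAFETCH_ENABLED`/`DELTAFETCH_RESET` come from scrapy-deltafetch:

```python
# settings.py (sketch) -- make silently skipped requests visible
DUPEFILTER_DEBUG = True    # log every request dropped by the duplicate filter
LOG_LEVEL = 'DEBUG'

DELTAFETCH_ENABLED = True  # scrapy-deltafetch
DELTAFETCH_RESET = True    # wipe the DeltaFetch DB so previous runs don't suppress pages
```

With `DUPEFILTER_DEBUG` on, the log will show whether the "missing" thread URLs are being filtered rather than failing.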

shingseto
  • Welcome to StackOverflow! What is your problem exactly? `i * 10` means it will only get every 10th thread, is this what you want? You might also be rate-limited. I'm getting a `403 Forbidden` running your code with `scrapy runspider`. – xjcl Apr 10 '21 at 12:31
  • Thanks for the reply. Yes, I would like to skip 9 threads and get only every 10th thread. The problem is that not every 10th thread was scraped. From the item list I posted above you can see that between item 15 and 21, about a million thread IDs were skipped, so threads with `url=https://lihkg.com/thread/1xxxxxx/page/1/` were not successfully scraped. I guess the `403 Forbidden` comes from the Splash setting: I used Splash in this project and the `api key` needs to be defined in the spider. – shingseto Apr 10 '21 at 14:31
  • Okay, can I easily get an API key? I'd also try using `time.sleep(1)` instead of `0` and check the results. – xjcl Apr 10 '21 at 14:37
  • You would have to add `SPLASH_URL = 'https://6m8olshj-splash.scrapinghub.com'` in settings.py and the api key is `9bd030a7050e45199f98529b10f13589` – shingseto Apr 10 '21 at 14:41
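On the `time.sleep` suggestion from the comments: sleeping inside `parse()` blocks Scrapy's event loop rather than spacing out requests. The usual alternative is throttling in `settings.py` (these are standard Scrapy settings; the values here are only illustrative):

```python
# settings.py (sketch) -- throttle requests instead of time.sleep() in parse()
DOWNLOAD_DELAY = 1              # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True     # adapt the delay to server response times
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
```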

0 Answers