I'm trying to scrape the results from the website https://howlongtobeat.com/#search. However, when I scrape, I only get the first 6 of the 20 results.

My code:

import scrapy

# `response` is the Scrapy response for the search page
cards = response.css('div[class="search_list_details"]')

for card in cards:
    game_name = card.css('a[class=text_white]::attr(title)').get()
    print(game_name)

Output:

'Elden Ring'
'Cyberpunk 2077'
'Kirby and the Forgotten Land'
'LEGO Star Wars The Skywalker Saga'
'Tomb Raider'
'Hollow Knight'
'Eiyuden Chronicle Rising' #This is not displayed on the page
'This War of Mine' #This is also not displayed on the page

I tried using other selectors for the cards such as response.css('li[class=back_darkish]'), but to no avail.

Also, how do I get the other data, such as the hours to beat, so that I end up with a dict of the name, type of completion, and hours? The relevant HTML looks like this:

<div>
    <div class="search_list_tidbit text_white shadow_text">Main Story</div>
    <div class="search_list_tidbit center time_100">50½ Hours </div>
    <div class="search_list_tidbit text_white shadow_text">Main + Extra</div>
    <div class="search_list_tidbit center time_100">94 Hours </div>
    <div class="search_list_tidbit text_white shadow_text">Completionist</div>
    <div class="search_list_tidbit center time_100">127 Hours </div>
</div>
Baraa Zaid

1 Answer

The data is actually loaded from an external URL: the page sends a POST request to an API endpoint, which returns the search results as HTML. You can send the same POST request from Scrapy:

import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        # The search page loads its results from this endpoint via a POST request.
        url = 'https://howlongtobeat.com/search_results?page=1'
        payload = "queryString=&t=games&sorthead=popular&sortd=0&plat=&length_type=main&length_min=&length_max=&v=&f=&g=&detail=&randomize=0"
        headers = {
            "content-type": "application/x-www-form-urlencoded",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
        }

        yield scrapy.Request(url, method='POST', body=payload, headers=headers, callback=self.parse)

    def parse(self, response):
        # Each search result is a div with class "search_list_details".
        cards = response.css('div[class="search_list_details"]')

        for card in cards:
            game_name = card.css('a[class=text_white]::attr(title)').get()
            yield {
                "game_name": game_name
            }


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()

Output:

{'game_name': 'Elden Ring'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Cyberpunk 2077'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Kirby and the Forgotten Land'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'LEGO Star Wars The Skywalker Saga'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Hollow Knight'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Tomb Raider'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Portal 2'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Hades'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'The Witcher 3 Wild Hunt'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Red Dead Redemption 2'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'BioShock'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Portal'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Horizon Forbidden West'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Trek to Yomi'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Grand Theft Auto V'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'God of War'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Marvels Guardians of the Galaxy'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'BioShock Infinite'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Pokmon Legends Arceus'}
2022-05-12 13:37:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://howlongtobeat.com/search_results?page=1>
{'game_name': 'Horizon Zero Dawn  Complete Edition'}
2022-05-12 13:37:12 [scrapy.core.engine] INFO: Closing spider (finished)
2022-05-12 13:37:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 490,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 2754,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 1.49537,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 5, 12, 7, 37, 12, 172047),
 'httpcompression/response_bytes': 23986,
 'httpcompression/response_count': 1,
 'item_scraped_count': 20,
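
For the second part of the question (the hours for each completion type), here is a minimal sketch. It assumes the markup looks like the snippet in the question, where every label div ("Main Story", "Main + Extra", ...) is immediately followed by its value div, and that those divs sit inside each search_list_details card. The helper names pair_tidbits and parse_card are made up for illustration; they are not part of Scrapy.

```python
def pair_tidbits(texts):
    """Pair alternating label/value strings into a dict.

    Assumes the list alternates label, value, label, value, ...
    as in the HTML shown in the question.
    """
    # Strip whitespace and drop empty strings before pairing.
    cleaned = [t.strip() for t in texts if t.strip()]
    # Even positions are labels, odd positions are the matching values.
    return dict(zip(cleaned[0::2], cleaned[1::2]))


def parse_card(card):
    """Build one item from a Scrapy selector for a single result card."""
    # `div.search_list_tidbit` matches both the label and the value divs,
    # because both carry that class among others.
    texts = card.css('div.search_list_tidbit::text').getall()
    return {
        "game_name": card.css('a[class=text_white]::attr(title)').get(),
        "times": pair_tidbits(texts),
    }
```

Inside parse, you would then yield parse_card(card) for each card instead of building the dict by hand. Note that if a value div is ever empty, the label/value pairing will shift, so it is worth checking a few items against the page.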
Md. Fazlul Hoque
  • God bless you, it works! But what about the second part of the question, scraping the hours played for each type of playthrough? They all have the same div class. – Baraa Zaid May 12 '22 at 08:17
  • @Baraa Zaid, it would be better to post a new question on that topic with this code arranged neatly. Don't worry, it's possible using XPath; just specify which data values you want. Thanks – Md. Fazlul Hoque May 12 '22 at 08:23
  • I know you answered this a while ago, but I appear to be suddenly getting `Redirecting (302) to from ` from the crawler. What seems to be the issue? – Baraa Zaid May 21 '22 at 03:35