
I am doing a mini-project to collect some data from a popular League of Legends website, www.op.gg. For example, if you go to this page, you will see 10 games' worth of data shown on the right. If you keep scrolling down, you will see a "Show More" button at the bottom which loads the next 20 results, and so on. When I inspect the "Show More" element using Chrome DevTools, I see the following entry:

<a href="#" onclick="$.OP.GG.matches.list.loadMore($(this)); return false;" class="Button">Show More</a>

I am currently using Scrapy to grab several datapoints from pages like this. I have successfully grabbed the first 10 games that show up, but I need some help fetching additional results until a set time period is covered (i.e. keep loading more results until the `data-datetime` attribute in each `GameItemWrap` element is more than 30 days before runtime).

My code is below:

import scrapy

class PostsSpider(scrapy.Spider):
    name = "posts"
    start_urls = [
        'https://na.op.gg/summoner/userName=C9+Zven',
        'https://na.op.gg/summoner/userName=From+Iron'
    ]

    def parse(self, response):
        summoner_name = response.css('.SummonerLayout>.Header>.Profile>.Information>.Name::text').get()
        rank_type = response.css('.TierRankInfo .RankType::text').get()
        tier_rank = response.css('.TierRankInfo .TierRank::text').get()   

        game_lists = []
        for game in response.css('div.GameItemWrap'):
            game_lists.append({
                'summoner_id': game.css('.GameItem::attr(data-summoner-id)').get(),
                'data_game_time': game.css('.GameItem::attr(data-game-time)').get(),
                'game_type': game.css('.Content .GameStats .GameType::text').get().strip(),
                'date_time_epoch': game.css('.Content .GameStats .TimeStamp ._timeago::attr(data-datetime)').get(),
                'game_result': game.css('.Content .GameStats .GameResult::text').get().strip(),
                'champ_name': game.css('.Content .GameSettingInfo .ChampionName a::text').get(),
                'kill': game.css('.Content .KDA .KDA .Kill::text').get(),
                'death': game.css('.Content .KDA .KDA .Death::text').get(),
                'assist': game.css('.Content .KDA .KDA .Assist::text').get(),
            })

        yield {
            'summoner_name': summoner_name,
            'rank_type': rank_type, 
            'tier_rank': tier_rank, 
            'games': game_lists
        }

        # This is where I would like to add some code to retrieve more results for this profile going back to 30 days ago from runtime
zeff
  • Using only `Selenium` you could simply click the button. With other tools you could check in DevTools (in the Network tab) what request is sent when you click the button (especially XHR requests) and reuse it - maybe it will work without simulating AJAX and you could get JSON data which can easily be converted to a Python list/dictionary. To simulate AJAX you would have to add the header `X-Requested-With: XMLHttpRequest` – furas Feb 29 '20 at 08:04
  • When I use the network tab and click on 'Show More', [this](https://na.op.gg/summoner/matches/ajax/averageAndList/startInfo=1582695495&summonerId=91419120) kind of link appears. It is a JSON of HTML strings-- any idea the most efficient way to parse this in the same way I can do CSS parsing from Scrapy? – zeff Feb 29 '20 at 08:15
  • You can use `lxml` or `BeautifulSoup` to parse it and search using `css` or `xpath`. Or you can use Scrapy and `HtmlResponse` or `Selector` to create a response from the string/HTML - see [scrapy: convert html string to HtmlResponse object](https://stackoverflow.com/questions/27323740/scrapy-convert-html-string-to-htmlresponse-object) – furas Feb 29 '20 at 09:06
  • The HTML field from [this](https://na.op.gg/summoner/matches/ajax/averageAndList/startInfo=1582695495&summonerId=91419120) JSON is broken or partial HTML so I am not sure what the best approach here is in parsing this HTML string efficiently. For example, if you dump the HTML string into a file and save it as an HTML file, opening it up will show something that doesn't make sense. Is there any way I can still search this string using CSS selectors? – zeff Feb 29 '20 at 10:03
  • 1
    Did you try using this HTML directly with `HtmlResponse`, `Selector`, `BeautifulSoup` or `lxml`? They don't need full HTML to work. – furas Feb 29 '20 at 16:54
  • 1
    `from scrapy import Selector; data = Selector(text=string); data.css('#myid::text').get()` – ThePyGuy Mar 04 '20 at 00:11

0 Answers