I am trying to use Scrapy to get the names of all current WWE superstars from the following URL: http://www.wwe.com/superstars. However, when I run my scraper, it does not return any names. I believe (from attempting the problem with other modules) that Scrapy is not finding all of the HTML elements on the page. I tried the same thing with requests and Beautiful Soup, and the HTML that requests fetched was missing important parts of the HTML that I see in my browser's inspector. The HTML containing the names looks like this:

<div class="superstars--info">
    <span class="superstars--name">name here</span>
</div>

My code is posted below. Is there something that I am doing wrong that is causing this not to work?

import scrapy

class SuperstarSpider(scrapy.Spider):
    name = "star_spider"
    start_urls = ["http://www.wwe.com/superstars"]

    def parse(self, response):
        star_selector = '.superstars--info'
        for star in response.css(star_selector):
            NAME_SELECTOR = 'span ::text'
            yield {
                'name' : star.css(NAME_SELECTOR).extract_first(),
            }
SPFort
  • It sounds like the site may be serving dynamic content (which means that some of the HTML is created/modified only when a web browser is using it). This has the effect of having a "hole" in the HTML where you see data on your dev tools but not in the scrapy scraped page. Does this [q/a](https://stackoverflow.com/q/8550114/3491991) sound like your problem? – zelusp Apr 09 '18 at 17:59
  • Yeah, that sounds like what I'm experiencing. Is there any way around that? – SPFort Apr 09 '18 at 18:15
  • Following a tip from that thread... it looks like you could try finding the endpoint to that data then query that directly (you'll probably have to reverse engineer the query structure from that page's javascript) – zelusp Apr 09 '18 at 19:24

2 Answers

Sounds like the site has dynamic content that may be loaded via JavaScript and/or XHR calls. Look into Splash: it's a JavaScript rendering engine that behaves a lot like PhantomJS. If you know how to use Docker, Splash is super simple to set up. Once you have Splash running, you integrate it with Scrapy using the scrapy-splash plugin.
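
As a rough sketch of how that integration might look (assuming Splash is running locally on its default port 8050; the middleware settings follow the scrapy-splash README):

import scrapy
from scrapy_splash import SplashRequest

class SuperstarSpider(scrapy.Spider):
    name = "star_spider"

    # These settings normally live in settings.py; they are taken from
    # the scrapy-splash README. Point SPLASH_URL at your Splash instance.
    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def start_requests(self):
        # SplashRequest has Splash render the page (running its javascript)
        # before the HTML is handed back to Scrapy.
        yield SplashRequest("http://www.wwe.com/superstars", self.parse,
                            args={'wait': 2})

    def parse(self, response):
        for star in response.css('.superstars--info'):
            yield {'name': star.css('span ::text').extract_first()}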

notorious.no

Since the content is javascript-generated, you have two options: use something like Selenium to mimic a browser and parse the rendered HTML (a minimal sketch follows), or, if you can, query an API directly.
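
If you go the Selenium route, a minimal sketch might look like this (untested here; it assumes Firefox with geckodriver on your PATH, and reuses the .superstars--name selector from the question's HTML):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("http://www.wwe.com/superstars")
    # Wait until the javascript-rendered name elements actually exist.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".superstars--name"))
    )
    for span in driver.find_elements(By.CSS_SELECTOR, ".superstars--name"):
        print(span.text)
finally:
    driver.quit()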

In this case, though, the API route is simpler, and this solution works:

import requests

# The page builds the superstar list from this JSON endpoint, so we can
# query it directly instead of rendering any javascript.
URL = "http://www.wwe.com/api/superstars"

with requests.Session() as s:
    # A browser-like User-Agent, since some sites reject bare requests.
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    resp = s.get(URL).json()
    for x in resp['talent'][:10]:  # print only the first 10 records
        print(x['name'])

Output (first 10 records):

Abdullah the Butcher
Adam Bomb
Adam Cole
Adam Rose
Aiden English
AJ Lee
AJ Styles
Akam
Akeem
Akira Tozawa
drec4s
  • I went the route of Selenium. It's really easy to use and I've since been using it for other web scraping tasks. – SPFort May 10 '18 at 16:17