I am trying to use Scrapy to get the names of all current WWE superstars from the following URL: http://www.wwe.com/superstars. However, when I run my scraper, it does not return any names. I believe (from attempting the problem with other modules) that Scrapy is not finding all of the HTML elements on the page. I tried the same thing with requests and Beautiful Soup, and the HTML that requests fetched was missing important parts of the HTML that I see in my browser's inspector. The HTML containing the names looks like this:

<div class="superstars--info">
    <span class="superstars--name">name here</span>
</div>

My code is posted below. Is there something that I am doing wrong that is causing this not to work?

import scrapy

class SuperstarSpider(scrapy.Spider):
    name = "star_spider"
    start_urls = ["http://www.wwe.com/superstars"]

    def parse(self, response):
        star_selector = '.superstars--info'
        for star in response.css(star_selector):
            NAME_SELECTOR = 'span ::text'
            yield {
                'name' : star.css(NAME_SELECTOR).extract_first(),
            }
SPFort
  • It sounds like the site may be serving dynamic content (which means that some of the HTML is created/modified only when a web browser is using it). This has the effect of having a "hole" in the HTML where you see data on your dev tools but not in the scrapy scraped page. Does this [q/a](https://stackoverflow.com/q/8550114/3491991) sound like your problem? – zelusp Apr 09 '18 at 17:59
  • Yeah, that sounds like what I'm experiencing. Is there any way around that? – SPFort Apr 09 '18 at 18:15
  • Following a tip from that thread... it looks like you could try finding the endpoint to that data then query that directly (you'll probably have to reverse engineer the query structure from that page's javascript) – zelusp Apr 09 '18 at 19:24

2 Answers

Sounds like the site has dynamic content that may be loaded via JavaScript and/or XHR calls. Look into Splash: it's a JavaScript rendering engine that behaves a lot like PhantomJS. If you know how to use Docker, Splash is super simple to set up. Once you have Splash running, you integrate it with Scrapy using the scrapy-splash plugin.
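
As a rough sketch of how that integration might look (assuming Splash is running locally on its default port 8050; the middleware settings follow the scrapy-splash README):

import scrapy
from scrapy_splash import SplashRequest

class SuperstarSpider(scrapy.Spider):
    name = "star_spider"

    # These settings normally live in settings.py; they are taken from
    # the scrapy-splash README. Point SPLASH_URL at your Splash instance.
    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def start_requests(self):
        # SplashRequest has Splash render the page (running its javascript)
        # before the HTML is handed back to Scrapy.
        yield SplashRequest("http://www.wwe.com/superstars", self.parse,
                            args={'wait': 2})

    def parse(self, response):
        for star in response.css('.superstars--info'):
            yield {'name': star.css('span ::text').extract_first()}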

notorious.no

Since the content is javascript-generated, you have two options: use something like Selenium to mimic a browser and parse the rendered HTML (a minimal sketch follows), or, if you can, query an API directly.
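
If you go the Selenium route, a minimal sketch might look like this (untested here; it assumes Firefox with geckodriver on your PATH, and reuses the .superstars--name selector from the question's HTML):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("http://www.wwe.com/superstars")
    # Wait until the javascript-rendered name elements actually exist.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".superstars--name"))
    )
    for span in driver.find_elements(By.CSS_SELECTOR, ".superstars--name"):
        print(span.text)
finally:
    driver.quit()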

In this case, though, the API route is simpler, and this solution works:

import requests

# The page builds the superstar list from this JSON endpoint, so we can
# query it directly instead of rendering any javascript.
URL = "http://www.wwe.com/api/superstars"

with requests.Session() as s:
    # A browser-like User-Agent, since some sites reject bare requests.
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    resp = s.get(URL).json()
    for x in resp['talent'][:10]:  # print only the first 10 records
        print(x['name'])

Output (first 10 records):

Abdullah the Butcher
Adam Bomb
Adam Cole
Adam Rose
Aiden English
AJ Lee
AJ Styles
Akam
Akeem
Akira Tozawa
drec4s
  • I went the route of Selenium. It's really easy to use and I've since been using it for other web scraping tasks. – SPFort May 10 '18 at 16:17