Web scraping Python beautifulsoup

Question

I'm trying to create a crawler that scans the website https://www.superherodb.com/ and fetches the information on all the superheroes (listed at https://www.superherodb.com/characters) from their individual pages. I want to fetch everything about each hero: the stats, powers, equipment, origin, connections, etc. But I am having trouble reading the stats from a hero's page.

For example, this page: https://www.superherodb.com/001/10-39302/

For the Power Stats section on the hero's page I tried:

    bs_test.find_all("div", {"class": "stat-value"})

and:

    bs_test.select(".stat-value")

But the output always shows 0 for every value:

[<div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>,
 <div class="stat-value">0</div>]

What am I missing here? Please help me.

score 1 · Answer 1 · answered Mar 08 '22 at 16:26


The real values aren't in those divs. Try scraping the element with class="note footnote" rather than the stat-value divs. It provides the following data:

stats_10_39302_shdb = {"stats":{"int":140,"str":45,"spe":5,"dur":5,"pow":0,"com":20,"tie":0},"bars":{"int":70,"str":1,"spe":1,"dur":5,"pow":0,"com":20,"tie":0}

for the Han example.

bensonium


Thank you for the help, but how do I store the values in a dict or list instead of just printing them? – Baraa Zaid Mar 09 '22 at 15:03

score 0 · Accepted Answer · answered Mar 08 '22 at 16:38

The data is injected by JS after the page loads, but requests.get only returns the static HTML. That HTML contains placeholder values (the zeros you're seeing) alongside a <script> tag holding a JSON-formatted JS object with the actual data.

Following up on the astute answer from bensonium, here's how you can pull the data out of the .footnote script element:

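import json
import re
import requests

response = requests.get("https://www.superherodb.com/001/10-39302/")
response.raise_for_status()  # fail loudly on HTTP errors

# Each stats object starts with {"stats": and is matched up to the next semicolon.
stats = [json.loads(x) for x in re.findall(r'{"stats":[^;]+', response.text)]

# `stats` is already a list of dicts; json.dumps is only for readable printing.
print(json.dumps(stats, indent=2))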

Output:

[
  {
    "stats": {
      "int": 140,
      "str": 45,
      "spe": 5,
      "dur": 5,
      "pow": 0,
      "com": 20,
      "tie": 0
    },
    "bars": {
      "int": 70,
      "str": 1,
      "spe": 1,
      "dur": 5,
      "pow": 0,
      "com": 20,
      "tie": 0
    },
    "shdbclass": {
      "value": 10,
      "visual": 10,
      "level": 1
    },
    "specials": {
      "omnipotent": 0,
      "omniscient": 0,
      "omnipresent": 0
    }
  },
  {
    "stats": {
      "int": 100,
      "str": 100,
      "spe": 10,
      "dur": 1,
      "pow": 1,
      "com": 1,
      "tie": 9
    },
    "bars": {
      "int": 50,
      "str": 1,
      "spe": 6,
      "dur": 1,
      "pow": 1,
      "com": 1,
      "tie": 90
    },
    "shdbclass": {
      "value": 20.5,
      "visual": 21,
      "level": 1
    },
    "specials": {
      "omnipotent": 0,
      "omniscient": 0,
      "omnipresent": 0
    },
    "ustats": 1
  }
]
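
Since `stats` is a plain list of dicts, individual values can be read with ordinary indexing. Here is a minimal sketch of that, using a hand-copied subset of the first object in the output above so it runs without a network request; the variable names (`first`, `intelligence`, `stat_values`, `nonzero`) are only illustrative:

```python
# First object from the output above, trimmed to its "stats" and "bars" keys.
stats = [
    {
        "stats": {"int": 140, "str": 45, "spe": 5, "dur": 5, "pow": 0, "com": 20, "tie": 0},
        "bars": {"int": 70, "str": 1, "spe": 1, "dur": 5, "pow": 0, "com": 20, "tie": 0},
    }
]

first = stats[0]

# A single value: list index, then dict keys.
intelligence = first["stats"]["int"]

# All stat values as a list, in the order they appear in the dict.
stat_values = list(first["stats"].values())

# A filtered dict that keeps only the stats that are not zero.
nonzero = {name: value for name, value in first["stats"].items() if value}

print(intelligence)
print(stat_values)
print(nonzero)
```

Nothing needs converting before you store it: append `first` (or just `first["stats"]`) to your own list, or key it by hero URL in a dict.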

See the canonical thread "Web-scraping JavaScript page with Python" for a generalization of this approach, plus more explanations and strategies for scraping JS-driven pages.
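
If you want to run this over many hero pages, one possible shape is sketched below. It is an assumption-laden variant, not the only way to do it: `extract_stat_blocks` and `scrape_many` are hypothetical helper names, the extraction swaps the `[^;]+` regex for `json.JSONDecoder.raw_decode` (so a semicolon inside a string value would not cut the object short), and the fetcher is passed in as a parameter so the example can run offline against fake pages. The second URL and its HTML are made up for the demo.

```python
import json
import re


def extract_stat_blocks(html):
    """Return every JSON object in `html` that begins with {"stats": ...}."""
    decoder = json.JSONDecoder()
    blocks = []
    for match in re.finditer(r'\{"stats":', html):
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever text follows it.
            obj, _end = decoder.raw_decode(html, match.start())
        except json.JSONDecodeError:
            continue  # looked like a stats object but was not valid JSON
        blocks.append(obj)
    return blocks


def scrape_many(urls, fetch):
    """Map each URL to the stat blocks found on that page.

    `fetch` is any callable that takes a URL and returns the page HTML.
    """
    return {url: extract_stat_blocks(fetch(url)) for url in urls}


# Offline stand-ins for real pages; the first mimics the data shown earlier.
hero_url = "https://www.superherodb.com/001/10-39302/"
fake_pages = {
    hero_url: (
        '<script>var stats_10_39302_shdb = '
        '{"stats":{"int":140,"str":45,"spe":5,"dur":5,"pow":0,"com":20,"tie":0},'
        '"bars":{"int":70,"str":1,"spe":1,"dur":5,"pow":0,"com":20,"tie":0}};'
        '</script>'
    ),
    "https://example.invalid/hero-without-stats/": "<p>no stats here</p>",
}

results = scrape_many(fake_pages, fake_pages.get)
print(results[hero_url][0]["stats"]["str"])
```

For real pages, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, ideally with a short pause between requests so the crawl stays polite to the site.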

omg you're a LIFESAVER!!! Thanks a million man. This worked like a charm. You have no idea how much I appreciate this! And I love how neat and tidy the results are and how simple the code you entered. Now all I have to do is loop this over the links from all the other characters :D — Baraa Zaid, Mar 08 '22 at 19:22
I'm not sure I follow -- `stats` is a list of dicts already. I dump it to JSON on the last line because by default, Python prints list/dict structures in a compressed, unreadable manner. — ggorlen, Mar 09 '22 at 15:24
oh ok sorry I just noticed this was a nested dictionary... my bad. Thanks again and sorry for your troubles — Baraa Zaid, Mar 10 '22 at 07:37
