How to efficiently parse an HTML list into a dict?

Question

I am wondering how I can streamline this mess of code and put the output into a nice dictionary instead of a list of tuples. Is there a better way to use BeautifulSoup for this, and if so, how?

from bs4 import BeautifulSoup as soup
import requests


data = []
sample = []

player_page = requests.get('https://www.premierleague.com/players/10483/Rolando-Aarons/stats')
cont = soup(player_page.content)
for strong_tag in cont.find_all('span', 'stat'):
    sample.append(strong_tag.text)
    tempStats = [x.replace("\r\n",",") for x in sample]
    tempStats = [x.replace("\n","") for x in tempStats]
    tempStats = [x.replace(" ","") for x in tempStats]
    tempStats = [i.split(',', 1) for i in tempStats] 
    tempStats = list(map(lambda sublist: tuple(map(str, sublist)), tempStats))
    tempStats = [tuple(int(item) if item.strip().isnumeric() else item for item in group) for group in tempStats]   
data.append(tempStats)
print(data)

My desired output looks like this:

PlayerName {stat1: 1, stat2: 2 , stat: 3, etc,etc}

The reason for this structure is so that I can extract specific keys from several players and compare their values.

Andrej Kesely · Accepted Answer · 2019-08-15T19:29:04.407

This script will create a dictionary of all stats found on the page:

from bs4 import BeautifulSoup as soup
import requests

player_page = requests.get('https://www.premierleague.com/players/10483/Rolando-Aarons/stats')
cont = soup(player_page.content, 'lxml')

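stat_selector = '.topStat span.stat, .normalStat span.stat'
value_selector = '.topStat span.stat > span, .normalStat span.stat > span'

# Pair each stat element with its nested value <span>; the label is the first text node of span.stat.
data = {
    k.contents[0].strip(): v.get_text(strip=True)
    for k, v in zip(cont.select(stat_selector), cont.select(value_selector))
}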

from pprint import pprint
pprint(data)

Prints:

{'Accurate long balls': '8',
 'Aerial battles lost': '12',
 'Aerial battles won': '7',
 'Appearances': '18',
 'Assists': '1',
 'Big chances created': '1',
 'Big chances missed': '0',
 'Blocked shots': '2',
 'Clearances': '11',
 'Cross accuracy %': '21%',
 'Crosses': '19',
 'Duels lost': '67',
 'Duels won': '54',
 'Errors leading to goal': '1',
 'Fouls': '11',
 'Freekicks scored': '0',
 'Goals': '2',
 'Goals per match': '0.11',
 'Goals with left foot': '1',
 'Goals with right foot': '0',
 'Headed Clearance': '6',
 'Headed goals': '1',
 'Hit woodwork': '1',
 'Interceptions': '8',
 'Losses': '12',
 'Offsides': '1',
 'Passes': '197',
 'Passes per match': '10.94',
 'Penalties scored': '0',
 'Recoveries': '43',
 'Red cards': '0',
 'Shooting accuracy %': '27%',
 'Shots': '11',
 'Shots on target': '3',
 'Successful 50/50s': '14',
 'Tackle success %': '70%',
 'Tackles': '20',
 'Through balls': '0',
 'Wins': '3',
 'Yellow cards': '2'}

EDIT: To create a dictionary keyed by the player's name, with his stats as the value, you can do this (`data` is the dictionary built by the script above):

players = {cont.select_one('.playerDetails .name').get_text(strip=True): data}

from pprint import pprint
pprint(players)

Prints:

{'Rolando Aarons': {'Accurate long balls': '8',
                    'Aerial battles lost': '12',
                    'Aerial battles won': '7',
                    'Assists': '1',
                    'Big chances created': '1',
                    'Big chances missed': '0',
                    'Blocked shots': '2',
...and so on.
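
Every value in these dictionaries is a string, including percentages such as '21%' and decimals such as '0.11', so comparing players numerically needs a conversion step first. The following is a minimal sketch and not part of the original answer: `to_number` is a hypothetical helper, the sample dictionary copies a few entries from the output above, and stripping commas is an assumption about how larger numbers might be formatted on the page.

```python
def to_number(value):
    """Convert a scraped stat string to int or float; return it unchanged if it is not numeric."""
    # Assumption: a trailing '%' and thousands separators carry no numeric meaning here.
    cleaned = value.strip().rstrip('%').replace(',', '')
    try:
        return int(cleaned)
    except ValueError:
        try:
            return float(cleaned)
        except ValueError:
            return value


raw_stats = {'Passes': '197', 'Goals per match': '0.11', 'Cross accuracy %': '21%'}
stats = {name: to_number(value) for name, value in raw_stats.items()}
print(stats)  # {'Passes': 197, 'Goals per match': 0.11, 'Cross accuracy %': 21}
```

Applying the same comprehension to each player's dictionary keeps the keys intact, so the result can still be looked up by stat name.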

Oh My Lord that is neat, is there any way I can map this dictionary to the player's name? I notice that Appearances, Wins and Losses don't get scraped; I guess that is because they are not included in the .normalStat CSS class — MisterButter, Aug 15 '19 at 19:20
This just blows my mind, sorry for asking but do you perhaps have any link where I can read up on this and try to decipher your script? — MisterButter, Aug 15 '19 at 19:33
@MisterButter It's not magic, just dict comprehension https://stackoverflow.com/questions/14507591/python-dictionary-comprehension The real work is finding the right CSS selectors. You can read about CSS selectors here for example https://www.w3schools.com/cssref/css_selectors.asp — Andrej Kesely, Aug 15 '19 at 19:36
Thank you Andrej for taking your precious time to provide the script and links, highly appreciate it!! — MisterButter, Aug 15 '19 at 19:38
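
For readers deciphering the dictionary-building expression in the answer, its shape can be reproduced without BeautifulSoup. This sketch assumes two parallel lists standing in for the results of the two `select` calls (the list names are hypothetical and the values are copied from the output above): `zip` pairs the items by position, and each pair becomes one key and one value. The two forms shown are equivalent.

```python
labels = ['Goals ', ' Assists', 'Passes']  # stand-in for the text of the label elements
values = ['2', '1', '197']                 # stand-in for the text of the nested value spans

# Form 1: a generator of (key, value) pairs passed to dict().
stats_a = dict((k.strip(), v) for k, v in zip(labels, values))

# Form 2: the same mapping written as a dict comprehension.
stats_b = {k.strip(): v for k, v in zip(labels, values)}

print(stats_a == stats_b)  # True
print(stats_b)             # {'Goals': '2', 'Assists': '1', 'Passes': '197'}
```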
I noticed that when trying to loop through more than one link, the script only returns the last player and doesn't add the dictionaries together. I have tried appending all the dicts to a list, but I am still only getting the last individual of the loop. Do you perhaps have any suggestions? — MisterButter, Aug 17 '19 at 10:49
@MisterButter For every player, get the player name to variable `player_name` and then you can add the player name to dictionary `players` for example like this: `players[player_name] = data`. — Andrej Kesely, Aug 17 '19 at 11:03
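
The suggestion above amounts to creating one shared dictionary before the loop and assigning into it on every pass, rather than rebuilding it each time. A minimal sketch of that pattern follows. `collect_players` and `fake_scrape` are hypothetical names, and the stub stands in for the real request-and-parse step so the example runs without network access; the 'Example Player' entry and its numbers are invented purely for illustration.

```python
def collect_players(urls, scrape):
    """Build {player_name: stats} by calling scrape(url) for every URL.

    scrape is any callable that returns a (player_name, stats_dict) pair.
    """
    players = {}  # created once, before the loop
    for url in urls:
        player_name, data = scrape(url)
        players[player_name] = data  # adds a key instead of replacing the whole dict
    return players


# Stand-in for the real scraping step (hypothetical; 'Example Player' values are made up).
FAKE_PAGES = {
    'url-1': ('Rolando Aarons', {'Goals': '2', 'Assists': '1'}),
    'url-2': ('Example Player', {'Goals': '5', 'Assists': '3'}),
}


def fake_scrape(url):
    return FAKE_PAGES[url]


players = collect_players(['url-1', 'url-2'], fake_scrape)

# Pull one stat for every player to compare them side by side.
goals = {name: stats['Goals'] for name, stats in players.items()}
print(goals)  # {'Rolando Aarons': '2', 'Example Player': '5'}
```

Passing the scraping step in as a function keeps the accumulation logic separate from the page-specific selectors, so either answer's parsing code could be wrapped and plugged in.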

Ajax1234 · Answer 2 · 2019-08-15T19:28:49.130

You can use find_all to access the data from the statsListBlock divs:

import requests, re
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.premierleague.com/players/10483/Rolando-Aarons/stats').text, 'html.parser')
new_d = d.find_all('div', {'class':'statsListBlock'})
results = {i.div.text[1:-1]:{c.span.contents[0]:c.span.contents[-2].text for c in i.find_all('div', {'class':'normalStat'})} for i in new_d}
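# Strip trailing whitespace from each label and keep the first run of digits in each value
# (raw strings avoid invalid-escape warnings for \s and \d).
new_results = {a:{re.sub(r'\s+$', '', c):re.findall(r'\d+', d)[0] for c, d in b.items()} for a, b in results.items()}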

Output:

{'Attack': {'Goals': '2', 'Goals per match': '0', 'Headed goals': '1', 'Goals with right foot': '0', 'Goals with left foot': '1', 'Penalties scored': '0', 'Freekicks scored': '0', 'Shots': '11', 'Shots on target': '3', 'Shooting accuracy %': '27', 'Hit woodwork': '1', 'Big chances missed': '0'}, 'Team Play': {'Assists': '1', 'Passes': '197', 'Passes per match': '10', 'Big chances created': '1', 'Crosses': '19', 'Cross accuracy %': '21', 'Through balls': '0', 'Accurate long balls': '8'}, 'Discipline': {'Yellow cards': '2', 'Red cards': '0', 'Fouls': '11', 'Offsides': '1'}, 'Defence': {'Tackles': '20', 'Tackle success %': '70', 'Blocked shots': '2', 'Interceptions': '8', 'Clearances': '11', 'Headed Clearance': '6', 'Recoveries': '43', 'Duels won': '54', 'Duels lost': '67', 'Successful 50/50s': '14', 'Aerial battles won': '7', 'Aerial battles lost': '12', 'Errors leading to goal': '1'}}

To associate the name:

new_result = {d.find('div', {'class':'name t-colour'}).text:new_results}

Output:

{'Rolando Aarons': {'Attack': {'Goals': '2', 'Goals per match': '0', 'Headed goals': '1', 'Goals with right foot': '0', 'Goals with left foot': '1', 'Penalties scored': '0', 'Freekicks scored': '0', 'Shots': '11', 'Shots on target': '3', 'Shooting accuracy %': '27', 'Hit woodwork': '1', 'Big chances missed': '0'}, 'Team Play': {'Assists': '1', 'Passes': '197', 'Passes per match': '10', 'Big chances created': '1', 'Crosses': '19', 'Cross accuracy %': '21', 'Through balls': '0', 'Accurate long balls': '8'}, 'Discipline': {'Yellow cards': '2', 'Red cards': '0', 'Fouls': '11', 'Offsides': '1'}, 'Defence': {'Tackles': '20', 'Tackle success %': '70', 'Blocked shots': '2', 'Interceptions': '8', 'Clearances': '11', 'Headed Clearance': '6', 'Recoveries': '43', 'Duels won': '54', 'Duels lost': '67', 'Successful 50/50s': '14', 'Aerial battles won': '7', 'Aerial battles lost': '12', 'Errors leading to goal': '1'}}}
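
Note that in the output above 'Goals per match' appears as '0' and 'Passes per match' as '10', whereas the accepted answer's output lists '0.11' and '10.94' for the same stats. That is consistent with the `\d+` pattern keeping only the digits before the decimal point. A minimal sketch of a pattern that also keeps the fractional part, run on sample strings rather than the live page (`first_number` is a hypothetical helper, not part of the answer):

```python
import re

# One or more digits, optionally followed by a decimal point and more digits.
NUMBER = re.compile(r'\d+(?:\.\d+)?')


def first_number(text):
    """Return the first integer or decimal found in text as a string, or None."""
    match = NUMBER.search(text)
    return match.group(0) if match else None


samples = ['0.11', '10.94', '27%', '197']
print([first_number(s) for s in samples])  # ['0.11', '10.94', '27', '197']
```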

This is insanely compact, thank you for taking your time to help! I'll try to understand that chinese you wrote! — MisterButter, Aug 15 '19 at 19:36
@MisterButter Glad to help! For more info on the code, particularly the `soup.contents` attr, please see [here](https://stackoverflow.com/questions/19602398/python-beautiful-soup-content-property) — Ajax1234, Aug 15 '19 at 19:38
Thank you @Ajax1234 for taking your valued time to help, and thank you for the link! — MisterButter, Aug 15 '19 at 19:40
