
I am learning web scraping.

I managed to scrape a ranking of top YouTubers using this as a reference.

I am using the same logic to scrape the Premier League table, but I am running into three issues:

  1. it only collects up to 5th place.
  2. only the first place ends up in the result.
  3. and then it raises an AttributeError:


    from bs4 import BeautifulSoup
    import requests
    import csv


    url = 'https://www.premierleague.com/tables'
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    standings = soup.find('div', attrs={'data-ui-tab': 'First Team'}).find_all('tr')[1:]
    print(standings)
    
    file = open("pl_standings.csv", 'w')
    writer = csv.writer(file)
    
    writer.writerow(['position', 'club_name', 'points'])
    
    for standing in standings:
        position = standing.find('span', attrs={'class': 'value'}).text.strip()
        club_name = standing.find('span', {'class': 'long'}).text
        points = standing.find('td', {'class': 'points'}).text
    
        print(position, club_name, points)
    
        writer.writerow([position, club_name, points])
    
    file.close()
veryreverie

1 Answer


The issue is that `html.parser` doesn't parse the page correctly (try using the `lxml` parser). Also, take every second `<tr>` to get correct results:

import requests
from bs4 import BeautifulSoup


url = "https://www.premierleague.com/tables"
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml") # <-- use lxml

standings = soup.find("div", attrs={"data-ui-tab": "First Team"}).find_all(
    "tr"
)[1::2]  # <-- get every second <tr>

for standing in standings:
    position = standing.find("span", attrs={"class": "value"}).text.strip()
    club_name = standing.find("span", {"class": "long"}).text
    points = standing.find("td", {"class": "points"}).text
    print(position, club_name, points)

Prints:

1 Manchester City 77
2 Liverpool 76
3 Chelsea 62
4 Tottenham Hotspur 57
5 Arsenal 57
6 Manchester United 54
7 West Ham United 52
8 Wolverhampton Wanderers 49
9 Leicester City 41
10 Brighton and Hove Albion 40
11 Newcastle United 40
12 Brentford 39
13 Southampton 39
14 Crystal Palace 37
15 Aston Villa 36
16 Leeds United 33
17 Everton 29
18 Burnley 28
19 Watford 22
20 Norwich City 21
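To restore the CSV output the question was after, the same loop can feed `csv.writer`. Below is a sketch against a contrived markup snippet: the `data-ui-tab` div, the `value`/`long` spans, and the `points` cell are the selectors from the question, but the HTML itself is made up so the example runs without a network call (on the live page, fetch with `requests` and parse with `lxml` as above):

```python
import csv
import io

from bs4 import BeautifulSoup

# Contrived markup mimicking the live table: each club row is followed by
# an expandable detail row, so only every second <tr> holds standings data.
html = """
<div data-ui-tab="First Team">
  <table>
    <tr><th>Position</th><th>Club</th><th>Points</th></tr>
    <tr>
      <td><span class="value">1</span></td>
      <td><span class="long">Manchester City</span></td>
      <td class="points">77</td>
    </tr>
    <tr><td colspan="3">expanded detail row</td></tr>
    <tr>
      <td><span class="value">2</span></td>
      <td><span class="long">Liverpool</span></td>
      <td class="points">76</td>
    </tr>
    <tr><td colspan="3">expanded detail row</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
standings = soup.find("div", attrs={"data-ui-tab": "First Team"}).find_all("tr")[1::2]

buffer = io.StringIO()  # stand-in for a real file object
writer = csv.writer(buffer)
writer.writerow(["position", "club_name", "points"])
for standing in standings:
    position = standing.find("span", attrs={"class": "value"}).text.strip()
    club_name = standing.find("span", {"class": "long"}).text
    points = standing.find("td", {"class": "points"}).text
    writer.writerow([position, club_name, points])

print(buffer.getvalue())
```

To write a real file, swap `io.StringIO()` for `open("pl_standings.csv", "w", newline="")` in a `with` block; `newline=""` prevents blank lines between rows on Windows.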
Andrej Kesely
  • May I ask why it is that `html.parser` doesn't work here but `lxml` does? – Rabinzel Apr 23 '22 at 10:20
  • @Rabinzel `html.parser` behaves differently when there's malformed HTML. `lxml` behaves in a more standards-compliant way in this regard. For full compliance there's also `html5lib`, but it's slow (it's a tradeoff between speed and correctness). – Andrej Kesely Apr 23 '22 at 10:23
  • Thanks! So, as someone who isn't really into `html` and just wants to scrape something here and there (I wouldn't spot malformed HTML right away), am I best off starting with `html5lib` and then trying out the other parsers and comparing the results if performance matters? – Rabinzel Apr 23 '22 at 10:27
  • @Rabinzel When you're scraping just a few pages or learning, you can go with `html5lib`, no problem. But when you're trying to scrape at scale, the slowness quickly adds up (though there's also the `multiprocessing` module to help). – Andrej Kesely Apr 23 '22 at 10:31
  • ok, thanks. I'll keep that in mind. – Rabinzel Apr 23 '22 at 10:43
  • Never thought about the `html.parser` issue. Thank you so much for your help @AndrejKesely :) – jisoooh0202 Apr 23 '22 at 19:48
  • Just in case, for any newbies getting another error with `lxml`: I ran `pip install lxml` and it worked. Also, [this](https://stackoverflow.com/questions/24398302/bs4-featurenotfound-couldnt-find-a-tree-builder-with-the-features-you-requeste) and [this](https://stackoverflow.com/questions/17766725/how-to-re-install-lxml) link have different solutions. – jisoooh0202 Apr 23 '22 at 20:12
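The parser difference discussed in the comments is easiest to see on a tiny piece of malformed HTML. This snippet (the kind of example shown in the BeautifulSoup documentation's "Differences between parsers" section) has a stray closing `</p>`, and the two parsers recover different trees:

```python
from bs4 import BeautifulSoup

snippet = "<a></p>"  # malformed: </p> closes a tag that was never opened

# html.parser keeps the fragment as-is and silently drops the stray tag.
print(BeautifulSoup(snippet, "html.parser"))  # <a></a>

# lxml also drops the stray tag, but wraps the result in a full HTML tree.
print(BeautifulSoup(snippet, "lxml"))  # <html><body><a></a></body></html>
```

Because each parser repairs bad markup differently, a selector that matches under one parser can come up empty under another, and a `find()` that returns `None` is exactly what produces the question's `AttributeError`.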