
I have written code to extract the data from the first page, but I am running into problems when trying to extract it from all pages.

This is my code to extract the data from the 'a' page:

from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
from string import ascii_lowercase


def make_soup(url):
    # Fetch the page and hand it to BeautifulSoup's built-in HTML parser.
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

playerdatasaved = ""

soup = make_soup('https://www.basketball-reference.com/players/a/')

for record in soup.findAll("tr"): 
    playerdata = "" 
    for data in record.findAll(["th","td"]): 
        playerdata = playerdata + "," + data.text 

    playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

print(playerdatasaved)

header = "player, from, to, position, height, weight, dob, year, 
colleges"+"\n"
file = open(os.path.expanduser("basketballstats.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors="ignore"))
file.write(bytes(playerdatasaved[1:], encoding="ascii", errors="ignore"))
file.close()

Now, to loop through all the pages, my logic is this:

from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

playerdatasaved = ""
for letter in ascii_lowercase:
    soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
    for record in soup.findAll("tr"):
        playerdata = "" 
        for data in record.findAll(["th","td"]): 
            playerdata = playerdata + "," + data.text 

        playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "player, from, to, position, height, weight, dob, year, 
colleges"+"\n"
file = open(os.path.expanduser("basketball.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors="ignore"))
file.write(bytes(playerdatasaved[1:], encoding="ascii", errors="ignore"))
file.close()

However, this runs into an error on the line: soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")

Michael

4 Answers


I tried to run your code and ran into an SSL certificate error (CERTIFICATE_VERIFY_FAILED), which seems to be a problem with the website you are trying to scrape rather than with your code.

Maybe this Stack Overflow question can help clear things up: "SSL: certificate_verify_failed" error when scraping https://www.thenewboston.com/
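
If that is indeed the problem, one workaround from that thread is to pass an unverified SSL context to urlopen. A minimal sketch, assuming you accept the security trade-off of skipping certificate verification:

import ssl
import urllib.request
from bs4 import BeautifulSoup

# Context that skips certificate verification; insecure, but it unblocks
# scraping when the certificate chain cannot be validated locally.
context = ssl._create_unverified_context()

def make_soup(url):
    thepage = urllib.request.urlopen(url, context=context)
    return BeautifulSoup(thepage, 'html.parser')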

Ariel Ferdman
for letter in ascii_lowercase:
    soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")

In the URL you provided, you are encountering a 404 error when letter = 'x'. It looks like that player index does not exist, so make sure you check for that case when going through the letters.
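
For example, a minimal sketch that skips the missing index pages by catching urllib.error.HTTPError (which urlopen raises on a 404):

import urllib.error

for letter in ascii_lowercase:
    try:
        soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
    except urllib.error.HTTPError as e:
        # No player index page for this letter (e.g. 'x' returns 404).
        if e.code == 404:
            continue
        raise
    # ... parse the rows as before ...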

Eman

Agreed with Eman. The page for 'x' is not available. Just use a try/except block to ignore that page.

    try:
        soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
        for record in soup.findAll("tr"):
            playerdata = "" 
            for data in record.findAll(["th","td"]): 
                playerdata = playerdata + "," + data.text 

            playerdatasaved = playerdatasaved + "\n" + playerdata[1:]
    except Exception as e:
        print(e)
lenhhoxung

To fix your code, the first thing to do is make sure the loop iterates over ascii_lowercase as a string, so that soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/") runs without major exceptions. Just change your first for loop to: for letter in str(ascii_lowercase): (note that ascii_lowercase is already a string, so the str() call is only a safeguard).

The next thing is to handle the exception raised when a page cannot be found. For example, "https://www.basketball-reference.com/players/x/" does not exist. For that, we can use try/except.

And last, but not least, you have to skip the header row of each table, otherwise you will have lots of Player,From,To,Pos,Ht,Wt,Birth,Date,Colleges lines in your file. So, do this:

for table in soup.findAll("tbody"):
    for record in table.findAll("tr"):

Instead of this:

for record in soup.findAll("tr"):
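
An alternative sketch, if you prefer to keep a single loop, is to skip any row that sits inside a <thead> using BeautifulSoup's find_parent (this assumes the repeated header rows live in the table's <thead>, which is what the tbody fix above relies on):

for record in soup.findAll("tr"):
    # Skip header rows, which sit inside <thead> rather than <tbody>.
    if record.find_parent("thead"):
        continue
    playerdata = ""
    for data in record.findAll(["th","td"]):
        playerdata = playerdata + "," + data.text
    playerdatasaved = playerdatasaved + "\n" + playerdata[1:]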

Here is the whole thing working:

from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

playerdatasaved = ""
for letter in str(ascii_lowercase):
    print(letter) # I added this to see the magic happening
    try:
        soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
        for table in soup.findAll("tbody"):
            for record in table.findAll("tr"):
                playerdata = ""
                for data in record.findAll(["th","td"]):
                    playerdata = playerdata + "," + data.text

                playerdatasaved = playerdatasaved + "\n" + playerdata[1:]
    except:
        # Ignore letters whose index page does not exist (e.g. 'x').
        pass

header = "player, from, to, position, height, weight, dob, year,colleges"+"\n"
file = open(os.path.expanduser("basketball.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors="ignore"))
file.write(bytes(playerdatasaved[1:], encoding="ascii", errors="ignore"))
file.close()
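
As a side note on the design: the standard csv module handles cells whose text itself contains a comma, which a plain string join would split into extra columns. A minimal sketch of that variant, reusing make_soup from above:

import csv
import os
import urllib.error
from string import ascii_lowercase

with open(os.path.expanduser("basketball.csv"), "w", newline="", encoding="ascii", errors="ignore") as f:
    writer = csv.writer(f)
    writer.writerow(["player", "from", "to", "position", "height",
                     "weight", "dob", "year", "colleges"])
    for letter in ascii_lowercase:
        try:
            soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
        except urllib.error.HTTPError:
            continue  # no index page for this letter (e.g. 'x')
        for table in soup.findAll("tbody"):
            for record in table.findAll("tr"):
                # One CSV row per table row; csv handles quoting for us.
                row = [data.text for data in record.findAll(["th", "td"])]
                if row:
                    writer.writerow(row)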
Rafael Barros