BeautifulSoup does not correctly scrape the data from one column of a table.

The script scrapes all the data in the table except the data in the 'Player' column; the output shows every player name as None.

Output Data

The only difference between the td element in the 'Player' column and all the other td elements in the tr is that the player data element has an href in it, as displayed in the images below.

inspect in each

td in the 'Player' column

How would I go about changing my code to get the players' names? Is it the href in the 'Player' data that is breaking my script? If so, how do I account for it?

#HOME_SKATERS
#FIRST_TWO_GAMES

import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
table = []
df = pd.DataFrame()
for i in range (400959564,400959565):
    url = requests.get("http://www.espn.com/nhl/boxscore?gameId={}".format(i))
    if not url.ok:
        continue
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    #Add the game ID to the list of soups to keep track of multiple players with same game ID
    table.append((i,soup.find_all('table', {'class' : 'mod-data'})[5].find_all('tr')[2:20]))

data = []
soups = []
game_id = []
for i, t in table:
    # Use .contents to turn each row of the soup into a list of items
    soups = [j.contents for j in t]
    for s in soups:
        # Use .string to parse the values of the different columns
        data.append([a.string for a in s])
        # Append the game ID
        game_id.append(i)


#Create a DataFrame from the data extracted
df = pd.DataFrame(data)
df.columns = ['Player', 'G', 'A','Plus_Minus', 'SOG', 'MS', 'BS', 'PN', 'PIM', 'HT', 'TK', 'GV', 'SHF', 'TOT', 'PP','SH', 'EV', 'FW', 'FL', 'Faceoff_Pct']
df['Game ID'] = game_id
#df.to_csv('HOME_SKATERS.csv')
df
Joseph K

1 Answer

Change:

data.append([a.string for a in s])

to

data.append([a.text for a in s])

Outputs:

             Player  G  A Plus_Minus SOG MS BS PN PIM HT    ...     GV SHF  \
0       J. Armia RW  0  0         -2   0  0  1  0   0  1    ...      0  21   
1   D. Byfuglien D   0  1          0   5  1  1  1   2  1    ...      1  29   
2        A. Copp C   0  0         -1   1  0  1  0   0  1    ...      0  18   
3        M. Dano C   0  0         -1   0  0  0  0   0  0    ...      0  14   
4      N. Ehlers LW  0  0         -1   2  1  0  0   0  0    ...      0  20   
5     T. Enstrom D   0  0         -2   1  1  2  0   0  1    ...      0  23   
6     D. Kulikov D   0  0          1   0  0  0  0   0  1    ...      0  20   
7       P. Laine RW  0  1          2   2  1  0  0   0  0    ...      1  23   
8      B. Little C   0  1          0   3  2  0  0   0  0    ...      0  22   
9       A. Lowry LW  0  0          0   4  1  0  0   0  3    ...      1  27   
10   S. Matthias C   0  0         -1   2  0  0  0   0  2    ...      2  23   
11  J. Morrissey D   0  0         -2   2  0  2  1   2  0    ...      0  20   
12      T. Myers D   0  0          0   2  1  2  1   2  2    ...      1  27   
13  M. Perreault C   1  0         -1   2  0  0  0   0  1    ...      0  23   
14  M. Scheifele C   1  0         -1   4  0  0  0   0  0    ...      0  23   
15      B. Tanev LW  0  0         -1   0  0  0  0   0  1    ...      0  15   
16     J. Trouba D   0  0         -4   5  0  4  1   2  4    ...      1  30   
17    B. Wheeler RW  0  0         -1   2  2  1  0   0  0    ...      0  23   

See "Difference between .string and .text BeautifulSoup" for more detail.

An example:

from bs4 import BeautifulSoup
data = '<td style="text-align:left;"><a href="http://www.espn.com/nhl/player/_/id/3961/blake-wheeler">B. Wheeler</a> RW</td>'
soup = BeautifulSoup(data, 'lxml')
td = soup.find("td")
print (td.string)
print (td.text)

Outputs:

None
B. Wheeler RW

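.string returns None because the "td" element contains markup: it has more than one child (the "a" tag and the trailing text " RW"), so there is no single string to return. .text concatenates all the text inside the element, including the text of nested tags.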
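If you also want the name and the position in separate columns, something along these lines might work. It is only a sketch: the sample row is made up from the Wheeler cell above plus three stat cells, cell_text is a hypothetical helper name, and I use 'html.parser' here only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Made-up sample row: the player cell from the example above plus a few stat cells.
html = (
    '<table><tr>'
    '<td style="text-align:left;">'
    '<a href="http://www.espn.com/nhl/player/_/id/3961/blake-wheeler">B. Wheeler</a> RW</td>'
    '<td>0</td><td>0</td><td>-1</td>'
    '</tr></table>'
)

# 'html.parser' is an assumption made for portability; 'lxml' should behave the same here.
soup = BeautifulSoup(html, 'html.parser')
row = soup.find('tr')


def cell_text(td):
    """Hypothetical helper: all text inside a cell, nested tags included."""
    return td.get_text(' ', strip=True)


# Works for plain cells and for cells that contain a link.
cells = [cell_text(td) for td in row.find_all('td')]

# Split the first cell into name (the link text) and position (what follows the link).
first = row.find('td')
link = first.find('a')
name = link.get_text(strip=True) if link else cells[0]
position = cells[0][len(name):].strip() if link else ''

print(cells)
print(name, '|', position)
```

Iterating over find_all('td') rather than .contents also means you only ever call get_text on real cells, not on any whitespace strings that may sit between them.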

Dan-Dev