I am trying to scrape this site for the starting lineups. https://www.cbssports.com/nhl/teams/BOS/boston-bruins/depth-chart/

I am using the following code, but the table it prints contains text I do not want, such as each player's short name and the player news. I only want the text from the CellPlayerName--long elements, but I am unsure how to extract just that.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.cbssports.com/nhl/teams/BOS/boston-bruins/depth-chart"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html.parser')
df = pd.read_html(str(soup.find_all('table')))
df[0]

It prints the following:

POS Starter Second Third+
Center P. Bergeron Bruins' Patrice Bergeron: Pots winner in NJ Patrice Bergeron Bruins' Patrice Bergeron: Pots winner in NJ D. KrejciDavid Krejci C. CoyleCharlie CoyleT. Nosek Bruins' Tomas Nosek: Returning Monday Undisclosed: Expected to be out until at least Jan 2 Tomas Nosek Bruins' Tomas Nosek: Returning Monday Undisclosed: Expected to be out until at least Jan 2 M. Filipe Lower Body: IR. Expected to be out until at least Jan 29 Matt Filipe Lower Body: IR. Expected to be out until at least Jan 29
Left Wing B. Marchand Bruins' Brad Marchand: Two points against Buffalo Brad Marchand Bruins' Brad Marchand: Two points against Buffalo P. ZachaPavel Zacha T. HallTaylor HallN. FolignoNick FolignoA. GreerA.J. Greer
Right Wing J. DeBruskJake DeBrusk D. Pastrnak Bruins' David Pastrnak: Another two-point performance David Pastrnak Bruins' David Pastrnak: Another two-point performance T. Frederic Bruins' Trent Frederic: Scores goal Wednesday Trent Frederic Bruins' Trent Frederic: Scores goal Wednesday C. SmithCraig Smith
Left Defenseman H. LindholmHampus Lindholm M. GrzelcykMatt Grzelcyk D. ForbortDerek ForbortJ. ZborilJakub Zboril
Right Defenseman C. McAvoyCharlie McAvoy B. CarloBrandon Carlo C. CliftonConnor Clifton
Goalie L. Ullmark Bruins' Linus Ullmark: Staring in Winter Classic Linus Ullmark Bruins' Linus Ullmark: Staring in Winter Classic J. Swayman Bruins' Jeremy Swayman: Falls short in OT Jeremy Swayman Bruins' Jeremy Swayman: Falls short in OT

Edit: This is the desired output

POS Starter Second Third
Center Patrice Bergeron David Krejci Charlie Coyle Tomas Nosek Matt Filipe
Left Wing Brad Marchand Pavel Zacha Taylor Hall Nick Foligno A.J. Greer
Right Wing Jake DeBrusk David Pastrnak Trent Frederic Craig Smith
Left Defenseman Hampus Lindholm Matt Grzelcyk Derek Forbort Jakub Zboril
Right Defenseman Charlie McAvoy Brandon Carlo Connor Clifton
Goalie Linus Ullmark Jeremy Swayman
user1389739
  • Does this answer your question? [Delete a column from a Pandas DataFrame](https://stackoverflow.com/questions/13411544/delete-a-column-from-a-pandas-dataframe) and/or https://stackoverflow.com/questions/43643506/select-columns-based-on-columns-names-containing-a-specific-string-in-pandas – JonSG Jan 02 '23 at 17:24
  • @JonSG Definitely not the correct duplicate; the DF is created with the correct columns but there is superfluous information *within* the cells themselves that the OP is looking to remove. – esqew Jan 02 '23 at 17:26
  • I’m on mobile, so I can't formulate a complete answer for a bit, but this should be possible by modifying the loaded HTML document up front with the BS4 instance you already have – esqew Jan 02 '23 at 17:30
  • @esqew "player short name" and "player news" and "player name - long" all look like column names to me. None appearing in the given output so it looks to me like the OP wants to drop a column or perhaps select a column. – JonSG Jan 02 '23 at 17:31
  • I have edited the initial question to include the desired output as it looks like it was confusing people. I do not want to drop a column, I want to eliminate the superfluous information as esqew mentions – user1389739 Jan 02 '23 at 17:40

1 Answer

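Each <td> cell contains text that is not displayed in the browser because it is hidden with CSS (which parts are hidden seems to vary by device, from my observation). pd.read_html does not apply the page's stylesheets, so it reads all of that text. The expected result can be obtained by removing the extra elements from the table before parsing it.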
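To illustrate why the hidden text shows up, here is a minimal sketch on a hand-written cell. The markup is an assumption modeled on the class names from the question, not a copy of the real page, which has more nesting.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified cell markup; the real page's structure may differ.
cell_html = """
<td>
  <span class="CellPlayerName--short">P. Bergeron</span>
  <span class="CellPlayerName--long">Patrice Bergeron</span>
</td>
"""

cell = BeautifulSoup(cell_html, "html.parser").td

# An HTML parser knows nothing about stylesheets, so it returns the text
# of both spans, whichever one a browser would actually show.
all_text = cell.get_text(" ", strip=True)
print(all_text)

# Selecting by class picks out only the long form of the name.
long_name = cell.find("span", class_="CellPlayerName--long").get_text(strip=True)
print(long_name)
```

The first print shows both name variants joined together, which is the same effect seen in the question's table. The second shows only the long name.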

The full code to clean the table could look like this:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.cbssports.com/nhl/teams/BOS/boston-bruins/depth-chart"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html.parser')

# Remove the <colgroup> element, if the page has one
colgroup = soup.find('colgroup')
if colgroup is not None:
    colgroup.decompose()

# Remove the 'CellPlayerName--short' spans, because the abbreviated
# names are not needed in the output
for s in soup.find_all('span', class_='CellPlayerName--short'):
    s.decompose()

# Also remove the 'CellPlayerName-icon' spans (this appears to be where
# the player news text lives)
for i in soup.find_all('span', class_='CellPlayerName-icon'):
    i.decompose()

# You can also add code here to remove a cell if it contains only ' - '

# Wrap the markup in StringIO: recent pandas versions deprecate passing
# a literal HTML string to read_html
df = pd.read_html(StringIO(str(soup.find_all('table'))))
print(df[0])

Output:

POS Starter Second Third+
Center Patrice Bergeron David Krejci Charlie CoyleTomas NosekMatt Filipe
Left Wing Brad Marchand Pavel Zacha Taylor HallNick FolignoA.J. Greer
Right Wing Jake DeBrusk David Pastrnak Trent FredericCraig Smith
Left Defenseman Hampus Lindholm Matt Grzelcyk Derek ForbortJakub Zboril
Right Defenseman Charlie McAvoy Brandon Carlo Connor Clifton
Goalie Linus Ullmark Jeremy Swayman
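
In the output above, names that share a cell run together (for example "Charlie CoyleTomas NosekMatt Filipe"), because nothing separates the adjacent spans. One possible way around this is to skip pd.read_html and build the rows by hand, joining the long names in each cell with a separator. The sketch below runs on a hand-written sample table: the markup and the long_names helper are assumptions for illustration, and the real page may need different selectors.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modeled on the class names in the question.
html = """
<table>
  <tr><th>POS</th><th>Starter</th><th>Second</th><th>Third+</th></tr>
  <tr>
    <td>Center</td>
    <td>
      <span class="CellPlayerName--short">P. Bergeron</span>
      <span class="CellPlayerName--long">Patrice Bergeron
        <span class="CellPlayerName-icon">Bruins' Patrice Bergeron: Pots winner in NJ</span>
      </span>
    </td>
    <td>
      <span class="CellPlayerName--short">D. Krejci</span>
      <span class="CellPlayerName--long">David Krejci</span>
    </td>
    <td>
      <span class="CellPlayerName--short">C. Coyle</span>
      <span class="CellPlayerName--long">Charlie Coyle</span>
      <span class="CellPlayerName--short">T. Nosek</span>
      <span class="CellPlayerName--long">Tomas Nosek</span>
    </td>
  </tr>
</table>
"""


def long_names(cell, sep=", "):
    """Return the long-form player names found in one cell, joined by sep."""
    # Drop any news/icon text nested inside the name spans first.
    for icon in cell.find_all("span", class_="CellPlayerName-icon"):
        icon.decompose()
    spans = cell.find_all("span", class_="CellPlayerName--long")
    return sep.join(span.get_text(strip=True) for span in spans)


soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

header = [th.get_text(strip=True) for th in table.find_all("th")]
rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:
        continue  # the header row has only <th> cells
    # The first cell is the position label; the rest hold player names.
    rows.append([cells[0].get_text(strip=True)] + [long_names(td) for td in cells[1:]])

df = pd.DataFrame(rows, columns=header)
print(df)
```

Because each cell is read span by span, the separator can be anything convenient, and cells without any long-name span simply become empty strings.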

Edit 1:

Added removal of the elements with class CellPlayerName-icon.