BeautifulSoup Table Scraping

Question

I am trying to scrape this site for the starting lineups. https://www.cbssports.com/nhl/teams/BOS/boston-bruins/depth-chart/

I am using the following code, but the table it prints contains information I do not want, such as the player short name and player news. I would like to extract only the CellPlayerName--long text, but I am unsure how to do that.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.cbssports.com/nhl/teams/BOS/boston-bruins/depth-chart"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html.parser')
df = pd.read_html(str(soup.find_all('table')))
df[0]

It prints the following:

POS	Starter	Second	Third+
Center	P. Bergeron Bruins' Patrice Bergeron: Pots winner in NJ Patrice Bergeron Bruins' Patrice Bergeron: Pots winner in NJ	D. KrejciDavid Krejci	C. CoyleCharlie CoyleT. Nosek Bruins' Tomas Nosek: Returning Monday Undisclosed: Expected to be out until at least Jan 2 Tomas Nosek Bruins' Tomas Nosek: Returning Monday Undisclosed: Expected to be out until at least Jan 2 M. Filipe Lower Body: IR. Expected to be out until at least Jan 29 Matt Filipe Lower Body: IR. Expected to be out until at least Jan 29
Left Wing	B. Marchand Bruins' Brad Marchand: Two points against Buffalo Brad Marchand Bruins' Brad Marchand: Two points against Buffalo	P. ZachaPavel Zacha	T. HallTaylor HallN. FolignoNick FolignoA. GreerA.J. Greer
Right Wing	J. DeBruskJake DeBrusk	D. Pastrnak Bruins' David Pastrnak: Another two-point performance David Pastrnak Bruins' David Pastrnak: Another two-point performance	T. Frederic Bruins' Trent Frederic: Scores goal Wednesday Trent Frederic Bruins' Trent Frederic: Scores goal Wednesday C. SmithCraig Smith
Left Defenseman	H. LindholmHampus Lindholm	M. GrzelcykMatt Grzelcyk	D. ForbortDerek ForbortJ. ZborilJakub Zboril
Right Defenseman	C. McAvoyCharlie McAvoy	B. CarloBrandon Carlo	C. CliftonConnor Clifton
Goalie	L. Ullmark Bruins' Linus Ullmark: Staring in Winter Classic Linus Ullmark Bruins' Linus Ullmark: Staring in Winter Classic	J. Swayman Bruins' Jeremy Swayman: Falls short in OT Jeremy Swayman Bruins' Jeremy Swayman: Falls short in OT	—

Edit: This is the desired output

POS	Starter	Second	Third
Center	Patrice Bergeron	David Krejci	Charlie Coyle Tomas Nosek Matt Filipe
Left Wing	Brad Marchand	Pavel Zacha	Taylor Hall Nick Foligno A.J. Greer
Right Wing	Jake DeBrusk	David Pastrnak	Trent Frederic Craig Smith
Left Defenseman	Hampus Lindholm	Matt Grzelcyk	Derek Forbort Jakub Zboril
Right Defenseman	Charlie McAvoy	Brandon Carlo	Connor Clifton
Goalie	Linus Ullmark	Jeremy Swayman

Does this answer your question? [Delete a column from a Pandas DataFrame](https://stackoverflow.com/questions/13411544/delete-a-column-from-a-pandas-dataframe) and/or https://stackoverflow.com/questions/43643506/select-columns-based-on-columns-names-containing-a-specific-string-in-pandas — JonSG, Jan 02 '23 at 17:24
@JonSG Definitely not the correct duplicate; the DF is created with the correct columns but there is superfluous information *within* the cells themselves that the OP is looking to remove. — esqew, Jan 02 '23 at 17:26
I’m on mobile so limited opportunity for me to formulate a complete answer for a bit but this should be possible to do by modifying the loaded HTML document up front with the BS4 instance you already have — esqew, Jan 02 '23 at 17:30
@esqew "player short name" and "player news" and "player name - long" all look like column names to me. None appearing in the given output so it looks to me like the OP wants to drop a column or perhaps select a column. — JonSG, Jan 02 '23 at 17:31
I have edited the initial question to include the desired output as it looks like it was confusing people. I do not want to drop a column, I want to eliminate the superfluous information as esqew mentions — user1389739, Jan 02 '23 at 17:40

Muhammad Khuzaima Umair · Accepted Answer · 2023-01-02T17:54:59.673

Each <td> cell contains extra text that is not displayed in the browser because it is hidden with CSS (which text is hidden seems to vary by device, from my observation). The expected result can be obtained by removing that extra markup from the table before passing it to pandas.
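To illustrate why the hidden text shows up, here is a minimal sketch on a made-up, simplified cell. The markup below is an assumption that only reuses the class names mentioned in the question; the real page's structure may differ. BeautifulSoup does not apply CSS, so every text node in the cell is returned unless you select by class.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified cell markup (an assumption, not copied from the site).
html = """
<table><tr><td>
  <span class="CellPlayerName--short"><a href="#">P. Bergeron</a></span>
  <span class="CellPlayerName--long"><a href="#">Patrice Bergeron</a></span>
</td></tr></table>
"""

cell = BeautifulSoup(html, "html.parser").find("td")

# get_text() returns every text node, whether or not CSS would display it.
all_text = cell.get_text(" ", strip=True)

# Selecting the span by class returns only the long-form name.
long_name = cell.find("span", class_="CellPlayerName--long").get_text(strip=True)

print(all_text)
print(long_name)
```

pd.read_html behaves like the first call: it sees all the text in the cell, which is why the unwanted spans have to be removed (or the wanted ones selected) beforehand.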

The code should be like this:

import io

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.cbssports.com/nhl/teams/BOS/boston-bruins/depth-chart"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html.parser')

# Remove the colgroup element, if the page has one
colgroup = soup.find('colgroup')
if colgroup is not None:
    colgroup.decompose()

# Remove the 'CellPlayerName--short' spans; the short names are not needed in the output
for s in soup.find_all('span', class_='CellPlayerName--short'):
    s.decompose()

# Also remove the 'CellPlayerName-icon' spans and their text
for i in soup.find_all('span', class_='CellPlayerName-icon'):
    i.decompose()

# You can also add code here to clear cells that contain only a dash placeholder

# Newer pandas versions deprecate passing literal HTML, so wrap the string in StringIO
df = pd.read_html(io.StringIO(str(soup.find_all('table'))))
print(df[0])

Output:

POS	Starter	Second	Third+
Center	Patrice Bergeron	David Krejci	Charlie CoyleTomas NosekMatt Filipe
Left Wing	Brad Marchand	Pavel Zacha	Taylor HallNick FolignoA.J. Greer
Right Wing	Jake DeBrusk	David Pastrnak	Trent FredericCraig Smith
Left Defenseman	Hampus Lindholm	Matt Grzelcyk	Derek ForbortJakub Zboril
Right Defenseman	Charlie McAvoy	Brandon Carlo	Connor Clifton
Goalie	Linus Ullmark	Jeremy Swayman	—
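In the output above, multiple names in one cell run together ("Charlie CoyleTomas Nosek") and the dash placeholder is still present, whereas the desired output separates names with a space and leaves that cell empty. One possible alternative, sketched here under assumptions, is to skip pd.read_html and build the DataFrame directly from the CellPlayerName--long spans. The sample markup and the cell_text helper are hypothetical and only reuse the class names from the question; they are not taken from the real page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, simplified stand-in for the depth-chart table (an assumption).
html = """
<table>
  <tr><th>POS</th><th>Starter</th><th>Second</th><th>Third+</th></tr>
  <tr>
    <td>Center</td>
    <td><span class="CellPlayerName--short">P. Bergeron</span>
        <span class="CellPlayerName--long">Patrice Bergeron</span></td>
    <td><span class="CellPlayerName--short">D. Krejci</span>
        <span class="CellPlayerName--long">David Krejci</span></td>
    <td><span class="CellPlayerName--short">C. Coyle</span>
        <span class="CellPlayerName--long">Charlie Coyle</span>
        <span class="CellPlayerName--short">T. Nosek</span>
        <span class="CellPlayerName--long">Tomas Nosek</span></td>
  </tr>
  <tr>
    <td>Goalie</td>
    <td><span class="CellPlayerName--long">Linus Ullmark</span></td>
    <td><span class="CellPlayerName--long">Jeremy Swayman</span></td>
    <td>&mdash;</td>
  </tr>
</table>
"""


def cell_text(cell):
    """Hypothetical helper: join long-form names with a space,
    otherwise fall back to the cell's own text (empty for a lone dash)."""
    spans = cell.find_all("span", class_="CellPlayerName--long")
    if spans:
        return " ".join(s.get_text(strip=True) for s in spans)
    text = cell.get_text(strip=True)
    return "" if text == "\u2014" else text


soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

# The first row holds the column names; the remaining rows hold the data.
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
body = [[cell_text(td) for td in tr.find_all("td")] for tr in rows[1:]]

df = pd.DataFrame(body, columns=header)
print(df)
```

For the live page, the same loop could be run on the soup built from requests.get(url).text instead of the sample string, provided the table there uses plain th and td cells as assumed here.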

Edit 1:

Included the removal of the text with class CellPlayerName-icon.
