How do I web scrape a table with BeautifulSoup?

Question

This might not be the smartest question, but I've spent about an hour researching and trying to figure this out, and I've ended up with nothing. As a last resort, I am posting my problem here.

The website I am using is https://en.wikipedia.org/wiki/List_of_Super_Bowl_halftime_shows and I want to scrape the tables listed under history.

When I inspect the page, I see that each table sits under an anchor tag with a specific title.

I do not mind scraping each table individually/manually, but no matter how I try to navigate to a table through its anchor and title, my BeautifulSoup object (bs) does not contain any of the table's contents.

I'm guessing the href attribute is used to display the table, so my question is: how can I scrape the contents of a webpage that pulls them in from another link that I do not have access to?

I see it is referencing a link to display the table. Maybe I'm wrong, but I don't see how I can access the table, because the link used on Wikipedia seems to be some local path/link for the user that posted it. – swordlordswamplord Dec 27 '21 at 04:29
Yes, I am using pandas for this project. You can assume I know the basics of data science with Python. – swordlordswamplord Dec 27 '21 at 04:32
This is actually a "project" on the DataCamp website, but the problem is that it just hands you all the libraries/data. I want to do it all from scratch, because I expect that is how it will be done in a job setting or when I explore something on my own. So I am trying to do the project myself instead of having my hand held all the way, because otherwise I'll learn nothing. – swordlordswamplord Dec 27 '21 at 04:34

score 0 · Answer 1 · answered Dec 27 '21 at 04:35

Since you are using pandas, you can use read_html() to get every table on the page as a list of DataFrames, then pick a specific table by index.

import pandas as pd

# read_html() returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_Super_Bowl_halftime_shows")
print(tables[0].to_string())  # <-- Access the first table
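
Since the title asks about BeautifulSoup specifically: the tables are part of the page's static HTML, and the anchors inside the cells are ordinary links, not something that loads the table. Below is a minimal sketch of pulling rows out of a table with BeautifulSoup. It runs on a small inline stand-in for the page source (`SAMPLE_HTML`, with placeholder values) so it works offline, and it assumes the tables carry the `wikitable` class, as is usual on Wikipedia. The helper name `table_to_rows` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source (placeholder values).
SAMPLE_HTML = """
<h2>History</h2>
<table class="wikitable">
  <tr><th>Super Bowl</th><th>Performer</th></tr>
  <tr><td><a href="/wiki/Super_Bowl_I" title="Super Bowl I">I</a></td><td>Performer A</td></tr>
  <tr><td><a href="/wiki/Super_Bowl_II" title="Super Bowl II">II</a></td><td>Performer B</td></tr>
</table>
"""


def table_to_rows(table):
    """Return the table as a list of rows, each row a list of cell texts."""
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (<th>) and data cells (<td>) are both collected.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(" ", strip=True) for cell in cells])
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Assumption: the tables of interest use the "wikitable" class.
tables = soup.find_all("table", class_="wikitable")
rows = table_to_rows(tables[0])

for row in rows:
    print(row)
```

On the real page you would pass the downloaded HTML text to `BeautifulSoup(...)` instead of `SAMPLE_HTML`. The anchor text inside each cell comes through `get_text()`, so the links do not need to be followed.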

answered Dec 27 '21 at 04:35

MendelG


TypeError: cannot parse from 'Response' – swordlordswamplord Dec 27 '21 at 04:36
How are you using `read_html`? Pass it a URL, as I do. – MendelG Dec 27 '21 at 04:37
URLError: – swordlordswamplord Dec 27 '21 at 04:37
The first error was my mistake; I did not use the URL. But I literally copied and pasted your suggestion and I get this error. – swordlordswamplord Dec 27 '21 at 04:38
Don't use `urlopen`/`requests`; use `pandas` directly. – MendelG Dec 27 '21 at 04:38
Yes, I used pandas directly and I get this URL error. – swordlordswamplord Dec 27 '21 at 04:40
@GirthyLampost see https://stackoverflow.com/questions/44629631/while-using-pandas-got-error-urlopen-error-ssl-certificate-verify-failed-cert – MendelG Dec 27 '21 at 04:41
Okay, I imported ssl and made the context, but how does that help? – swordlordswamplord Dec 27 '21 at 04:46
This is the first time I've seen ssl. I'm looking at its documentation, but I don't know how it applies to my error. – swordlordswamplord Dec 27 '21 at 04:46
@GirthyLampost isn't that the error you received? – MendelG Dec 27 '21 at 05:17
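
On the two errors in this thread, a hedged sketch rather than a confirmed fix. `read_html()` expects a URL, a file-like object, or HTML text, not a `Response` object, which is the likely cause of the TypeError. If the URLError is a certificate failure, as in the linked question, an SSL context only helps when it is actually used for the download. One way to do that is to fetch the page yourself with an explicit context and hand the HTML text to pandas. The names `fetch_html`, `tables_from_html`, and `SAMPLE_HTML`, and the User-Agent string, are hypothetical. The parsing step runs on inline placeholder HTML so the example works offline.

```python
import io
import ssl
import urllib.request

import pandas as pd

# Hypothetical stand-in for a downloaded page (placeholder values).
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Super Bowl</th><th>Performer</th></tr>
  </thead>
  <tbody>
    <tr><td>I</td><td>Performer A</td></tr>
    <tr><td>II</td><td>Performer B</td></tr>
  </tbody>
</table>
"""


def fetch_html(url, context=None):
    """Download a page and return its HTML as text (needs network access)."""
    # The context is passed to urlopen explicitly; merely creating one
    # elsewhere in the script has no effect on the download.
    context = context or ssl.create_default_context()
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "table-scraper-example/0.1"},  # assumed value
    )
    with urllib.request.urlopen(request, context=context) as response:
        return response.read().decode("utf-8")


def tables_from_html(html):
    """Parse every <table> in an HTML string into a list of DataFrames."""
    # Wrapping the text in StringIO gives read_html a file-like object.
    return pd.read_html(io.StringIO(html))


tables = tables_from_html(SAMPLE_HTML)
print(tables[0].to_string())
```

With network access, `tables_from_html(fetch_html(url))` would take the place of the `SAMPLE_HTML` call. If certificate verification still fails, the usual cause is a missing or outdated certificate store on the machine, and that is better fixed than switched off.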
