Where am I going wrong with this scraping?

Question

It should be really simple, but I'm struggling to pull out each row from this NCAA table (e.g. 'Florida State, ACC, 22-1-2', etc.).

I guess my main question here is: where do I start? What am I looking for? Do I search for the 'div' tag, the 'tbody' tag, or the 'tr' tag? Whichever one I try with find_all, find, or even select with a CSS selector returns nothing.

https://www.ncaa.com/rankings/soccer-women/d1/ncaa-womens-soccer-rpi

Edit: Managed to get it, see below:

from bs4 import BeautifulSoup
import requests
import csv

url = 'https://www.ncaa.com/rankings/soccer-women/d1/ncaa-womens-soccer-rpi'

result = requests.get(url)

soup = BeautifulSoup(result.text,'html.parser')

check = soup.find_all('tr')

names_lst = []
conference_lst = []
record_lst = []


for info in check[1:]:
    details = info.find_all('td')
    names = details[1].text.strip()
    conference = details[2].text.strip()
    record = details[3].text.strip()

    names_lst.append(names)
    conference_lst.append(conference)
    record_lst.append(record)

print(names_lst)
print(conference_lst)
print(record_lst)

# newline='' stops the csv module from writing blank lines between rows on Windows
with open('ncaa_rankings.csv', 'w', newline='') as ncaa_file:
    csv_writer = csv.writer(ncaa_file)
    for names, conference, record in zip(names_lst, conference_lst, record_lst):
        csv_writer.writerow([names, conference, record])
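
On the "where do I start" part: the 'tr' tags are the rows and the 'td' tags inside them are the cells. A minimal sketch of that idea follows, run against a small inline sample rather than the live page. SAMPLE_HTML and extract_rows are hypothetical names made up for illustration, and the sample keeps only four of the table's columns, so the real markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's table; the real page has more columns.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Rank</th><th>School</th><th>Conference</th><th>Record</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Florida St.</td><td>ACC</td><td>22-1-2</td></tr>
    <tr><td>2</td><td>Duke</td><td>ACC</td><td>16-4-1</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return one list of cell strings per table row that contains <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row only has <th> cells, so it is skipped
            rows.append(cells)
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Keeping each row as a single list avoids the three parallel lists and the zip when writing the CSV, since csv_writer.writerows(rows) can take the result directly.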

Does this answer your question? [Python BeautifulSoup scrape tables](https://stackoverflow.com/questions/18966368/python-beautifulsoup-scrape-tables) — Jonathan, Aug 30 '22 at 23:23
@platipus_on_fire I've added my code - I eventually managed to get it through trial and error, let me know if there are better ways to go about it. Thanks! — Jean-Paul Azzopardi, Sep 04 '22 at 14:57

score 0 · Answer 1 · answered Sep 04 '22 at 15:23

This can be done in five lines with pandas, whose read_html parses each table on the page into a DataFrame:

import pandas as pd

url = "https://www.ncaa.com/rankings/soccer-women/d1/ncaa-womens-soccer-rpi"
df = pd.read_html(url)[0]
df.to_csv("w_soccer_rpi.csv")
print(df)

Result (also saved in a csv file):

     Rank  School              Conference      Record  Road    Neutral  Home    Non Div I
0    1     Florida St.         ACC             22-1-2  6-1-1   4-0-0    12-0-1  0-0-0
1    2     Duke                ACC             16-4-1  4-1-1   0-0-0    12-3-0  0-0-0
2    3     Arkansas            SEC             19-4-1  4-3-1   4-1-0    11-0-0  0-0-0
3    4     Rutgers             Big Ten         19-4-2  6-1-0   0-1-0    13-2-2  0-0-0
4    5     Michigan            Big Ten         18-4-3  5-3-2   1-0-0    12-1-1  0-0-0
...  ...   ...                 ...             ...     ...     ...      ...     ...
337  338   Nicholls            Southland       0-18-0  0-10-0  0-2-0    0-6-0   0-0-0
338  339   Delaware St.        DI Independent  2-11-1  1-6-0   0-0-0    1-5-1   1-0-0
339  340   Mississippi Val.    SWAC            0-13-0  0-7-0   0-1-0    0-5-0   0-0-0
340  341   Hampton             Big South       1-13-1  0-8-0   0-0-0    1-5-1   0-0-0
341  342   South Carolina St.  DI Independent  0-10-1  0-4-1   0-0-0    0-6-0   2-1-0

Relevant pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
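
If you already have the HTML as a string (for example from requests), read_html also accepts a file-like object. The sketch below assumes a small inline sample instead of the live page, so SAMPLE_HTML is a hypothetical stand-in with only four of the columns. read_html needs an HTML parser installed (lxml, or bs4 together with html5lib), and index=False leaves the 0, 1, 2, ... index column out of the CSV.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the page's table; the real page has more columns.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Rank</th><th>School</th><th>Conference</th><th>Record</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Florida St.</td><td>ACC</td><td>22-1-2</td></tr>
    <tr><td>2</td><td>Duke</td><td>ACC</td><td>16-4-1</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(SAMPLE_HTML))
df = tables[0]

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```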
