
I am trying to use Python 3 to scrape a chart from this website into a .csv file: 2016 NBA National TV Schedule

The chart starts out like:

Tuesday, October 25
8:00 PM Knicks/Cavaliers TNT
10:30 PM Spurs/Warriors TNT
Wednesday, October 26
8:00 PM Thunder/Sixers ESPN
10:30 PM Rockets/Lakers ESPN

I am using these packages:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

The output I want in a .csv file looks like this:

[image: screenshot of the desired .csv output]

These are the first six lines of the chart on the website, written to the .csv file. Notice that the same date appears on more than one row. How do I implement the scraper to get this output?

Nick
  • You will need a two-level parser: the outer level is a simple split, and the inner level is a straightforward regex. First-level lines start with a letter; second-level lines start with a digit. – PM 77-1 May 21 '20 at 19:17
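The two-level idea in the comment above can be sketched as follows. This is only a minimal illustration that runs on the six sample lines from the question rather than on the live page; the names `GAME` and `parse_schedule` are made up for this sketch, and the regex assumes every game line has the shape "time team/team network".

```python
import re

# Sample lines copied from the question; the real page has many more.
lines = [
    "Tuesday, October 25",
    "8:00 PM Knicks/Cavaliers TNT",
    "10:30 PM Spurs/Warriors TNT",
    "Wednesday, October 26",
    "8:00 PM Thunder/Sixers ESPN",
    "10:30 PM Rockets/Lakers ESPN",
]

# Inner level (assumed line shape): time, two teams separated by "/", network.
GAME = re.compile(r"([\d:]+ [AP]M) (.+?)/(.+?) (\S.*)")


def parse_schedule(lines):
    """Return (date, time, team1, team2, network) tuples."""
    rows, current_date = [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line[0].isalpha():
            # Outer level: a line starting with a letter is a date heading.
            current_date = line
        elif line[0].isdigit():
            # Inner level: a line starting with a digit is a game.
            match = GAME.fullmatch(line)
            if match:  # silently skip lines that do not fit the pattern
                rows.append((current_date, *match.groups()))
    return rows


rows = parse_schedule(lines)
for row in rows:
    print(row)
```

Because the current date is remembered between lines, it is repeated on every game row, which is the layout the question asks for.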

1 Answer

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from itertools import groupby

url = 'https://fansided.com/2016/08/11/nba-schedule-2016-national-tv-games/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

days = 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'
# Take the first <p> that contains <br> tags, join its text pieces with '|',
# and split on '|' to get one string per schedule line.
data = soup.select_one('.article-content p:has(br)').get_text(strip=True, separator='|').split('|')

# groupby yields consecutive runs of lines: v is True for a date line,
# False for the run of game lines that follows it.
dates, last = {}, ''
for v, g in groupby(data, lambda k: any(d in k for d in days)):
    if v:
        last = [*g][0]
        dates[last] = []
    else:
        # Each game line becomes a (time, team1, team2, network) tuple.
        dates[last].extend([re.findall(r'([\d:]+ [AP]M) (.*?)/(.*?) (.*)', d)[0] for d in g])

# Flatten to one row per game, repeating the date on every row.
all_data = {'Date':[], 'Time': [], 'Team 1': [], 'Team 2': [], 'Network': []}
for k, v in dates.items():
    for time, team1, team2, network in v:
        all_data['Date'].append(k)
        all_data['Time'].append(time)
        all_data['Team 1'].append(team1)
        all_data['Team 2'].append(team2)
        all_data['Network'].append(network)

df = pd.DataFrame(all_data)
print(df)

# to_csv also writes the DataFrame index as the first column;
# pass index=False to leave it out.
df.to_csv('data.csv')

Prints:

                      Date      Time    Team 1     Team 2 Network
0      Tuesday, October 25   8:00 PM    Knicks  Cavaliers     TNT
1      Tuesday, October 25  10:30 PM     Spurs   Warriors     TNT
2    Wednesday, October 26   8:00 PM   Thunder     Sixers    ESPN
3    Wednesday, October 26  10:30 PM   Rockets     Lakers    ESPN
4     Thursday, October 27   8:00 PM   Celtics      Bulls     TNT
..                     ...       ...       ...        ...     ...
159      Saturday, April 8   8:30 PM  Clippers      Spurs     ABC
160       Monday, April 10   8:00 PM   Wizards    Pistons     TNT
161       Monday, April 10  10:30 PM   Rockets   Clippers     TNT
162    Wednesday, April 12   8:00 PM     Hawks     Pacers    ESPN
163    Wednesday, April 12  10:30 PM  Pelicans    Blazers    ESPN

[164 rows x 5 columns]

And saves data.csv (screenshot from LibreOffice):

[image: screenshot of data.csv opened in LibreOffice]
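To see how the `groupby` step splits the lines without fetching the page, here is a small offline sketch on the six sample lines from the question. The helper `is_date`, the `sample` list, and the two-column layout are assumptions made for illustration only; the sketch uses the standard-library `csv` module instead of pandas so that it has no third-party dependencies.

```python
import csv
import io
from itertools import groupby

days = ("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")

# Sample lines copied from the question.
sample = [
    "Tuesday, October 25",
    "8:00 PM Knicks/Cavaliers TNT",
    "10:30 PM Spurs/Warriors TNT",
    "Wednesday, October 26",
    "8:00 PM Thunder/Sixers ESPN",
    "10:30 PM Rockets/Lakers ESPN",
]


def is_date(line):
    """True if the line mentions a weekday name, i.e. it is a date heading."""
    return any(day in line for day in days)


# groupby collects consecutive lines with the same key into one run,
# so the result alternates between a date run and a run of games.
runs = [(key, list(group)) for key, group in groupby(sample, is_date)]

# Write one CSV row per game, repeating the most recent date.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Date", "Game"])
current = None
for key, group in runs:
    if key:
        current = group[0]
    else:
        for game in group:
            writer.writerow([current, game])

csv_text = buffer.getvalue()
print(csv_text)
```

Note that the dates contain a comma, so the writer puts them in quotes; a spreadsheet program then reads each date as a single cell rather than two.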

Andrej Kesely