Saving / Editing Scraped URLs to Directory

Question

I have successfully scraped links from a website and want to save them to an existing local folder called "HerHoops" for parsing later. I have done this successfully in the past, but this website's links need a little more cleaning up.

Here is my code so far. I want to keep everything after "box_score" in the link so that the saved filename includes the date and the teams playing. The files should also be saved in write mode ("w+").

url = f"https://herhoopstats.com/stats/wnba/schedule_date/2004/6/1/"
data = requests.get(url)
soup = BeautifulSoup(data.text)
matchup_table = soup.find_all("div", {"class": "schedule"})[0]

links = matchup_table.find_all('a')
links = [l.get("href") for l in links]
links = [l for l in links if '/box_score/' in l]

box_scores_urls = [f"https://herhoopstats.com{l}" for l in links]

for box_scores_url in box_scores_urls:
    data = requests.get(box_scores_url)
    # within the loop, open each page and save it to the folder in write mode
    with open("HerHoops/{}".format(box_scores_url[46:]), "w+") as f:
        # write to the file
        f.write(data.text)
    time.sleep(3)

The error is

FileNotFoundError: [Errno 2] No such file or directory: 'HerHoops/2004/06/01/new-york-liberty-vs-charlotte-sting/'

Does this answer your question? [open() gives FileNotFoundError / IOError: '\[Errno 2\] No such file or directory'](https://stackoverflow.com/questions/12201928/open-gives-filenotfounderror-ioerror-errno-2-no-such-file-or-directory) — HedgeHog, May 04 '23 at 09:58

Abhay Chaudhary · Accepted Answer · 2023-05-04T13:43:38.580


From the error itself, it's clear that you are trying to write to 'HerHoops/2004/06/01/new-york-liberty-vs-charlotte-sting/', but part of that directory path does not exist; open() creates the file, not its missing parent folders. The path also ends in a slash, so it names a directory rather than a file. We can create the necessary directories with os.makedirs() before writing to the file.
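To see the behaviour in isolation, here is a minimal sketch that runs in a temporary directory. The nested path and the file name game.html are made up for illustration; only open() and os.makedirs() matter here.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # nested path whose parent folders do not exist yet
    target = os.path.join(base, "HerHoops", "2004", "06", "01", "game.html")

    try:
        with open(target, "w") as f:
            f.write("<html></html>")
        failed = False
    except FileNotFoundError:
        failed = True  # open() does not create missing parent folders

    # create every missing parent folder, then retry the write
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w") as f:
        f.write("<html></html>")
    created = os.path.exists(target)

print(failed, created)
```

The first open() fails because the folders are missing; after os.makedirs() the same call succeeds.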

Full code

import os
import time
import requests
from bs4 import BeautifulSoup
import re
from datetime import datetime

url = "https://herhoopstats.com/stats/wnba/schedule_date/2004/6/1/"
data = requests.get(url)
# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(data.text, "html.parser")
matchup_table = soup.find_all("div", {"class": "schedule"})[0]

links = matchup_table.find_all('a')
links = [l.get("href") for l in links]
links = [l for l in links if '/box_score/' in l]

box_scores_urls = [f"https://herhoopstats.com{l}" for l in links]

# create the output directory once, if it doesn't already exist
directory = "HerHoops/"
os.makedirs(directory, exist_ok=True)

# the date comes from the schedule URL (.../2004/6/1/), so it is the same for every game
date_str = datetime.strptime(re.sub(r'\D', '', url), "%Y%m%d").strftime("%Y-%m-%d")

for box_scores_url in box_scores_urls:
    data = requests.get(box_scores_url)
    # drop the slashes, find the last run of digits (the date), and keep what follows: the team names
    flat_url = box_scores_url.replace('/', '')
    match = re.search(r'\d+(?!.*\d)', flat_url)
    teams_str = flat_url[match.end():]
    # save the page as HerHoops/<date>-<teams>.html
    with open(f"{directory}{date_str}-{teams_str}.html", "w+") as f:
        f.write(data.text)
    # pause between requests to be polite to the site
    time.sleep(3)
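
If you later loop over a whole season, a variant that builds the name from the box score URL itself may be sturdier. Stripping non-digits from the schedule URL is ambiguous for unpadded dates: 2004/1/11 and 2004/11/1 both collapse to 2004111. The sketch below assumes every box score URL ends in /YYYY/MM/DD/teams/, as the one in your error message does; box_score_filename is a hypothetical helper name, not something from requests or bs4.

```python
from pathlib import Path
from urllib.parse import urlsplit


def box_score_filename(box_score_url: str) -> str:
    """Turn .../box_score/YYYY/MM/DD/<teams>/ into 'YYYY-MM-DD-<teams>.html'."""
    # split the URL path on "/" and drop empty pieces (leading/trailing slashes)
    parts = [p for p in urlsplit(box_score_url).path.split("/") if p]
    # assumption: the last four segments are year, month, day, teams
    year, month, day, teams = parts[-4:]
    return f"{year}-{month}-{day}-{teams}.html"


example = "https://herhoopstats.com/stats/wnba/box_score/2004/06/01/new-york-liberty-vs-charlotte-sting/"
target = Path("HerHoops") / box_score_filename(example)
print(target.name)  # 2004-06-01-new-york-liberty-vs-charlotte-sting.html
```

Inside the loop you would then open Path("HerHoops") / box_score_filename(box_scores_url) instead of assembling date_str and teams_str by hand. The name contains no slashes, so everything lands directly in HerHoops.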

edited May 04 '23 at 13:43

answered May 04 '23 at 10:16

Abhay Chaudhary


Thank you for this as it has gotten me closer to the desired outcome. However, this has now created new folders within this new folder titled that then require a further step to reach a file called index in another newly created folder. Also, I would like to keep the directory name "HerHoops" as this specific date is a test before creating a larger loop through the whole season. – kc_balr May 04 '23 at 12:31
So you want the file path to be "HerHoops/2004-06-01/houston-comets-vs-phoenix-mercury/index.html"? – Abhay Chaudhary May 04 '23 at 12:57
Looking to make a path to the folder "HerHoops" with filenames that include the exact date and the teams involved i.e. (2004-06-01/houston-comets-vs-phoenix-mercury.html). Apologies as I have been unclear. Appreciate the patience and help. – kc_balr May 04 '23 at 13:10
Try the updated code; the path should now be "HerHoops/2004-06-01/houston-comets-vs-phoenix-mercury.html". – Abhay Chaudhary May 04 '23 at 13:22
It seems like it still created a nested folder within HerHoops. In my explorer, it shows up as >HerHoops\2004-06-01 then the html on the dropdown. Instead of just a folder >HerHoops with htmls. – kc_balr May 04 '23 at 13:32
Try now; the path should be "HerHoops/2004-06-01-houston-comets-vs-phoenix-mercury.html", with only a single folder. – Abhay Chaudhary May 04 '23 at 13:44
Ok this latest edit works and just saves under the folder HerHoops which is the goal! Thank you very much. – kc_balr May 05 '23 at 07:10
