I am trying to automate the extraction of stock price data from https://www.nseindia.com/. The data is stored as a zip file, and the URL of the zip file varies by date. If the stock market is closed on a given date (e.g. weekends and holidays), there is no file/URL for that date.
I want to identify invalid links (links that don't exist) and skip to the next link.
This is a valid link -
path = 'https://archives.nseindia.com/content/historical/EQUITIES/2021/MAY/cm05MAY2021bhav.csv.zip'
This is an invalid link - (as 1 May 2021 falls on a weekend and the stock market is closed for the day)
path2 = 'https://archives.nseindia.com/content/historical/EQUITIES/2021/MAY/cm01MAY2021bhav.csv.zip'
This is what I do to extract the data:
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile
import pandas as pd
import datetime
start_date = datetime.date(2021, 5, 3)
end_date = datetime.date(2021, 5, 7)
delta = datetime.timedelta(days=1)
final = pd.DataFrame()
while start_date <= end_date:
    print(start_date)
    day = start_date.strftime('%d')
    month = start_date.strftime('%b').upper()
    year = start_date.strftime('%Y')
    start_date += delta
    path = 'https://archives.nseindia.com/content/historical/EQUITIES/' + year + '/' + month + '/cm' + day + month + year + 'bhav.csv.zip'
    file = 'cm' + day + month + year + 'bhav.csv'
    try:
        with urlopen(path) as f:
            with BytesIO(f.read()) as b, ZipFile(b) as myzipfile:
                foofile = myzipfile.open(file)
                df = pd.read_csv(foofile)
                final = pd.concat([final, df])  # reassign: concatenation does not happen in place
    except Exception:
        print(file + ' not there')
If the path is invalid, Python gets stuck and I have to restart it. I am not able to handle the error or identify the invalid link while looping over multiple dates.
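A minimal sketch of the behaviour I am after, assuming a timeout on urlopen would make a dead link raise an exception instead of hanging (the 10-second value is an arbitrary guess, not something NSE documents):

from urllib.request import urlopen
from urllib.error import HTTPError

# Assumption: a missing file either returns an HTTP error (e.g. 404) or never
# responds; the timeout may need tuning.
try:
    with urlopen(path, timeout=10) as f:
        data = f.read()
except HTTPError as e:   # the file for this date does not exist
    print(path + ' returned HTTP ' + str(e.code))
except OSError as e:     # covers URLError, refused connections and timeouts
    print(path + ' failed: ' + str(e))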
What I have tried so far to differentiate between valid and invalid links -
# Attempt 1
import os
os.path.exists(path)
os.path.isfile(path)
os.path.isdir(path)
os.path.islink(path)
# output is False for both path and path2 (os.path checks the local filesystem, not remote URLs)
# Attempt 2
import validators
validators.url(path)
# output is True for both path and path2 (validators.url only checks the URL syntax, not whether the resource exists)
# Attempt 3
import requests
site_ping = requests.get(path)
site_ping.status_code < 400
# Output for path is True, but Python crashes/gets stuck when I run requests.get(path2) and I have to restart every time.
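A variant of attempt 3 that I would expect to fail fast rather than hang, assuming the missing file shows up as an HTTP error or as no response at all (the HEAD request, the timeout value, and the User-Agent header are all assumptions on my part):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # assumption: NSE may reject requests without a browser-like agent
try:
    resp = requests.head(path2, headers=headers, timeout=10, allow_redirects=True)
    link_ok = resp.status_code < 400
except requests.exceptions.RequestException:  # timeout, connection error, etc.
    link_ok = False
print(link_ok)  # expected False for the 1 May 2021 link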
Thanks in advance for your help.