
I am trying to automate stock price data extraction from https://www.nseindia.com/. The data is stored as a zip file, and the URL for the zip file varies by date. If the stock market is closed on a given date (e.g. weekends and holidays), there is no file/URL for that day.

I want to identify invalid links (links that don't exist) and skip to the next link.

This is a valid link -
path = 'https://archives.nseindia.com/content/historical/EQUITIES/2021/MAY/cm05MAY2021bhav.csv.zip'

This is an invalid link - (as 1st May is a weekend and the stock market is closed for the day)
path2 = 'https://archives.nseindia.com/content/historical/EQUITIES/2021/MAY/cm01MAY2021bhav.csv.zip'

This is what I do to extract the data

from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile
import pandas as pd
import datetime

start_date = datetime.date(2021, 5, 3)
end_date = datetime.date(2021, 5, 7)
delta = datetime.timedelta(days=1)
final = pd.DataFrame()

while start_date <= end_date:
    print(start_date)
    day = start_date.strftime('%d')
    month = start_date.strftime('%b').upper()
    year = start_date.strftime('%Y')
    start_date += delta
    path = 'https://archives.nseindia.com/content/historical/EQUITIES/' + year + '/' + month + '/cm' + day + month + year + 'bhav.csv.zip'
    file = 'cm' + day + month + year + 'bhav.csv'
    try:
        with urlopen(path) as f:
            with BytesIO(f.read()) as b, ZipFile(b) as myzipfile:
                foofile = myzipfile.open(file)
                df = pd.read_csv(foofile)
                final = final.append(df)  # append returns a new DataFrame
    except Exception:
        print(file + ' not there')

If the path is invalid, Python gets stuck and I have to restart it. I am not able to handle the error or identify the invalid link while looping over multiple dates.

What I have tried so far to differentiate between valid and invalid links -

# Attempt 1
import os
os.path.exists(path)
os.path.isfile(path)
os.path.isdir(path)
os.path.islink(path)

# output is False for both path and path2

# Attempt 2
import validators
validators.url(path)

# output is True for both path and path2

# Attempt 3
import requests
site_ping = requests.get(path)
site_ping.status_code < 400

# Output for path is True, but Python crashes/gets stuck when I run requests.get(path2) and I have to restart every time.

Thanks for your help in advance.

Pratik

1 Answer


As suggested by SuperStormer, adding a timeout to the request solved the issue:

try:
    with urlopen(zipFileURL, timeout=5) as f:
        with BytesIO(f.read()) as b, ZipFile(b) as myzipfile:
            foofile = myzipfile.open(file)
            df = pd.read_csv(foofile)
            final = final.append(df)
except Exception:
    print(file + ' not there')
Pratik