
I have the following program to scrape data from a website. I want to improve the code below by using a generator with yield instead of calling generate_url and callme multiple times sequentially. The purpose of this exercise is to properly understand yield and the contexts in which it can be used.

import requests
import shutil

start_date = '03-03-1997'
end_date = '10-04-2015'
yf_base_url = 'http://real-chart.finance.yahoo.com/table.csv?s=%5E'
index_list = ['BSESN', 'NSEI']

def generate_url(index, start_date, end_date):
    s_day = start_date.split('-')[0] 
    s_month = start_date.split('-')[1]
    s_year = start_date.split('-')[2]

    e_day = end_date.split('-')[0] 
    e_month = end_date.split('-')[1]
    e_year = end_date.split('-')[2]
    if (index == 'BSESN') or (index == 'NSEI'):
        url = yf_base_url + index + '&a={}&b={}&c={}&d={}&e={}&f={}'.format(s_day,s_month,s_year,e_day,e_month,e_year)
        return url 

def callme(url, index):
    print('URL {}'.format(url))
    r = requests.get(url, verify=False, stream=True)
    if r.status_code != 200:
        print('Failure!!')
        exit()
    else:
        r.raw.decode_content = True
        with open(index + "file.csv", 'wb') as f:
            shutil.copyfileobj(r.raw, f)
        print('Success')

if __name__ == '__main__':
    url = generate_url(index_list[0], start_date, end_date)
    callme(url, index_list[0])
    url = generate_url(index_list[1], start_date, end_date)
    callme(url, index_list[1])
liv2hak

2 Answers

There are multiple options. You could use yield to iterate over URLs, or over request objects.

If your index_list were longer, I would suggest yielding URLs, because then you could use multiprocessing.Pool to map a function over those URLs that does the request and saves the output. That would execute the requests in parallel, potentially making the whole job a lot faster (assuming you have enough network bandwidth and that Yahoo Finance doesn't throttle connections).

import multiprocessing
import shutil

import requests

yf = ('http://real-chart.finance.yahoo.com/table.csv?s=%5E'
      '{}&a={}&b={}&c={}&d={}&e={}&f={}')
index_list = ['BSESN', 'NSEI']

def genurl(symbols, start_date, end_date):
    # assemble the URLs
    s_day, s_month, s_year = start_date.split('-')
    e_day, e_month, e_year = end_date.split('-')
    for s in symbols:
        url = yf.format(s, s_day,s_month,s_year,e_day,e_month,e_year)
        yield url

def download(url):
    # Do the request, save the file (the symbol is recovered from the URL
    # to name the output file, as in the question's callme)
    r = requests.get(url, verify=False, stream=True)
    if r.status_code == 200:
        r.raw.decode_content = True
        symbol = url.split('%5E')[1].split('&')[0]
        with open(symbol + 'file.csv', 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return r.status_code

if __name__ == '__main__':
    p = multiprocessing.Pool()
    rv = p.map(download, genurl(index_list, '03-03-1997', '10-04-2015'))
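One caveat: Pool.map consumes the whole generator up front before dispatching work, so the laziness only helps on the producing side; Pool.imap streams instead, yielding results in order as they complete. A minimal, network-free sketch of the same pattern, with a hypothetical shout worker standing in for download:

```python
import multiprocessing

def shout(word):
    # stand-in worker; in the real code this would be download(url)
    return word.upper()

def gen_words():
    # a generator, analogous to genurl() above
    for w in ('bsesn', 'nsei'):
        yield w

if __name__ == '__main__':
    with multiprocessing.Pool(2) as p:
        print(p.map(shout, gen_words()))   # -> ['BSESN', 'NSEI']
```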
Roland Smith

If I understand you correctly, what you want to know is how to change the code so that you can replace the last part with

if __name__ == '__main__':
    for index, url in generate_url(index_list, start_date, end_date):
        callme(url, index)

If this is correct, you need to change generate_url, but not callme. Changing generate_url is rather mechanical: make the first parameter index_list instead of index, wrap the function body in a for index in index_list loop, and change return url to yield index, url so the loop receives both the index and the URL.
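That transformation, sketched as a self-contained example (yf_base_url and the date format taken from the question; the index is yielded alongside the URL so callme still gets both):

```python
yf_base_url = 'http://real-chart.finance.yahoo.com/table.csv?s=%5E'

def generate_url(index_list, start_date, end_date):
    # split the dates once, outside the loop
    s_day, s_month, s_year = start_date.split('-')
    e_day, e_month, e_year = end_date.split('-')
    for index in index_list:
        # yield instead of return: one (index, url) pair per iteration
        yield index, (yf_base_url + index +
                      '&a={}&b={}&c={}&d={}&e={}&f={}'.format(
                          s_day, s_month, s_year, e_day, e_month, e_year))

for index, url in generate_url(['BSESN', 'NSEI'], '03-03-1997', '10-04-2015'):
    print(index, url)
```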

You don't need to change callme, because you would never want to write something like for result in callme(...). It isn't used as anything but a normal function call.
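To see the distinction, a toy example (names hypothetical): the producer is a generator and is consumed with a for loop, while the consumer stays a plain function that is simply called once per item.

```python
def produce(n):
    # generator: iterate over it with a for loop
    for i in range(n):
        yield i * i

def consume(value):
    # plain function: just call it, no yield needed
    return 'got {}'.format(value)

results = [consume(v) for v in produce(3)]
print(results)   # -> ['got 0', 'got 1', 'got 4']
```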

saulspatz