
I am trying to automate the download of historic stock data using Python. The URL I am trying to open responds with a CSV file, but I am unable to open it using urllib2. I have tried changing the user agent as suggested in a few earlier questions, and I even tried accepting response cookies, with no luck. Can you please help?

Note: The same method works for Yahoo Finance.

Code:

import urllib2, cookielib

site = "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"
hdr = {'User-Agent': 'Mozilla/5.0'}

req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)

Error:

File "C:\Python27\lib\urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

Thanks for your assistance

– kumar

6 Answers


By adding a few more headers I was able to get the data:

import urllib2, cookielib

site = "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib2.Request(site, headers=hdr)

try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.fp.read()  # show the server's error body instead of crashing
else:
    content = page.read()
    print content

Actually, it works with just this one additional header:

'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
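
For illustration, a minimal sketch of that reduced version (whether these two headers alone suffice depends on the server; the URL is the one from the question):

import urllib2

site = "http://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/getHistoricalData.jsp?symbol=JPASSOCIAT&fromDate=1-JAN-2012&toDate=1-AUG-2012&datePeriod=unselected&hiddDwnld=true"

# Just User-Agent plus the Accept header that was reportedly enough here.
hdr = {'User-Agent': 'Mozilla/5.0',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}

print urllib2.urlopen(urllib2.Request(site, headers=hdr)).read()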
– andrean
  • Which of these headers do you think was missing from the original request? –  Nov 09 '12 at 07:20
  • Wireshark showed that only the User-Agent was sent, along with Connection: close, Host: www.nseindia.com, Accept-Encoding: identity – andrean Nov 09 '12 at 07:26
  • Andrean, thank you very much; it solved the issue. Unfortunate and funny that I tried all headers except 'Accept' before posting here. – kumar Nov 09 '12 at 11:32
  • You're welcome. What I really did was check the URL from your script in a browser, and as it worked there, I just copied all the request headers the browser sent and added them here, and that was the solution. – andrean Nov 09 '12 at 12:34
  • Thank you!! All of my requests were getting blocked by various forums, and this solved my problem. I think this should definitely be posted along with setting the User-Agent as a solution to the 403 error; this happened to me on numerous sites (I think most of them were running myBB). – araisbec Mar 06 '13 at 17:24
  • @andrean How can I do this in Python 3 with urllib? – UserYmY Jan 19 '15 at 21:04
  • @Mee did you take a look at the answer below? It addresses Python 3 specifically; check if it works for you... – andrean Jan 19 '15 at 21:07
  • @andrean I still get this error when I use the solution below. I am trying to get Google PageRank. raise HTTPError(req.full_url, code, msg, hdrs, fp) urllib.error.HTTPError: HTTP Error 403: Forbidden – UserYmY Jan 19 '15 at 21:24
  • Try adding the other headers (from my answer) to the request as well. Still, there are many other reasons why a server might return a 403; check out the other answers on the topic too. As for the target, Google especially is a tough one, kinda hard to scrape; they have implemented many methods to prevent scraping. – andrean Jan 20 '15 at 06:40
  • I was trying to download a different URL; for that, it worked after removing Connection: keep-alive. URL: https://www.nseindia.com/content/historical/EQUITIES/2017/FEB/cm08FEB2017bhav.csv.zip – Prabu Feb 08 '17 at 15:47
  • I just need the user-agent to replace my previous old one. – shaosh Oct 09 '19 at 03:18
  • The code works locally but not on an EC2 instance. Can you help me here? – neel Dec 29 '19 at 14:50
  • This worked in 2021 but now gets a 403 again. Outdated browser or something? – endolith Jan 28 '23 at 17:33
  • (Looks like the site I'm scraping is now behind Cloudflare, so I need https://pypi.org/project/cloudscraper/; see the sketch below.) – endolith Jan 28 '23 at 18:00
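
For anyone landing here with a Cloudflare-protected target like the comment above describes, a minimal sketch using the third-party cloudscraper package (pip install cloudscraper; the URL is a placeholder):

import cloudscraper

# create_scraper() returns a requests.Session-like object that attempts to
# pass Cloudflare's anti-bot challenge before issuing the request.
scraper = cloudscraper.create_scraper()
response = scraper.get("https://www.example.com/protected-page")  # placeholder URL
print(response.status_code)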

This will work in Python 3:

import urllib.request

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'

url = "http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers"
headers = {'User-Agent': user_agent}

request = urllib.request.Request(url, None, headers)  # the assembled request
response = urllib.request.urlopen(request)
data = response.read()  # the data you need
– Eish
  • It's true that some sites (including Wikipedia) block common non-browser user-agent strings, like the "Python-urllib/x.y" sent by Python's libraries. Even a plain "Mozilla" or "Opera" is usually enough to bypass that. This doesn't apply to the original question, of course, but it's still useful to know; see the sketch below. – efotinis Jul 28 '13 at 09:19
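
To illustrate that comment, a small sketch that prints the default User-Agent urllib would send and then overrides it (using the Wikipedia URL from the answer above):

import urllib.request

# The default headers carry a 'Python-urllib/x.y' User-Agent, which some
# sites reject outright.
opener = urllib.request.build_opener()
print(opener.addheaders)

# Even a generic browser-like string is often enough to get past the block.
req = urllib.request.Request(
    "http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers",
    headers={'User-Agent': 'Mozilla/5.0'},
)
with urllib.request.urlopen(req) as response:
    print(response.status)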

The NSE website has changed, and the older scripts are only semi-optimal for the current site. This snippet gathers daily details of a security, including symbol, security type, previous close, open price, high price, low price, average price, traded quantity, turnover, number of trades, deliverable quantity, and the ratio of delivered vs. traded in percentage. These are conveniently presented as a list of dictionaries.

Python 3.x version with requests and BeautifulSoup:

from requests import get
from csv import DictReader
from bs4 import BeautifulSoup as Soup
from datetime import date
from io import StringIO

SECURITY_NAME = "3MINDIA"      # Change this to get a quote for another stock
START_DATE = date(2017, 1, 1)  # Start date of stock quote data (year, month, day)
END_DATE = date(2017, 9, 14)   # End date of stock quote data (year, month, day)

BASE_URL = "https://www.nseindia.com/products/dynaContent/common/productsSymbolMapping.jsp?symbol={security}&segmentLink=3&symbolCount=1&series=ALL&dateRange=+&fromDate={start_date}&toDate={end_date}&dataType=PRICEVOLUMEDELIVERABLE"


def getquote(symbol, start, end):
    # Note: the %-d / %-m (no zero padding) directives work on Linux/macOS
    # but not on Windows.
    start = start.strftime("%-d-%-m-%Y")
    end = end.strftime("%-d-%-m-%Y")

    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Referer': 'https://cssspritegenerator.com',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

    url = BASE_URL.format(security=symbol, start_date=start, end_date=end)
    d = get(url, headers=hdr)
    soup = Soup(d.content, 'html.parser')
    # NSE embeds the CSV in a div with id 'csvContentDiv', using ':' as the
    # row separator, so restore real newlines before parsing.
    payload = soup.find('div', {'id': 'csvContentDiv'}).text.replace(':', '\n')
    csv = DictReader(StringIO(payload))
    for row in csv:
        print({k: v.strip() for k, v in row.items()})


if __name__ == '__main__':
    getquote(SECURITY_NAME, START_DATE, END_DATE)

Besides, this is a relatively modular and ready-to-use snippet.

– Supreet Sethi
  • Thanks, man! This worked for me instead of the answer above from @andrean – Nitish Kumar Pal Jan 03 '18 at 08:14
  • Hi, I really don't know where to bang my head anymore. I've tried this solution and many more, but I keep getting error 403. Is there anything else I can try? – Francesco Feb 22 '18 at 23:03
  • A 403 status is meant to inform you that your client is not authorized to use this service. It may be that in your case it genuinely requires authentication, with basic auth, OAuth, etc.; see the sketch below. – Supreet Sethi Feb 23 '18 at 23:35
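
If the 403 really is an authentication issue, a minimal sketch of supplying credentials with requests (the URL and credentials are placeholders; the scheme the server expects may differ, e.g. OAuth or tokens):

import requests

# Placeholder URL and credentials; HTTP Basic auth shown here.
response = requests.get(
    "https://www.example.com/protected/data.csv",
    auth=("username", "password"),
    headers={'User-Agent': 'Mozilla/5.0'},
)
response.raise_for_status()
print(response.text[:200])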

This error usually occurs when the server you are querying doesn't know where the request is coming from; the server does this to avoid unwanted visits. You can bypass the error by defining a header and passing it to urllib.request.

Here's the code:

import urllib.request

# defining the header
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                        'AppleWebKit/537.11 (KHTML, like Gecko) '
                        'Chrome/23.0.1271.64 Safari/537.11',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
          'Accept-Encoding': 'none',
          'Accept-Language': 'en-US,en;q=0.8',
          'Connection': 'keep-alive'}

# the URL you are requesting
req = urllib.request.Request(url=your_url, headers=header)
page = urllib.request.urlopen(req).read()
– archit jain

There is one more thing worth trying: just update the Python version. One of my crawling scripts stopped working with a 403 on Windows 10 a few months back. No user agents helped, and I was about to give up on the script. Today I tried the same script on Ubuntu with Python 3.8.5 (64-bit) and it worked with no error. The Python version on Windows was a bit old: 3.6.2 (32-bit). After upgrading Python on Windows 10 to 3.9.5 (64-bit), I don't see the 403 any longer. If you give it a try, don't forget to run 'pip freeze > requirements.txt' to export your package entries; I forgot it, of course. This post is a reminder for me too, for when the 403 comes back again in the future.
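
If you suspect a version difference like this, a quick sketch to record which interpreter and OpenSSL build a script runs under, since both change between Python releases and can affect how a server treats the request:

import ssl
import sys

# The interpreter version is embedded in urllib's default User-Agent, and
# the linked OpenSSL build determines the TLS handshake details.
print(sys.version)
print(ssl.OPENSSL_VERSION)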

import urllib.request

bank_pdf_list = ["https://www.hdfcbank.com/content/bbp/repositories/723fb80a-2dde-42a3-9793-7ae1be57c87f/?path=/Personal/Home/content/rates.pdf",
                 "https://www.yesbank.in/pdf/forexcardratesenglish_pdf",
                 "https://www.sbi.co.in/documents/16012/1400784/FOREX_CARD_RATES.pdf"]


def get_pdf(url):
    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    headers = {'User-Agent': user_agent}

    request = urllib.request.Request(url, None, headers)  # the assembled request
    response = urllib.request.urlopen(request)
    data = response.read()

    # Derive a file name from the bank's domain, e.g. "hdfcbank_FOREX_CARD_RATES.pdf"
    name = url.split("www.")[-1].split("//")[-1].split(".")[0] + "_FOREX_CARD_RATES.pdf"
    with open(name, 'wb') as f:
        f.write(data)


for bank_url in bank_pdf_list:
    try:
        get_pdf(bank_url)
    except Exception:
        pass  # skip any bank whose download fails
– Rochan