I'm trying to scrape this page in Python to get the largest table on that page into a CSV file. I'm mostly following the answer here.
But I'm facing two problems:
- The column for Strike Price is missing
- Writing the data to the CSV is misaligned because of an aberrant string that contains many "\r" characters and ends with a single "\n". This puts a lot of whitespace characters into the CSV.
Following is the code I'm using. Please help me fix these two problems.
from urllib2 import Request, urlopen
from lxml import etree
import csv
ourl = "http://www.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?segmentLink=17&instrument=OPTIDX&symbol=NIFTY&date=31DEC2015"
headers = {'Accept': '*/*',
           'Accept-Language': 'en-US,en;q=0.5',
           'Host': 'nseindia.com',
           'Referer': 'http://www.nseindia.com/live_market/dynaContent/live_market.htm',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/35.0',
           'X-Requested-With': 'XMLHttpRequest'}
req = Request(ourl, None, headers)
response = urlopen(req)
the_page = response.read()
ptree = etree.HTML(the_page)
tr_nodes = ptree.xpath('//table[@id="octable"]/tr')
header = [i[0].text for i in tr_nodes[0].xpath("th")]
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
with open("nseoc.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(td_content)
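
One possible explanation for both symptoms, sketched on a made-up, well-formed fragment rather than the real page: an element's `.text` only holds the text that comes before its first child element, so a cell whose value is wrapped in nested tags (assumed here to be `<a><b>...</b></a>`) yields `None`, and any stray whitespace in the cell is carried over untouched. The sketch uses the standard library's `xml.etree.ElementTree` so that it runs without extra installs; lxml elements also provide `itertext()`, so the hypothetical helper `cell_text` should work on them as well. The sample markup and the helper name are assumptions, not taken from the page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment imitating one table row (not real data).
# The middle cell nests its value inside <a><b>...</b></a>, which is an
# assumption about how the Strike Price cell might be marked up.
SAMPLE = (
    '<table id="octable">'
    '<tr>'
    '<td>\r\r\r 1,200 \n</td>'
    '<td><a href="#"><b>8000.00</b></a></td>'
    '<td>-</td>'
    '</tr>'
    '</table>'
)


def cell_text(td):
    """Hypothetical helper: gather all text inside td, including text in
    nested tags, and collapse every run of whitespace to a single space."""
    return " ".join("".join(td.itertext()).split())


root = ET.fromstring(SAMPLE)
row = root.find("tr")

# .text stops at the first child element, so the nested cell gives None.
raw = [td.text for td in row.findall("td")]

# itertext() walks into the children; split/join removes stray whitespace.
clean = [cell_text(td) for td in row.findall("td")]

print(raw)
print(clean)
```

In the script above, the list comprehension could then call `cell_text(td)` instead of `td.text`, if the real markup is nested the way this sketch assumes.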