Missing one column and redundant whitespace/newlines when scraping a webpage using lxml in Python 2.7

Question

I'm trying to scrape this page in Python to get the biggest table on that page into a CSV file. I'm mostly following the answer here.

But I'm facing two problems:

The column for Strike Price is missing.
Writing the data to CSV is misaligned due to an aberrant string containing multitudes of "\r" and ending with a single "\n". This puts a lot of whitespace characters in the CSV.

Following is the code I'm using. Please help me fix these two problems.

from urllib2 import Request, urlopen
from lxml import etree
import csv

ourl = "http://www.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?segmentLink=17&instrument=OPTIDX&symbol=NIFTY&date=31DEC2015"

headers = {'Accept' : '*/*',
           'Accept-Language' : 'en-US,en;q=0.5',
           'Host': 'nseindia.com',
           'Referer': 'http://www.nseindia.com/live_market/dynaContent/live_market.htm',
           'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/35.0',
           'X-Requested-With': 'XMLHttpRequest'}

req = Request(ourl, None, headers)
response = urlopen(req)
the_page = response.read()

ptree = etree.HTML(the_page)
tr_nodes = ptree.xpath('//table[@id="octable"]/tr')
header = [i[0].text for i in tr_nodes[0].xpath("th")]
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]

with open("nseoc.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(td_content)

score 1 · Accepted Answer · answered Dec 08 '15 at 19:27

Writing the data to csv is misaligned due to an aberrant string containing multitudes of "\r" and ending with a single "\n"

First of all, I would use the lxml.html package, take the text_content() of every cell, and apply strip() afterwards:

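from lxml.html import fromstring

ptree = fromstring(the_page)

# The [1:] here and the [1:] in the comprehension below together skip the first two rows
tr_nodes = ptree.xpath('//table[@id="octable"]//tr')[1:]
# text_content() collects the text of nested elements too; strip() drops the "\r"/"\n" padding
td_content = [[td.text_content().strip() for td in tr.xpath('td')]
              for tr in tr_nodes[1:]]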

Here is how td_content would look:

[
    ['', '700', '-', '-', '-', '5,179.00', '-', '1,350', '4,972.25', '5,006.15', '450', '2700.00', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', ''], 
    ['', '-', '-', '-', '-', '-', '-', '1,200', '4,710.85', '5,254.15', '150', '2800.00', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', ''],
    ...
]

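Note that the "Strike Price" is there (2700.00 and 2800.00). The likely reason it was missing before is that td.text only returns the text that precedes a cell's first child element, so a value wrapped in a nested tag comes back as None, whereas text_content() also collects the text of all descendants.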
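To see the difference between .text and text_content() in isolation, here is a minimal sketch on a small made-up table. The HTML is a hypothetical stand-in, not the real NSE markup: it assumes the strike price is wrapped in nested tags and that one cell is padded with "\r" and "\n", and the names SAMPLE_HTML, raw_cells and clean_cells are only for illustration.

```python
from lxml.html import fromstring

# Hypothetical sample page: the "Strike Price" cell wraps its value in
# <a><b>...</b></a>, and the last cell is padded with "\r" and "\n".
SAMPLE_HTML = """<html><body>
<table id="octable">
  <tr><th>OI</th><th>Strike Price</th><th>LTP</th></tr>
  <tr><td>700</td><td><a href="#"><b>2700.00</b></a></td><td>\r\r\r  5,179.00\r\n</td></tr>
  <tr><td>-</td><td><a href="#"><b>2800.00</b></a></td><td>\r\r\r  -\r\n</td></tr>
</table>
</body></html>"""

tree = fromstring(SAMPLE_HTML)
rows = tree.xpath('//table[@id="octable"]//tr')

# The first row holds the column names in <th> cells.
header = [th.text_content().strip() for th in rows[0].xpath('th')]

# .text only sees text that comes before the first child element,
# so the nested strike price cell yields None.
raw_cells = [[td.text for td in tr.xpath('td')] for tr in rows[1:]]

# text_content() gathers text from all descendants; strip() removes
# the leading and trailing whitespace, including "\r" and "\n".
clean_cells = [[td.text_content().strip() for td in tr.xpath('td')]
               for tr in rows[1:]]

print(raw_cells[0][1])   # None
print(header)
print(clean_cells)
```

The header list and the cleaned rows can then go to the csv writer from the question, for example writer.writerow(header) followed by writer.writerows(clean_cells), so that the column names end up in the file as well.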
