I'm trying to scrape a TD Asset Management fund page (example below; I can't post more than two links) to retrieve the "Price As On" value, i.e. the dollar amount in this snippet of HTML:
<div class="td-layout-grid9 td-layout-column td-layout-column-first">
Price As On: Jun 12, 2015
<br>
<strong>$14.54 </strong>
<strong class="td-copy-red">-0.01 (-0.07%)</strong>
</div>
I was hoping to achieve this with Python, requests, lxml, and XPath, which I installed as follows:
apt-get update
apt-get install python python-pip python-dev gcc build-essential libxml2-dev libxslt-dev libffi-dev libssl-dev
pip install lxml
pip install requests
pip install requests[security]
Next, to retrieve the page I did this:
python
>>> from lxml import html
>>> import requests
>>> page = requests.get('https://www.tdassetmanagement.com/fundDetails.form?fundId=6320&lang=en')
>>> tree = html.fromstring(page.text)
Finally, I tried to retrieve the desired dollar value using the XPath of the relevant element, as copied from Chrome's "Inspect Element" tool:
>>> price = tree.xpath('//*[@id="fundCardVO"]/div[2]/div[1]/div[1]/div[1]/strong[1]')
>>> print price
Unfortunately the result is [<Element strong at 0x29a9998>] rather than the expected dollar amount, $14.54.
To confirm that the initial requests.get call retrieved the expected data, I ran this:
>>> print page.content
The result can be seen here: http://pastebin.com/f5C4MFQb.
If I paste the above HTML into this tool: http://videlibri.sourceforge.net/cgi-bin/xidelcgi, my XPath query //*[@id="fundCardVO"]/div[2]/div[1]/div[1]/div[1]/strong[1] returns the dollar amount as expected.
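For reference, here is a minimal offline sketch of the same kind of query, run against the HTML snippet from the top of this post rather than the live page. It relies on lxml's xpath() returning a list of element objects (not strings), with the text available through each element's .text attribute or by ending the XPath with /text(). The shortened XPath //div/strong[1] and the variable names are assumptions made only to keep the example self-contained; they are not the path Chrome produced.

```python
from lxml import html

# Inline copy of the HTML snippet from the question, so no network access is needed.
SNIPPET = """<div class="td-layout-grid9 td-layout-column td-layout-column-first">
Price As On: Jun 12, 2015
<br>
<strong>$14.54 </strong>
<strong class="td-copy-red">-0.01 (-0.07%)</strong>
</div>"""

tree = html.fromstring(SNIPPET)

# Shortened, hypothetical XPath: the first <strong> child of the <div>.
elements = tree.xpath('//div/strong[1]')
print(elements)  # a list holding one Element object, not a string

# The element's text content lives on its .text attribute.
price = elements[0].text.strip()
print(price)

# Alternatively, select the text node directly in the XPath expression.
texts = tree.xpath('//div/strong[1]/text()')
print(texts)
```

If this behaves the same way against the live page, the list printed in my session above would simply be the matched element, and the dollar amount would still need to be read from it.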
Any hints or tips on how I might use Python, lxml, and XPath to retrieve the desired value for this element would be very much appreciated. If there's a completely different way to obtain the same result, I would be interested in that too.
Thanks.