
I'm trying to scrape TD Asset Management pages (example below; I can't post more than two links) in order to retrieve the "price as on" value, i.e. the dollar amount in this snippet of HTML:

<div class="td-layout-grid9 td-layout-column td-layout-column-first">
Price As On: Jun 12, 2015
<br>
<strong>$14.54&nbsp;&nbsp;</strong>
<strong class="td-copy-red">-0.01 (-0.07%)</strong>
</div>

I was hoping to achieve this with Python, requests, lxml, and XPath, which I installed as follows:

apt-get update
apt-get install python python-pip python-dev gcc build-essential libxml2-dev libxslt-dev libffi-dev libssl-dev
pip install lxml
pip install requests
pip install requests[security]

Next, to retrieve the page I did this:

python
>>> from lxml import html
>>> import requests
>>> page = requests.get('https://www.tdassetmanagement.com/fundDetails.form?fundId=6320&lang=en')
>>> tree = html.fromstring(page.text)

Finally, I tried to retrieve the dollar value using the XPath of the relevant element, as copied from Chrome's "Inspect Element" tool:

>>> price = tree.xpath('//*[@id="fundCardVO"]/div[2]/div[1]/div[1]/div[1]/strong[1]')
>>> print price

Unfortunately the result is [<Element strong at 0x29a9998>] rather than the expected dollar amount $14.54&nbsp;&nbsp;.

To ensure that the expected data was retrieved by the initial "requests.get", I ran this:

>>> print page.content

The result can be seen here: http://pastebin.com/f5C4MFQb.

If I paste the above HTML into this tool: http://videlibri.sourceforge.net/cgi-bin/xidelcgi my XPath query //*[@id="fundCardVO"]/div[2]/div[1]/div[1]/div[1]/strong[1] returns the dollar amount as expected.

Any hints or tips as to how I might be able to use Python, lxml, and XPath to retrieve the desired value for this element would be very much appreciated. If there's a completely different way that I could be going about this to obtain the same result I would be interested in that too.

Thanks.

chris1h1
  • You're getting a list containing a single `Element`; what's the problem, exactly? – jonrsharpe Jun 13 '15 at 19:49
  • The problem is that I have no idea how to get the desired value of `$14.54&nbsp;&nbsp;`. If I do something like `>>> print tree.xpath('//title/text()')` I get the actual title, whereas when I try to get the dollar value I get `[<Element strong at 0x29a9998>]` instead. I should mention that I'm a complete beginner. Thanks for your help. – chris1h1 Jun 13 '15 at 19:50
  • Have you tried reading the documentation to find out what an `Element` is, or using e.g. `dir` to find out about its attributes? – jonrsharpe Jun 13 '15 at 19:51

2 Answers


After further Googling to find out what elements are (xpath returns a list of Element objects, each of which has attributes such as tag and text), followed by more Googling about a UnicodeEncodeError (see "UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)"), I was able to obtain my desired value with this:

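>>> # xpath() returns a list of matching elements, so index into it below
>>> priceelement = tree.xpath('//*[@id="fundCardVO"]/div[2]/div[1]/div[1]/div[1]/strong[1]')
>>> # .text is a unicode string ending in two non-breaking spaces (u'\xa0')
>>> priceunicode = priceelement[0].text
>>> # encode to UTF-8 bytes so Python 2's print does not raise UnicodeEncodeError
>>> price = priceunicode.encode('utf-8')
>>> print price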
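
For anyone on Python 3, here is a minimal sketch of the same steps, assuming lxml is installed. It parses the HTML fragment from the question instead of fetching the live page, so the relative XPath `.//strong[1]` and the names `SNIPPET`, `matches`, `raw_text`, and `price_text` are illustrative assumptions, not what the real page needs.

```python
from lxml import html

# Fragment copied from the question; the live page wraps it in much more markup.
SNIPPET = """
<div class="td-layout-grid9 td-layout-column td-layout-column-first">
Price As On: Jun 12, 2015
<br>
<strong>$14.54&nbsp;&nbsp;</strong>
<strong class="td-copy-red">-0.01 (-0.07%)</strong>
</div>
"""

tree = html.fromstring(SNIPPET)

# Selecting nodes with xpath() yields a list of Element objects, even for one match.
matches = tree.xpath('.//strong[1]')
first = matches[0]

# .text holds the element's text; each &nbsp; arrives as the character U+00A0.
raw_text = first.text

# Appending /text() makes XPath return the strings themselves instead of elements.
texts = tree.xpath('.//strong[1]/text()')

# In Python 3, str.strip() treats U+00A0 as whitespace and removes it.
price_text = raw_text.strip()
print(price_text)
```

In Python 3 the `.encode('utf-8')` step should not be needed just for printing, since `str` is already Unicode.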

Thanks for nudging me in the right direction, jonrsharpe.

I still wasn't able to work out how to list the attributes available on the element, although I did find that tag and text were available.
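
A small sketch of one way to inspect an element, assuming Python 3 and lxml. It uses the built-in `dir` for the Python-level names, and `.attrib` / `.get()` for the HTML attributes. The fragment and the variable names are illustrative assumptions.

```python
from lxml import html

# Shortened fragment based on the markup in the question.
tree = html.fromstring(
    '<div><strong>$14.54</strong>'
    '<strong class="td-copy-red">-0.01 (-0.07%)</strong></div>'
)
element = tree.xpath('.//strong[@class="td-copy-red"]')[0]

# dir() lists every name on the object; drop the underscore-prefixed ones.
public_names = [name for name in dir(element) if not name.startswith('_')]
print(public_names)

# Python-level properties of the element.
print(element.tag)
print(element.text)

# HTML attributes live in .attrib (dict-like); .get() reads a single one.
print(dict(element.attrib))
print(element.get('class'))
```

Note the two meanings of "attributes" here: `dir` shows Python attributes such as `tag`, `text`, and `xpath`, while `.attrib` holds the HTML attributes written in the markup.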

I went on to get just the number (without the dollar symbol and trailing non-breaking spaces) with this:

>>> import re
>>> p = re.search(r'[0-9]{1,3}\.[0-9]{2}', price)
>>> price = p.group(0)
>>> print price
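
A slightly more defensive variant of that clean-up, sketched for Python 3 using only the standard library. The helper name `parse_price` and its pattern are my own assumptions. It also accepts prices with more than three digits or with thousands separators, and it returns a `Decimal` instead of a string.

```python
import re
from decimal import Decimal

# Digits, optionally grouped with commas, followed by exactly two decimals.
PRICE_RE = re.compile(r'(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}')


def parse_price(text):
    """Return the first price found in text as a Decimal, or None if absent."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return Decimal(match.group(0).replace(',', ''))


print(parse_price('$14.54\xa0\xa0'))
print(parse_price('$1,234.56'))
print(parse_price('no price here'))
```

`Decimal` avoids the rounding surprises that `float` can introduce when the value is later used in arithmetic.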
chris1h1
  • *"I still was not able to determine how to obtain a list of available attributes for the element"* - `dir(priceelement[0])`. – jonrsharpe Jun 13 '15 at 21:27

Use a for loop over the list that xpath returns and print each element's text: for x in price: print(x.text)

  • As it's currently written, your answer is unclear. Please edit it to add details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. – Community Mar 09 '23 at 23:29