
I'm back again with a question for the wonderful people here :)

I've recently begun getting back into Python (50% done at Codecademy lol) and decided to write a quick script for web-scraping the spot price of gold in CAD. This will eventually be part of a much bigger script... but I'm VERY rusty and thought it would be a good project.

My issue: I have been following the guide at http://docs.python-guide.org/en/latest/scenarios/scrape/ to accomplish my goal; however, my script always returns/prints

<Element html at 0xRANDOM>

with RANDOM being an (I assume) random hex number. This happens no matter what website I use.

My Code:

#!/bin/python
#Scrape current gold spot price in CAD

from lxml import html
import requests

def scraped_price():
    page = requests.get('http://goldprice.org/gold-price-canada.html')
    tree = html.fromstring(page.content)

    print "The full page is: ", tree #added for debug WHERE ERROR OCCURS
    bid = tree.xpath("//span[@id='gpotickerLeftCAD_price']/text()")
    print "Scraped content: ", bid
    return bid
gold_scraper = scraped_price()

My research:

1) www.w3schools.com/xsl/xpath_syntax.asp

This is where I figured out to use '//span' to find all 'span' elements and then the @id predicate to narrow it down to the one I need (a minimal sketch of that lookup follows this list).

2) Scraping web content using xpath won't work

This makes me think I simply have a bad tree.xpath expression; however, I cannot figure out where or why.
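
For reference, here is a minimal sketch of the kind of lookup I expect to work (the HTML snippet is made up, not taken from the real site):

from lxml import html

# made-up snippet with the same structure I am targeting
snippet = "<html><body><span id='gpotickerLeftCAD_price'>1,234.56</span></body></html>"
tree = html.fromstring(snippet)

# same locator style as my script above; here it returns ['1,234.56']
print(tree.xpath("//span[@id='gpotickerLeftCAD_price']/text()"))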

Any assistance would be greatly appreciated.

L8NIT3TR0UBL3

1 Answer


<Element html at 0xRANDOM>

What you see printed is the string representation of lxml.html's Element class. If you want to see the actual HTML content, use tostring():

print(html.tostring(tree, pretty_print=True))
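
A minimal sketch of how that fits into the original function, assuming you keep html.fromstring for parsing and use tostring() only for display (calling tostring() on the raw page.content string rather than on the parsed tree raises "TypeError: Type 'str' cannot be serialized"):

from lxml import html
import requests

def scraped_price():
    page = requests.get('http://goldprice.org/gold-price-canada.html')
    tree = html.fromstring(page.content)  # parse the response into an Element

    # debug: dump the actual HTML that was downloaded
    print(html.tostring(tree, pretty_print=True))

    bid = tree.xpath("//span[@id='gpotickerLeftCAD_price']/text()")
    return bid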

You are also getting Scraped content: [] printed, which means that no elements matched the locator. And if you look at the HTML printed above, there is indeed no element with id="gpotickerLeftCAD_price" in the downloaded source.

The prices on this particular site are retrieved dynamically via continuous JSONP GET requests issued periodically. You can either look into simulating these requests (a rough sketch follows the demo below), or stay at a higher level and automate a browser via selenium. Demo (using the PhantomJS headless browser):

>>> import time
>>> from selenium import webdriver
>>> 
>>> driver = webdriver.PhantomJS()
>>> driver.get("http://goldprice.org/gold-price-canada.html")
>>> while True:
...     print(driver.find_element_by_id("gpotickerLeftCAD_price").text)
...     time.sleep(1)
... 
1,595.28
1,595.28
1,595.28
1,595.28
1,595.28
1,595.19
...
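
If you want to go the request-simulation route instead, the general approach is to copy the ticker request from the browser's network tab and strip the JSONP wrapper before parsing the JSON. The URL and callback name below are placeholders, not the site's real endpoint, so treat this as a sketch of the technique only:

import json
import re
import requests

def fetch_jsonp(url):
    # a JSONP body looks like callbackName({...}); strip the wrapper to get plain JSON
    text = requests.get(url).text
    payload = re.search(r'\((.*)\)\s*;?\s*$', text, re.DOTALL).group(1)
    return json.loads(payload)

# placeholder URL -- replace with the actual ticker request seen in the network tab
# print(fetch_jsonp("http://example.com/gold-ticker?currency=CAD&callback=cb"))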
alecxe
  • switching to html.tostring gives me an error: Type 'str' cannot be serialized – L8NIT3TR0UBL3 Jan 20 '16 at 02:55
  • I'm not quite sure I understand. My code assigns the html.fromstring result to the variable 'tree', and when I change it to 'tostring' I get a serialization error. I'm not sure how to work your fix into my code. I apologize, it's been a long day and I'm not working at full capacity lol - maybe selenium would be a better idea for me lol – L8NIT3TR0UBL3 Jan 20 '16 at 03:03
  • @L8NIT3TR0UBL3 it's okay, here is what I'm executing: https://gist.github.com/alecxe/353738460807a04faa2a. Also, updated the answer with a full story. – alecxe Jan 20 '16 at 03:05
  • Absolutely perfect! I apologize for being so dense lol. I may try both ways and see which I like best :) Thank you so much :) – L8NIT3TR0UBL3 Jan 20 '16 at 03:07
  • Just a quick add-on: if PhantomJS needs to be downloaded for this to work, how could I bundle the binary (or at least the libraries) so I can make an executable from the eventual bigger program? – L8NIT3TR0UBL3 Jan 20 '16 at 03:26
  • @L8NIT3TR0UBL3 sorry, this is not something I can help you with. Happy scraping! – alecxe Jan 20 '16 at 05:58