I am very new to Python and am currently working on a parser for a web-based XML file.

I can retrieve all but one of the elements using minidom with no issues, but there is one node I am having trouble with. The last node I need from the XML file is the 'url' inside the 'image' tag, as shown in the following example:

<events>
    <event id="abcde01">
        <title> Name of event </title>
        <url> The URL of the Event <- the url tag I do not need </url>
        <image> 
            <url> THE URL I DO NEED </url>
        </image>
    </event>
</events>

Below are the sections of my code that I think are relevant. I would really appreciate any help retrieving this last image url node. I have also included what I tried and the error I received when I ran this code in GAE. I am using Python 2.7, and I should point out that I am saving the values in a list (for later insertion into a database).

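import urllib
import webapp2
from xml.dom import minidom as mdom  # assumed import behind the 'mdom' name used below

class XMLParser(webapp2.RequestHandler):
    def get(self):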
        base_url = 'http://api.eventful.com/rest/events/search?location=Dublin&date=Today'
        #downloads data from xml file:
        response = urllib.urlopen(base_url)
        #converts data to string
        data = response.read()
        unicode_data = data.decode('utf-8')
        data = unicode_data.encode('ascii','ignore')
        #closes file
        response.close()
        #parses xml downloaded
        dom = mdom.parseString(data)        
        node = dom.documentElement  #needed for declaration of variable
        #print out all event names (titles) found in the eventful xml
        event_main = dom.getElementsByTagName('event')

        #URLs list parsing - MY ATTEMPT - 
        urls_list = []
        for im in event_main:
            image_url = im.getElementsByTagName("image")[0].childNodes[0]
            urls_list.append(image_url)

The error I receive is shown below. Any help is much appreciated, Karen

image_url = im.getElementsByTagName("image")[0].childNodes[0]
IndexError: list index out of range
Martijn Pieters
Karen
  • Do not decode and re-encode the data! Leave decoding to the XML parser. Any reason you cannot use the [ElementTree API](http://docs.python.org/2/library/xml.etree.elementtree.html) instead of the minidom? – Martijn Pieters Apr 20 '13 at 08:08
  • That URL returns an error response for me; I get an `Authentication Error` message. Perhaps you do too? – Martijn Pieters Apr 20 '13 at 08:11
  • Hi @MartijnPieters, I left the API key out of this example because I thought it would keep things simpler. I can insert the API key if you feel that would be more useful, but I am not having issues with it; the problem is accessing the elements of the image tag. I had to decode and re-encode the XML data after it was parsed because of an encoding issue with a black star found in the XML data. http://stackoverflow.com/questions/16026594/unicode-encoding-errors-python-parsing-xml-cant-encode-a-character-star/16073981?noredirect=1#16073981 – Karen Apr 20 '13 at 09:25
  • That doesn't look like an issue with the *XML input* at all! There you are *encoding* Unicode data; the error does not lie with your XML. The problem there is most likely with the `print` statement and whatever your `stdout` is at that time. Without a traceback it is impossible to diagnose any further, though. – Martijn Pieters Apr 20 '13 at 09:38
  • No need for the API key, just covering all the bases. – Martijn Pieters Apr 20 '13 at 09:40

1 Answer

First of all, do not re-encode the content. There is no need to do so; XML parsers are perfectly capable of handling encoded content.
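As a rough illustration of that point, here is a minimal sketch that feeds UTF-8 encoded bytes straight to the parser. The sample document and the `raw_bytes` name are made up for this example; the real feed would come from the `urllib` response instead.

```python
# -*- coding: utf-8 -*-
from xml.etree import ElementTree as ET

# Hypothetical sample: UTF-8 encoded bytes containing a black star (U+2605),
# standing in for the data read from the response.
raw_bytes = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<events><event id="abcde01">'
    u'<title>Star \u2605 night</title>'
    u'</event></events>'
).encode('utf-8')

# The parser decodes the bytes itself, using the declared encoding;
# no manual decode/encode step is involved.
root = ET.fromstring(raw_bytes)
title = root.find('.//title').text
```

The star survives parsing intact here. If an encoding error shows up later, it is more likely to come from how the text is written out than from the parse step.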

Next, I'd use the ElementTree API for a task like this:

import urllib
from xml.etree import ElementTree as ET

response = urllib.urlopen(base_url)
tree = ET.parse(response)

urls_list = []
for event in tree.findall('.//event[image]'):
    # find the text content of the first <image><url> tag combination:
    image_url = event.find('.//image/url')
    if image_url is not None:
        urls_list.append(image_url.text)

This only considers event elements that have a direct image child element, so events without an image are skipped instead of raising an error.
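To see the effect on events that lack an image, here is a small self-contained sketch run against inline sample data rather than the live feed. The `SAMPLE` document, the example.com URLs, and both function names are invented for illustration. The second function is only a guess at how the same guard might look if you prefer to stay with minidom; it is not taken from your code.

```python
from xml.dom import minidom
from xml.etree import ElementTree as ET

# Hypothetical sample modelled on the question's XML; the second event
# deliberately has no <image> element.
SAMPLE = """<events>
    <event id="abcde01">
        <title>Name of event</title>
        <url>http://example.com/event</url>
        <image>
            <url>http://example.com/image.png</url>
        </image>
    </event>
    <event id="abcde02">
        <title>Event without an image</title>
        <url>http://example.com/other-event</url>
    </event>
</events>"""


def image_urls_etree(xml_text):
    """Collect <image><url> text with ElementTree, skipping image-less events."""
    root = ET.fromstring(xml_text)
    urls = []
    for event in root.findall('.//event[image]'):
        image_url = event.find('image/url')
        if image_url is not None and image_url.text:
            urls.append(image_url.text.strip())
    return urls


def image_urls_minidom(xml_text):
    """Same idea with minidom: check each lookup before indexing into it."""
    dom = minidom.parseString(xml_text)
    urls = []
    for event in dom.getElementsByTagName('event'):
        images = event.getElementsByTagName('image')
        if not images:
            continue  # no <image>: indexing [0] here would raise IndexError
        url_nodes = images[0].getElementsByTagName('url')
        if url_nodes and url_nodes[0].firstChild is not None:
            urls.append(url_nodes[0].firstChild.data.strip())
    return urls
```

Both functions return only the image URL of the first event and silently skip the second one. The sketch uses `image/url` rather than `.//image/url` so that only a direct `image` child is matched, which is an assumption about the feed's layout.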

Martijn Pieters