I am a beginner to Python and am currently parsing a web-based XML file from the eventful.com API. However, I am getting Unicode errors when retrieving certain elements of the data.

I can retrieve five of the data elements I want from the XML file without any problems, but then the script terminates and produces the following error in the GAE error console:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2605' in position 0: ordinal not in range(128)

I know that the character tripping up my parser is a "★" character, which I would prefer not to retrieve from the XML file anyway.

My code is as follows:

import urllib
import webapp2
import xml.dom.minidom as mdom

class XMLParser(webapp2.RequestHandler):
    def get(self):
        base_url = 'my xml file'
        # download the data from the XML file
        response = urllib.urlopen(base_url)
        # read the response body into a string
        data = response.read()

        # close the connection
        response.close()

        # parse the downloaded XML
        dom = mdom.parseString(data)
        node = dom.documentElement
        # print out all event names (titles) found in the eventful xml
        event_main = dom.getElementsByTagName('event')

        event_names = []
        for event in event_main:
            eventObj = event.getElementsByTagName("title")[0]
            event_names.append(eventObj)

        for ev in event_names:
            nodes = ev.childNodes
            for node in nodes:
                if node.nodeType == node.TEXT_NODE:
                    print node.data

Is there any way to retrieve the "title" elements and ignore unusual characters like the ★ character here? I would really appreciate any help on this matter. I have already tried solutions that use word.encode('us-ascii', 'ignore'), but this does not fix the issue.

Update: I have found the solution.

After struggling with this problem and talking to a lecturer about it, I found that all it required was two lines of code, applied to the data after it is read into the program: decode it as UTF-8, then re-encode it as ASCII while ignoring any characters ASCII cannot represent. Hope this helps someone else having the same issue!

unicode_data = data.decode('utf-8')
data = unicode_data.encode('ascii','ignore')
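
To illustrate what those two lines do, here is a minimal sketch on a made-up title. The helper name strip_non_ascii and the sample string are only assumptions for illustration and are not part of the original program; the sketch also assumes the downloaded bytes really are UTF-8.

```python
def strip_non_ascii(raw_bytes, source_encoding='utf-8'):
    """Decode raw bytes, then drop every character ASCII cannot represent."""
    # Step 1: bytes -> unicode text (assumes the feed is UTF-8 encoded).
    text = raw_bytes.decode(source_encoding)
    # Step 2: unicode text -> ASCII bytes; 'ignore' silently discards
    # characters such as u'\u2605' instead of raising UnicodeEncodeError.
    return text.encode('ascii', 'ignore')


# Hypothetical sample standing in for part of the downloaded feed.
sample = u'\u2605 Jazz Night \u2605'.encode('utf-8')
cleaned = strip_non_ascii(sample)
```

Note that the star is removed rather than replaced, so the surrounding spaces are left behind and any other non-ASCII text (accented names, for example) is discarded too.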
Karen
  • http://stackoverflow.com/questions/3224268/python-unicode-encode-error?rq=1 – Patashu Apr 16 '13 at 01:56
  • I have tried the suggested Unicode encoding and decoding methods and still have had no luck. – Karen Apr 16 '13 at 11:19
  • Your problem here is in *printing* the node data. Without a full traceback for the exception I cannot help diagnose this any more, but the problem cannot possibly lie with what the parser deals with. – Martijn Pieters Apr 20 '13 at 09:38

1 Answer

Where are you using your decoding methods?

I had this error in the past and had to decode the raw data. In other words, I would try:

data = response.read()
#closes file
response.close()
# decode the raw bytes and keep the result
data = data.decode("us-ascii")

That is, if it is in fact ASCII. My point is: make sure you encode/decode the raw results while they are still in string format, before you call parseString on them.
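
As an alternative sketch, in line with the comment that the failure happens when printing rather than when parsing, the raw bytes can be handed to the parser untouched and the title text encoded only at output time. The SAMPLE_XML document and the event_titles helper below are assumptions made up for illustration; the real eventful.com response layout is not shown in the question.

```python
import xml.dom.minidom as mdom

# Hypothetical stand-in for the downloaded feed (UTF-8 encoded bytes).
SAMPLE_XML = (
    u'<events><event><title>\u2605 Jazz Night</title></event></events>'
).encode('utf-8')


def event_titles(xml_bytes):
    """Return the text of the first <title> of every <event> as unicode."""
    dom = mdom.parseString(xml_bytes)
    titles = []
    for event in dom.getElementsByTagName('event'):
        title = event.getElementsByTagName('title')[0]
        # Join all text nodes directly under <title>.
        text = u''.join(
            node.data for node in title.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        titles.append(text)
    return titles


titles = event_titles(SAMPLE_XML)

# Option A: drop anything ASCII cannot represent, then trim leftover spaces.
ascii_titles = [t.encode('ascii', 'ignore').decode('ascii').strip() for t in titles]

# Option B: keep every character by encoding explicitly as UTF-8 for output.
utf8_titles = [t.encode('utf-8') for t in titles]
```

Option A matches the wish to ignore the ★ character; option B keeps it and avoids the implicit ASCII encoding that triggers the UnicodeEncodeError.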

Josh