I have a folder of XML files that I would like to parse. I need to get text out of the elements of these files. They will be collected and printed to a CSV file where the elements are listed in columns.

I can actually do this right now for some of my files. That is, for many of my XML files, the process goes fine, and I get the output I want. The code that does this is:

import os, re, csv, string, operator
import xml.etree.cElementTree as ET
import codecs
def parseEO(doc):
    #getting the basic structure
    tree = ET.ElementTree(file=doc)
    root = tree.getroot()
    agencycodes = []
    rins = []
    titles =[]
    elements = [agencycodes, rins, titles]
    #pulling in the text from the fields
    for elem in tree.iter():
        if elem.tag == "AGENCY_CODE":
            agencycodes.append(int(elem.text))
        elif elem.tag == "RIN":
            rins.append(elem.text)
        elif elem.tag == "TITLE":
            titles.append(elem.text)
    with open('parsetest.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerows(zip(*elements))


parseEO('EO_file.xml')     

However, on some versions of the input file, I get the infamous error:

'ascii' codec can't encode character u'\x97' in position 32: ordinal not in range(128)

The full traceback is:

    ---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-15-28d095d44f02> in <module>()
----> 1 execfile(r'/parsingtest.py') # PYTHON-MODE

/Users/ian/Desktop/parsingtest.py in <module>()
     91         writer.writerows(zip(*elements))
     92 
---> 93 parseEO('/EO_file.xml')
     94 
     95 

/parsingtest.py in parseEO(doc)
     89     with open('parsetest.csv', 'w') as f:
     90         writer = csv.writer(f)
---> 91         writer.writerows(zip(*elements))
     92 
     93 parseEO('/EO_file.xml')
UnicodeEncodeError: 'ascii' codec can't encode character u'\x97' in position 32: ordinal not in range(128)

I am fairly confident from reading the other threads that the problem is in the codec being used (and, you know, the error is pretty clear on that as well). However, the solutions I have read haven't helped me (emphasized because I understand I am the source of the problem, not the way people have answered in the past).

Several responses (such as this one, this one, and this one) don't deal directly with ElementTree, and I'm not sure how to translate the solutions into what I'm doing.

Other solutions that do deal with ElementTree (such as: this one and this one) are either using a short string (the first link here) or are using the .tostring/.fromstring methods in ElementTree which I do not. (Though, of course, perhaps I should be.)

Things I have tried that didn't work:

  1. I have attempted to bring in the file via UTF-8 encoding:

    infile = codecs.open('/EO_file.xml', encoding="utf-8")
    parseEO(infile)
    

    but I think the ElementTree process already understands it to be UTF-8 (which is noted in the first line of all the XML files I have), and so this is not only not correct, but is actually redundantly bad all over again.

  2. I attempted to declare an encoding process within the loop, replacing:

    tree = ET.ElementTree(file=doc)
    

    with

    parser = ET.XMLParser(encoding="utf-8")
    tree = ET.parse(doc, parser=parser)
    

    in the function above that does work. This didn't work for me either. The same files that worked before still worked; the same files that created the error still created the error.

There have been a lot of other random attempts, but I won't belabor the point.

So, while I assume the code I have is both inefficient and offensive to good programming style, it does do what I want for several files. I am trying to understand whether there is simply an argument I'm missing, whether I should somehow pre-process the files (I have not identified where the offending character is, but do know that u'\x97' translates to a control character of some kind), or whether there is some other option.
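One way to locate the offending character is to walk the parsed tree and report every non-ASCII character in element text. This is a minimal sketch, not something from the original code: the helper name `find_non_ascii` and the sample XML are hypothetical, and only the `TITLE` and `RIN` tag names come from the question.

```python
import io
import xml.etree.ElementTree as ET


def find_non_ascii(source):
    """Yield (tag, position, character) for each non-ASCII character in element text."""
    tree = ET.parse(source)
    for elem in tree.iter():
        # .text is None for elements without text, so fall back to an empty string
        text = elem.text or u""
        for pos, ch in enumerate(text):
            if ord(ch) > 127:
                yield elem.tag, pos, ch


# Hypothetical sample document standing in for one of the failing files
sample_xml = u"<ROOT><TITLE>Rule \x97 final</TITLE><RIN>0000-AA00</RIN></ROOT>"
hits = list(find_non_ascii(io.BytesIO(sample_xml.encode("utf-8"))))

for tag, pos, ch in hits:
    print("%s: position %d, code point U+%04X" % (tag, pos, ord(ch)))
```

Passing a real file path to `find_non_ascii` instead of the in-memory sample should work the same way, since `ET.parse` accepts either a filename or a file object.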

Savage Henry

2 Answers

You are parsing XML; the XML API hands you unicode values. You are then attempting to write the unicode data to a CSV file without encoding it first. Python then attempts to encode it for you with the default ASCII codec, and fails. You can see this in your traceback: it is the .writerows() call that fails, and the error tells you that encoding is failing, not decoding (parsing the XML).
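The failure can be reproduced without any XML or CSV involved. The sketch below uses a made-up sample string; only the `u'\x97'` character is taken from the question. It shows the ASCII codec rejecting the character while UTF-8 encodes it.

```python
# Hypothetical sample text containing the character from the error message
text = u"Regulatory Flexibility \x97 final rule"

try:
    text.encode("ascii")          # what Python 2 attempts implicitly on write
    ascii_ok = True
    failing_position = None
except UnicodeEncodeError as exc:
    ascii_ok = False
    failing_position = exc.start  # index of the character that could not be encoded

# An explicit encoding that covers every Unicode code point succeeds
utf8_bytes = text.encode("utf-8")

print(ascii_ok, failing_position, repr(utf8_bytes))
```

The position reported by the exception is an index into the string being encoded, which is why the original error says "position 32" rather than pointing at a line in the XML file.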

You need to choose an encoding, then encode your data before writing:

for elem in tree.iter():
    if elem.tag == "AGENCY_CODE":
        agencycodes.append(int(elem.text))
    elif elem.tag == "RIN":
        rins.append(elem.text.encode('utf8'))
    elif elem.tag == "TITLE":
        titles.append(elem.text.encode('utf8'))

I used the UTF-8 encoding because it can handle any Unicode code point, but you need to make your own, explicit choice.
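For readers on Python 3 (an assumption; the code in this thread targets Python 2), the same explicit choice is made when opening the output file rather than per value, because `csv.writer` there accepts text strings directly. A minimal sketch, with a hypothetical sample document and the helper name `parse_to_csv` made up for illustration:

```python
import csv
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample input; tag names follow the question, values are invented
SAMPLE = (
    "<ROOT><ENTRY>"
    "<AGENCY_CODE>12</AGENCY_CODE>"
    "<RIN>0000-AA00</RIN>"
    "<TITLE>Rule \x97 final</TITLE>"
    "</ENTRY></ROOT>"
)


def parse_to_csv(xml_source, csv_path):
    """Collect AGENCY_CODE, RIN and TITLE text and write them as CSV columns."""
    tree = ET.parse(xml_source)
    agencycodes, rins, titles = [], [], []
    for elem in tree.iter():
        if elem.tag == "AGENCY_CODE":
            agencycodes.append(int(elem.text))
        elif elem.tag == "RIN":
            rins.append(elem.text)
        elif elem.tag == "TITLE":
            titles.append(elem.text)
    # The encoding is chosen here; newline="" is what the csv module expects
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(zip(agencycodes, rins, titles))


csv_path = os.path.join(tempfile.mkdtemp(), "parsetest.csv")
parse_to_csv(io.BytesIO(SAMPLE.encode("utf-8")), csv_path)

# Read the file back with the same encoding to check the round trip
with open(csv_path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Whoever reads the CSV afterwards has to use the same encoding, which is one reason to make the choice explicit instead of relying on a platform default.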

Martijn Pieters

It sounds like you have a non-ASCII character somewhere in your XML file. A unicode object is not the same thing as a byte string encoded as UTF-8.

The Python 2.7 csv module doesn't support unicode input directly, so you'll have to run the data through a function that encodes it before you dump it into your CSV file.

def normalize(s):
    if type(s) == unicode: 
        return s.encode('utf8', 'ignore')
    else:
        return str(s)

so your code would look like this:

for elem in tree.iter():
    if elem.tag == "AGENCY_CODE":
        agencycodes.append(int(elem.text))
    elif elem.tag == "RIN":
        rins.append(normalize(elem.text))
    elif elem.tag == "TITLE":
        titles.append(normalize(elem.text))
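A variant of the helper, sketched with an `isinstance` check and a guard for elements that have no text (ElementTree sets `.text` to `None` in that case). The `PY2` flag, the `text_type` alias, and the `None` handling are assumptions added for illustration so that the same function also runs on Python 3, where the csv module takes text strings and no encoding step is needed here.

```python
import sys

PY2 = sys.version_info[0] == 2
text_type = type(u"")  # unicode on Python 2, str on Python 3


def normalize(value):
    """Return a value that csv.writer can write on the running Python version."""
    if value is None:
        # Empty elements have .text == None; write an empty cell instead of "None"
        return ""
    if PY2 and isinstance(value, text_type):
        # Python 2's csv module wants byte strings, so encode explicitly
        return value.encode("utf-8")
    return str(value)


print(repr(normalize(None)), repr(normalize(12)), repr(normalize(u"Rule \x97 final")))
```

With this in place, `int(elem.text)` for `AGENCY_CODE` would still raise on an empty element, so that branch may need its own check.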
eblahm
  • Greatly appreciated. The suggestion from Martijn was the first one I tried and that fixed my immediate problem. This looks like something I can learn from for the next round. Thanks so much for taking the time to post. – Savage Henry Jun 23 '13 at 00:18
  • I've spent many hours (unfortunately) trying to debug unicode errors. It's taken a while for me to wrap my mind around it! Happy to help! – eblahm Jun 23 '13 at 00:23
  • Don't use `type(s) == unicode`; use `isinstance(s, unicode)` instead. – Martijn Pieters Jun 23 '13 at 00:36