*Note: lxml will not run on my system. I was hoping to find a solution that does not involve lxml.

I have gone through some of the documentation around here already, but I am having difficulty getting this to work the way I would like. I would like to parse an XML file that looks like this:

<dict>
    <key>1375</key>
    <dict>
        <key>Key 1</key><integer>1375</integer>
        <key>Key 2</key><string>Some String</string>
        <key>Key 3</key><string>Another string</string>
        <key>Key 4</key><string>Yet another string</string>
        <key>Key 5</key><string>Strings anyone?</string>
    </dict>
</dict>

In the file I am trying to manipulate, more 'dict' elements follow this one. I would like to read through the XML and output a text/dat file that would look like this:

1375, "Some String", "Another string", "Yet another string", "Strings anyone?"

...

Eof

** Originally, I tried to use lxml, but after many tries to get it working on my system, I moved on to using DOM. More recently, I tried using Etree to do this task. Please, for the love of all that is good, would somebody help me along with this? I am relatively new to Python and would like to learn how this works. I thank you in advance.

PleaseHelpTheNewGuy

2 Answers

10

You can use xml.etree.ElementTree, which is included with Python. There is also a companion C implementation (i.e. much faster), xml.etree.cElementTree. lxml.etree offers a superset of the functionality, but it's not needed for what you want to do.

The code provided by @Acorn works identically for me (Python 2.7, Windows 7) with each of the following imports:

# Keep exactly one of these imports active; the rest of the code is unchanged.
import xml.etree.ElementTree as et
# import xml.etree.cElementTree as et
# import lxml.etree as et
...
tree = et.fromstring(xmltext)
...
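
If you would rather not comment imports in and out by hand, one option is to let Python pick whichever module is available. This is only a sketch: the fallback order and the tiny `xmltext` sample are my own assumptions for illustration, not part of @Acorn's code.

```python
# Try the implementations from fastest to most widely available.
# Each failed import raises ImportError, so we simply fall through.
try:
    import lxml.etree as et  # third-party; only if it is installed
except ImportError:
    try:
        # C accelerator; present on Python 2.7, removed from newer Python 3 releases
        import xml.etree.cElementTree as et
    except ImportError:
        import xml.etree.ElementTree as et  # always in the standard library

# Hypothetical sample, just to show that the same calls work with any of the three.
xmltext = "<dicts><dict><integer>1375</integer></dict></dicts>"
tree = et.fromstring(xmltext)
first_int = tree.find('dict/integer').text
print(first_int)
```

Whichever module ends up bound to `et`, `fromstring()` and `find()` are called the same way, so the rest of the script does not need to know which one was imported.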

What OS are you using and what installation problems have you had with lxml?

John Machin
  • I'm using Ubuntu Maverick Meerkat Netbook installation...the latest lxml installation attempt included this message in my terminal: Unpacking python-lxml (from .../python-lxml_2.2.6-1_i386.deb) ... Setting up firmware-b43-installer (4.150.10.5-4) ... Not supported low-power chip with PCI id 14e4:4315! Aborting. – PleaseHelpTheNewGuy Oct 29 '11 at 22:21
  • I just tried the new imports with the code and got this error: Traceback (most recent call last): File "/home/worky.py", line 5, in <module> import lxml.etree as et ImportError: No module named lxml.etree – PleaseHelpTheNewGuy Oct 29 '11 at 22:26
  • (1) About your Ubuntu installation problem: I suggest that you try the lxml mailing list. (2) "No module named lxml.etree" ... that's because it's not installed. Have only one import active at a time; comment out the other two. – John Machin Oct 29 '11 at 23:08
  • ok, John, that kind of helps, I'm messing around with the code now... I might be able to swing it with this code, although... it's not exactly what I need... if I can get it to work, it IS what I need I guess. Thanks for the tips. – PleaseHelpTheNewGuy Oct 29 '11 at 23:12
7
import xml.etree.ElementTree as et
import csv

xmltext = """
<dicts>
    <key>1375</key>
    <dict>
        <key>Key 1</key><integer>1375</integer>
        <key>Key 2</key><string>Some String</string>
        <key>Key 3</key><string>Another string</string>
        <key>Key 4</key><string>Yet another string</string>
        <key>Key 5</key><string>Strings anyone?</string>
    </dict>
</dicts>
"""

# 'with' closes the file (and flushes the rows) even if parsing fails part-way
with open('output.txt', 'w') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)

    tree = et.fromstring(xmltext)

    # iterate over the dict elements
    for dict_el in tree.iterfind('dict'):
        data = []
        # get the text contents of each non-key element
        for el in dict_el:
            if el.tag == 'string':
                data.append(el.text)
            # if it's an integer element, convert to int so csv won't quote it
            elif el.tag == 'integer':
                data.append(int(el.text))
        writer.writerow(data)
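
The same loop can also be applied to a file rather than a string, and to a file that contains several 'dict' blocks. Below is a rough sketch assuming Python 3; the helper name `dict_rows` and the two-record sample are made up for illustration, and `io.StringIO` stands in for a real file so the example runs on its own. With a real file you would pass its path to `et.parse()` and an opened output file to `csv.writer()` instead.

```python
import csv
import io
import xml.etree.ElementTree as et


def dict_rows(root):
    """Yield one list of values for each <dict> that is a direct child of root."""
    for dict_el in root.iterfind('dict'):
        row = []
        for el in dict_el:
            if el.tag == 'integer':
                row.append(int(el.text))  # ints stay unquoted in the output
            elif el.tag == 'string':
                row.append(el.text)
            # <key> elements are skipped on purpose
        yield row


# Hypothetical sample with two records, shaped like the XML in the question.
sample = """<dict>
    <key>1375</key>
    <dict>
        <key>Key 1</key><integer>1375</integer>
        <key>Key 2</key><string>Some String</string>
    </dict>
    <key>1376</key>
    <dict>
        <key>Key 1</key><integer>1376</integer>
        <key>Key 2</key><string>Other</string>
    </dict>
</dict>"""

# et.parse() accepts a filename or a file-like object.
root = et.parse(io.StringIO(sample)).getroot()
rows = list(dict_rows(root))

buf = io.StringIO()  # stand-in for the output file
csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC).writerows(rows)
output = buf.getvalue()
print(output)
```

Note that csv separates fields with a bare comma, so the result is `1375,"Some String"` rather than the `1375, "Some String"` spacing shown in the question. If the space matters, the rows would have to be joined by hand instead of going through csv.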
Acorn
  • Thanks for posting so soon. The problem is, I cannot get lxml to run on my machine. I have python 2.7 and have made several attempts to get that module installed, but have failed. I was hoping there was another way that doesn't involve lxml. – PleaseHelpTheNewGuy Oct 29 '11 at 21:14
  • What OS are you running? – Acorn Oct 29 '11 at 21:40
  • I'm running Ubuntu Maverick Meerkat Netbook edition... – PleaseHelpTheNewGuy Oct 29 '11 at 22:27
  • How are you trying to install it? Have you tried installing it with pip? – Acorn Oct 29 '11 at 22:31
  • Ok, I am installing pip now, I will try to figure out how to use it to install it. BTW, it's snowing in New York... wth?! and thanks for the help. – PleaseHelpTheNewGuy Oct 29 '11 at 22:35
  • Once you have it installed, just `pip install lxml` – Acorn Oct 29 '11 at 22:38
  • Ugh: building 'lxml.etree' extension creating build/temp.linux-i686-2.6 creating build/temp.linux-i686-2.6/src creating build/temp.linux-i686-2.6/src/lxml gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.6 -c src/lxml/lxml.etree.c -o build/temp.linux-i686-2.6/src/lxml/lxml.etree.o -w In file included from src/lxml/lxml.etree.c:239: src/lxml/etree_defs.h:9: fatal error: libxml/xmlversion.h: No such file or directory compilation terminated. error: command 'gcc' failed with exit status 1 Rolling back uninstall of lxml – PleaseHelpTheNewGuy Oct 29 '11 at 22:44
  • I thought lxml was just 2.7, but 2.6 was being used there?! I have 2.7 installed, and that's the IDLE I have open and am using... – PleaseHelpTheNewGuy Oct 29 '11 at 22:47
  • I think you need to have libxml2 and libxslt installed. Check out these instructions: http://tightwadtechnica.com/?page_id=4163 – Acorn Oct 29 '11 at 22:48
  • wow. "Not supported low-power chip" and then it aborts when I try to install libxml2 using apt-get – PleaseHelpTheNewGuy Oct 29 '11 at 22:55
  • trying to use PIP now to install those... – PleaseHelpTheNewGuy Oct 29 '11 at 22:56
  • Ok, well... I'm still getting the same errors when trying to install the things needed to install lxml. This is why I was hoping to find a solution to the XML parsing without using lxml. – PleaseHelpTheNewGuy Oct 29 '11 at 23:03
  • @PleaseHelpTheNewGuy: **I've given you the solution: xml.etree.[c]ElementTree** – John Machin Oct 29 '11 at 23:10
  • John, I'm sorry for my ignorance, I'm not really sure what you mean by that : oh wait, ok, I'm guessing you mean the line that's almost like that. – PleaseHelpTheNewGuy Oct 29 '11 at 23:13
  • @PleaseHelpTheNewGuy: I meant with or without the "c" – John Machin Oct 29 '11 at 23:31
  • Ok, this has all been helpful. I should be able to hack and slash through the rest of what it is I am trying to do. Thanks so much guys. I'm still learning Python and this has been good. – PleaseHelpTheNewGuy Oct 29 '11 at 23:52
  • I've changed my example to use xml.etree – Acorn Oct 29 '11 at 23:58