1

I'm new to Python and have heard that it is one of the best ways to parse fairly large XML files (150MB). I can't get my head around how to iterate through the tags and extract only the <hw> and <defunit> tags as it's fairly deeply nested.

I have some XML formatted as below, and I need to extract the "hw" and "defunit" tags from it using Python and convert them into a .csv format.

<?xml version="1.0" encoding="UTF-8"?>
<dps-data xmlns="urn:DPS2-metadata" project="SCRABBLELARGE" guid="7d6b7164fde1e064:34368a61:14306b637ab:-8000--4a25ae5c-c104-4c7a-bba5-b434dd4d9314">
    <superentry xmlns="urn:COLL" xmlns:d="urn:COLL" xmlns:e="urn:IDMEE" e:id="u583c10bfdbd326ba.31865a51.12110e76de1.-336">
        <entry publevel="1" id="a000001" e:id="u583c10bfdbd326ba.31865a51.12110e76de1.-335">
            <hwblk>
                <hwgrp>
                    <hwunit>
                        <hw>aa</hw>
                        <ulsrc>edsh</ulsrc>
                    </hwunit>
                </hwgrp>
            </hwblk>
            <datablk>
                <gramcat publevel="1" id="a000001.001">
                    <pospgrp>
                        <pospunit>
                            <posp value="noun" />
                        </pospunit>
                    </pospgrp>
                    <sensecat id="a000001.001.01" publevel="1">
                        <defgrp>
                            <defunit>
                                <def>volcanic rock</def>
                            </defunit>
                        </defgrp>
                    </sensecat>
                </gramcat>
            </datablk>
        </entry>
    </superentry>
  </dps-data>

The .csv format I'd like to see it in is simply:

hw, defunit
aa, volcanic rock
Parfait
  • 104,375
  • 17
  • 94
  • 125
Powellellogram
  • 426
  • 1
  • 6
  • 17

3 Answers3

3

The lxml library is capable of very powerful XML parsing, and can be used to iterate over an XML tree to search for specific elements.

from lxml import etree

with open(r'path/to/xml', 'r') as xml:
    text = xml.read()
tree = lxml.etree.fromstring(text)
row = ['', '']
for item in tree.iter('hw', 'def'):
    if item.tag == 'hw':
       row[0] = item.text
    elif item.tag == 'def':
       row[1] = item.text

line = ','.join(row)

with open(r'path/to/csv', 'a') as csv:
     csv.write(line + '\n')

How you build the CSV file is largely based upon preference, but I have provided a trivial example above. If there are multiple <dps-data> tags, you could extract those elements first (which can be done with the same tree.iter method shown above), and then apply the above logic to each of them.

EDIT: I should point out that this particular implementation reads the entire XML file into memory. If you are working with a single 150mb file at a time, this should not be a problem, but it's just something to be aware of.

VergeA
  • 89
  • 6
  • I get the error "can only concatenate list (not "str") to list" when using this script. I understand I need to convert the value of `row` to string but am unsure how! – Powellellogram Sep 26 '16 at 10:40
  • Can you post your full error message? Additionally, I realized that the script should be searching for `def`, not `defunit`, as the `def` tag contains the text that you want. – VergeA Sep 26 '16 at 13:26
  • `C:\Users\...\Local\Programs\Python\Python35-32\python.exe C:/Users/.../PycharmProjects/xmltocsv2/xmltocsv.py Traceback (most recent call last): File "C:/Users/mbpowell/PycharmProjects/xmltocsv2/xmltocsv.py", line 18, in csv.write(row + '\n') TypeError: can only concatenate list (not "str") to list` is the error message I get. – Powellellogram Sep 26 '16 at 14:37
  • I have found my mistake and edited the code. The issues was that when I called `','.join(row)`, I did not assign the result to a variable, meaning that the string produced was never saved for later use. Use the new `line` variable instead. – VergeA Sep 26 '16 at 15:07
  • Yes I tried that variable fix too. The script then runs with no errors, but the csv file has nothing written to it. – Powellellogram Sep 26 '16 at 15:16
  • I just tested the current code with the xml you provided, and my csv file contains 'aa,volcanic rock', as it should. Have you tried running your code in a debugger, and seeing what happens at the step in which the CSV is written to? – VergeA Sep 26 '16 at 15:59
2

How about this:

from xml.dom import minidom

xmldoc = minidom.parse('your.xml')
hw_lst = xmldoc.getElementsByTagName('hw')
defu_lst = xmldoc.getElementsByTagName('def')

with open('your.csv', 'a') as out_file:
    for i in range(len(hw_lst)):
        out_file.write('{0}, {1}\n'.format(hw_lst[i].firstChild.data, defu_lst[i].firstChild.data)) 
picmate 涅
  • 3,951
  • 5
  • 43
  • 52
  • This script works up to a point, but renders my computer unusable whilst the script runs. Then I get an ascii error part way through and it doesn't finish the entire file. Thanks though! – Powellellogram Sep 26 '16 at 09:28
  • @Powellellogram, do you have non -ascii characters in your xml file? Then you need to handle them accordingly. When running this program, did you check why your computer freeze? Did you try to debug and step through the code to see what could be wrong? If not, I would recommend you to do that. – picmate 涅 Sep 26 '16 at 13:48
2

Consider XSLT, the XML transformation language that can manipulate source .xml files to various end use structures including text files like .csv, specifying method="text" in <xsl:output>.

Python's lxml module can run XSLT 1.0 scripts. Below assumes the <entry> tag and its children repeat with different data. And two undeclared namespaces had to be handled in the xsl. Also, XSLT tends to be very efficient on smaller sized XML but varies depending on computer environments.

XSLT Script (save as .xsl to be referenced below)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
               xmlns:ns1="urn:DPS2-metadata" xmlns="urn:COLL">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" method="text"/>
<xsl:strip-space elements="*"/>

<xsl:template match="/ns1:dps-data/ns1:superentry">
   <xsl:text>hw,defunit</xsl:text><xsl:text>&#xa;</xsl:text>
   <xsl:apply-templates select="ns1:entry"/>
</xsl:template>

<xsl:template match="ns1:entry" namespace="urn:COLL">    
   <xsl:value-of select="descendant::ns1:hw" namespace="urn:COLL"/><xsl:text>,</xsl:text>
   <xsl:value-of select="descendant::ns1:defunit" namespace="urn:COLL"/>
   <xsl:text>&#xa;</xsl:text>
</xsl:template>

Pyton Script

import lxml.etree as ET

// LOAD XML AND XSL SOURCES
xml = ET.parse('Input.xml')
xsl = ET.parse('XSLTScript.xsl')

// TRANSFORM SOURCE
transform = ET.XSLT(xsl)
newdom = transform(xml)

// SAVE AS .CSV
with open('Output.csv'), 'wb') as f:
    f.write(newdom)

# hw,defunit
# aa,volcanic rock
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • This seems to write every word to the csv file, but I'm having an issue that there are no line breaks or commas for each row or column. – Powellellogram Sep 27 '16 at 10:52
  • Did you run the script? The xslt contains commas and line break entity, ` ` in `` tags. – Parfait Sep 27 '16 at 13:25