I'm new to Python and have heard that it is one of the best ways to parse fairly large XML files (around 150 MB). I can't get my head around how to iterate through the tags and extract only the <hw> and <defunit> elements, as they are fairly deeply nested.
I have some XML formatted as below, and I need to extract the text of the <hw> and <defunit> elements from it using Python and write it out in .csv format.
<?xml version="1.0" encoding="UTF-8"?>
<dps-data xmlns="urn:DPS2-metadata" project="SCRABBLELARGE" guid="7d6b7164fde1e064:34368a61:14306b637ab:-8000--4a25ae5c-c104-4c7a-bba5-b434dd4d9314">
<superentry xmlns="urn:COLL" xmlns:d="urn:COLL" xmlns:e="urn:IDMEE" e:id="u583c10bfdbd326ba.31865a51.12110e76de1.-336">
<entry publevel="1" id="a000001" e:id="u583c10bfdbd326ba.31865a51.12110e76de1.-335">
<hwblk>
<hwgrp>
<hwunit>
<hw>aa</hw>
<ulsrc>edsh</ulsrc>
</hwunit>
</hwgrp>
</hwblk>
<datablk>
<gramcat publevel="1" id="a000001.001">
<pospgrp>
<pospunit>
<posp value="noun" />
</pospunit>
</pospgrp>
<sensecat id="a000001.001.01" publevel="1">
<defgrp>
<defunit>
<def>volcanic rock</def>
</defunit>
</defgrp>
</sensecat>
</gramcat>
</datablk>
</entry>
</superentry>
</dps-data>
The .csv format I'd like to see it in is simply:
hw, defunit
aa, volcanic rock
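For reference, here is a minimal sketch of one possible streaming approach, using only the standard library's xml.etree.ElementTree.iterparse and csv modules. It assumes every <entry>, <hw> and <defunit> lives in the urn:COLL namespace, as in the sample above, and that one row per <defunit> is wanted. The function names extract_rows and write_csv and the file names in the usage note are made up for illustration.

```python
import csv
import xml.etree.ElementTree as ET

# Assumption: the default namespace declared on <superentry> in the sample.
NS = "{urn:COLL}"


def extract_rows(source):
    """Yield (hw, defunit) text pairs, one per <defunit>, while streaming."""
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "entry":
            continue
        hw = elem.findtext(f".//{NS}hw", default="").strip()
        for defunit in elem.iter(NS + "defunit"):
            # Gather all nested text (e.g. from <def>) and normalise whitespace.
            text = " ".join("".join(defunit.itertext()).split())
            yield hw, text
        # Drop the children of the processed entry to limit memory growth.
        elem.clear()


def write_csv(source, target):
    """Write a header row followed by every extracted (hw, defunit) pair."""
    with open(target, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["hw", "defunit"])
        writer.writerows(extract_rows(source))
```

It could be called as write_csv("dictionary.xml", "output.csv"), where both file names are placeholders. Note that csv.writer separates fields with a bare comma (hw,defunit), without the space shown in the desired output above, and quotes any definition that itself contains a comma.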