10

I have a tar.gz file on my local machine called abc.aXML.gz, which contains many XML files. I want to extract some data from these files, but I don't know how to parse them using ElementTree and gzip.

import xml.etree.ElementTree as ET
import gzip
document = ET.parse(gzip("abc.aXML.gz"))
root = document.getroot()
user3071284
shahbaz khan

4 Answers

3

The code below worked for me to read and process a gzipped XML file.
I first use gzip to decompress the file and then parse it with ElementTree.

import gzip
import xml.etree.ElementTree as ET

# gzip.open returns a file object that decompresses on the fly
with gzip.open('input-xml.gz', 'rb') as f:
    tree = ET.parse(f)
root = tree.getroot()

print(root.tag)
print(root.attrib)
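
Since the question asks about finding data in the files, here is a minimal sketch of searching the parsed tree. The sample XML, the `user` tag, and the helper name `user_names` are made up for illustration; the data is compressed in memory so the snippet runs without any file on disk.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, gzipped in memory for the sake of the example.
SAMPLE_XML = b"<users><user id='1'>alice</user><user id='2'>bob</user></users>"
compressed = gzip.compress(SAMPLE_XML)

def user_names(fileobj):
    """Return the text of every <user> element in a gzipped XML stream."""
    # gzip.open accepts a path or an already open binary file object.
    with gzip.open(fileobj, 'rb') as f:
        root = ET.parse(f).getroot()
    # root.iter(tag) walks the whole tree and yields matching elements.
    return [user.text for user in root.iter('user')]

names = user_names(io.BytesIO(compressed))
print(names)
```

With a real file, passing the path (for example `user_names('input-xml.gz')`) should work the same way.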
VMAtm
upkar
0

UPDATED

To parse a gzipped XML file with the minidom parser, there are two options:

  1. pass the file object pointing to the XML file
  2. pass the full content as a string

[The second variant is the more efficient of the two.]

import gzip
from xml.dom.minidom import parseString

# open and read the gzipped XML file
with gzip.open('abc.aXML.gz', 'rb') as infile:
    content = infile.read()

# parse the XML content
dom = parseString(content)
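
The first option listed above, passing a file object, can be sketched as follows. This is only an illustration: the sample XML and the `item` tag are invented, and the data is gzipped in memory so the snippet runs on its own.

```python
import gzip
import io
from xml.dom.minidom import parse

# Hypothetical sample document, gzipped in memory for the sake of the example.
SAMPLE_XML = b"<catalog><item>first</item><item>second</item></catalog>"
gz_stream = io.BytesIO(gzip.compress(SAMPLE_XML))

# minidom.parse accepts a filename or an open file object,
# so the decompressing file object can be handed over directly.
with gzip.open(gz_stream, 'rb') as infile:
    dom = parse(infile)

items = [node.firstChild.data for node in dom.getElementsByTagName('item')]
print(items)
```

For a file on disk, `gzip.open('abc.aXML.gz', 'rb')` would take the place of the in-memory stream.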
Shahzad Barkati
0

To read xml files from a tar archive:

#!/usr/bin/env python
import tarfile
from contextlib import closing
from xml.etree import ElementTree as etree

with tarfile.open('xmls.tar.gz') as archive:
    for member in archive:
        if member.isreg() and member.name.endswith('.xml'): # regular xml file
            with closing(archive.extractfile(member)) as xmlfile:
                root = etree.parse(xmlfile).getroot()
                print(root)
                # use root here..
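
The question describes a tar.gz archive, but the file name looks like a single gzipped XML file, so it may be unclear which one you have. Below is a rough sketch, not a definitive recipe, that checks with `tarfile.is_tarfile` and handles both cases. The helper name `xml_roots` and the demo file names are hypothetical.

```python
import gzip
import io
import os
import tarfile
import tempfile
import xml.etree.ElementTree as ET

def xml_roots(path):
    """Yield (name, root) for each XML document found in path."""
    if tarfile.is_tarfile(path):  # also recognizes compressed tar archives
        with tarfile.open(path) as archive:
            for member in archive:
                if member.isreg() and member.name.endswith('.xml'):
                    with archive.extractfile(member) as xmlfile:
                        yield member.name, ET.parse(xmlfile).getroot()
    else:
        # Assume a single gzipped XML document.
        with gzip.open(path, 'rb') as f:
            yield os.path.basename(path), ET.parse(f).getroot()

# Demo with throwaway files so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    tar_path = os.path.join(tmp, 'xmls.tar.gz')
    with tarfile.open(tar_path, 'w:gz') as archive:
        for name, data in [('a.xml', b'<a/>'), ('b.xml', b'<b/>'),
                           ('notes.txt', b'skip me')]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    gz_path = os.path.join(tmp, 'single.xml.gz')
    with gzip.open(gz_path, 'wb') as f:
        f.write(b'<single/>')

    from_tar = [(name, root.tag) for name, root in xml_roots(tar_path)]
    from_gz = [(name, root.tag) for name, root in xml_roots(gz_path)]

print(from_tar)
print(from_gz)
```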
jfs
0

For me the following code worked:

import gzip
import io
from lxml import etree
from xml.dom import minidom

path                = 'Some path ending in .xml.gz'
a_tag_of_an_element = 'document'
# decompress into memory and wrap the bytes in a file-like object
# (io.BytesIO replaces the Python 2 only cStringIO module)
with gzip.open(path, 'rb') as gz:
    fakefile        = io.BytesIO(gz.read())
root                = etree.iterparse(fakefile, tag=a_tag_of_an_element)

metr = 0
for _, ch_tree in root:
    metr += 1
    the_tag = ch_tree.tag
    rough_string    =  etree.tostring(ch_tree, encoding='utf-8')
    reparsed        = minidom.parseString(rough_string)
    print(reparsed.toprettyxml(indent="\t"))

print(metr)

It parses the XML iteratively without first extracting the .gz file to disk; the decompressed content is held in memory.
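
If the file is too large to hold in memory, a streaming variant using only the standard library might look like the sketch below. It passes the decompressing file object straight to `ElementTree.iterparse` and clears each element after use. The sample XML, the `id` attribute, and the helper name `iter_ids` are assumptions for illustration.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, gzipped in memory for the sake of the example.
SAMPLE_XML = (b"<corpus><document id='1'>x</document>"
              b"<document id='2'>y</document></corpus>")
gz_stream = io.BytesIO(gzip.compress(SAMPLE_XML))

def iter_ids(fileobj, tag):
    """Yield the 'id' attribute of each <tag> element as it is parsed."""
    with gzip.open(fileobj, 'rb') as f:
        # 'end' events fire once an element has been read completely.
        for _, elem in ET.iterparse(f, events=('end',)):
            if elem.tag == tag:
                yield elem.get('id')
                elem.clear()  # drop the element's children and text to save memory

ids = list(iter_ids(gz_stream, 'document'))
print(ids)
```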