Parsing an xml.gz file in Python

Question

I have a tar.gz file on my local machine called abc.aXML.gz, which contains many XML files. I want to extract some data from these files, but I don't know how to parse them using ElementTree and gzip.

import xml.etree.ElementTree as ET
import gzip
document = ET.parse(gzip("abc.aXML.gz"))
root = document.getroot()

Possible duplicate of [How do I parse XML in Python?](http://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python) — Shahzad Barkati, Oct 26 '15 at 13:19
here's a [code example how to parse incrementally a gzip file that contains a *single* xml document](http://stackoverflow.com/a/26435241/4279) — jfs, Oct 26 '15 at 13:58
do you mean you have a **tar.gz** (note: `tar`) archive that contains *multiple* files? A gzip archive may contain only *one* file. — jfs, Oct 26 '15 at 14:04

score 3 · Answer 1 · edited Aug 06 '19 at 11:18

The code below worked for me to read and process a gzipped XML file.
I first open the file with gzip, which decompresses it, and then pass the resulting file object to ElementTree.

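import gzip
import xml.etree.ElementTree as ET

# gzip.open decompresses on the fly; ElementTree can parse the file object directly
with gzip.open('input-xml.gz', 'rb') as xml_file:
    tree = ET.parse(xml_file)
root = tree.getroot()

print(root.tag)
print(root.attrib)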
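
Since the question is about finding data in the file, here is a minimal sketch of pulling out the text of every element with a given tag. The helper name `find_texts`, the sample file, and the `item` tag are made up for illustration; the sample file is written first only so the snippet can be run as-is.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET

def find_texts(gz_path, tag):
    """Return the text of every element named `tag` in a gzipped XML file."""
    with gzip.open(gz_path, 'rb') as xml_file:
        root = ET.parse(xml_file).getroot()
    # root.iter(tag) walks the whole tree in document order
    return [elem.text for elem in root.iter(tag)]

# Hypothetical sample data, written to a temporary directory.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.xml.gz')
with gzip.open(sample_path, 'wb') as f:
    f.write(b'<items><item>a</item><item>b</item></items>')

item_texts = find_texts(sample_path, 'item')
print(item_texts)
```

This loads the whole tree into memory, which is usually fine for small to medium files.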

answered Jul 23 '19 at 09:08 by upkar · edited Aug 06 '19 at 11:18 by VMAtm

Shahzad Barkati · Answer 2 · 2019-04-27T12:47:02.250

0

UPDATED

To parse a gzipped XML file with the minidom parser, there are two options:

hand over the file object pointing to the XML file
hand over the full content as a string

[The second option is the more efficient variant.]

import gzip
from xml.dom import minidom

# open and read the gzipped XML file (gzip.open decompresses on the fly)
with gzip.open('abc.aXML.gz', 'rb') as infile:
    content = infile.read()

# parse the XML content
dom = minidom.parseString(content)
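
For comparison, here is a rough sketch of both options listed above side by side: passing the open file object to `minidom.parse`, and passing the decompressed bytes to `minidom.parseString`. The sample file and its contents are invented so that the snippet runs on its own.

```python
import gzip
import os
import tempfile
from xml.dom import minidom

# Hypothetical sample file, written to a temporary directory.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.xml.gz')
with gzip.open(sample_path, 'wb') as f:
    f.write(b'<root><child name="x"/></root>')

# Option 1: hand minidom the open (decompressing) file object.
with gzip.open(sample_path, 'rb') as infile:
    dom_from_file = minidom.parse(infile)

# Option 2: read the full decompressed content and parse it as a string.
with gzip.open(sample_path, 'rb') as infile:
    dom_from_string = minidom.parseString(infile.read())

root_tag = dom_from_file.documentElement.tagName
same_document = dom_from_file.toxml() == dom_from_string.toxml()
print(root_tag, same_document)
```

Both calls should build the same DOM, so the choice mostly comes down to whether you already have the content in memory.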

answered Oct 26 '15 at 13:57 by Shahzad Barkati · edited Apr 27 '19 at 12:47

Changing the last line to `dom = minidom.parseString(content)` worked for me – anish, Apr 24 '19 at 13:53

jfs · Answer 3 · 2015-10-26T18:28:17.273

To read xml files from a tar archive:

#!/usr/bin/env python
import tarfile
from contextlib import closing
from xml.etree import ElementTree as etree

with tarfile.open('xmls.tar.gz') as archive:
    for member in archive:
        if member.isreg() and member.name.endswith('.xml'): # regular xml file
            with closing(archive.extractfile(member)) as xmlfile:
                root = etree.parse(xmlfile).getroot()
                print(root)
                # use root here..
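
If you are not sure whether the file is a tar.gz archive or a plain gzipped XML document, one possible approach is to ask `tarfile.is_tarfile` first and fall back to `gzip.open`. This is only a sketch: the helper name `iter_xml_roots` and the sample files are invented for illustration, and it assumes the archive members of interest end in `.xml`.

```python
import gzip
import io
import os
import tarfile
import tempfile
from xml.etree import ElementTree as etree

def iter_xml_roots(path):
    """Yield (name, root) for each XML document found in `path`."""
    if tarfile.is_tarfile(path):  # also detects compressed tar archives
        with tarfile.open(path) as archive:
            for member in archive:
                if member.isreg() and member.name.endswith('.xml'):
                    with archive.extractfile(member) as xmlfile:
                        yield member.name, etree.parse(xmlfile).getroot()
    else:
        # Assumed to be a single gzipped XML document.
        with gzip.open(path, 'rb') as xmlfile:
            yield os.path.basename(path), etree.parse(xmlfile).getroot()

# Hypothetical sample files so the sketch can be run as-is.
workdir = tempfile.mkdtemp()

tar_path = os.path.join(workdir, 'xmls.tar.gz')
with tarfile.open(tar_path, 'w:gz') as archive:
    for name in ('a.xml', 'b.xml'):
        data = ('<doc name="%s"/>' % name).encode('utf-8')
        info = tarfile.TarInfo(name)
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))

gz_path = os.path.join(workdir, 'single.xml.gz')
with gzip.open(gz_path, 'wb') as f:
    f.write(b'<doc name="single"/>')

tar_names = [name for name, _ in iter_xml_roots(tar_path)]
gz_names = [name for name, _ in iter_xml_roots(gz_path)]
print(tar_names, gz_names)
```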

score 0 · Answer 4 · answered Oct 30 '18 at 14:34

For me the following code worked:

import gzip
import io
from lxml import etree
from xml.dom import minidom

path                = 'Some path ending in .xml.gz'
a_tag_of_an_element = 'document'

# Python 3: io.BytesIO replaces Python 2's cStringIO.StringIO.
# The whole decompressed file is read into memory here.
with gzip.open(path, 'rb') as gz_file:
    fakefile = io.BytesIO(gz_file.read())
root = etree.iterparse(fakefile, tag=a_tag_of_an_element)

metr = 0  # number of matching elements seen
for _, ch_tree in root:
    metr += 1
    the_tag = ch_tree.tag
    # round-trip through minidom only to pretty-print the element
    rough_string = etree.tostring(ch_tree, encoding='utf-8')
    reparsed     = minidom.parseString(rough_string)
    print(reparsed.toprettyxml(indent="\t"))

print(metr)

It iterates over the matching elements of the XML file without first extracting the .gz file to disk. Note that the decompressed content is still read into memory in one go.
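
If lxml is not available, or the file is too large to read into memory at once, a similar idea can be sketched with the standard library by passing the `gzip.open` file object straight to `ElementTree.iterparse`. The helper name `count_elements`, the sample file, and the `document` tag are assumptions made for this example.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET

def count_elements(gz_path, tag):
    """Count elements named `tag`, reading the gzipped XML incrementally."""
    count = 0
    with gzip.open(gz_path, 'rb') as xml_file:
        # An 'end' event fires once an element has been completely parsed.
        for _, elem in ET.iterparse(xml_file, events=('end',)):
            if elem.tag == tag:
                count += 1
                # ... use elem here, then drop its children and text
                elem.clear()
    return count

# Hypothetical sample file so the sketch can be run as-is.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.xml.gz')
with gzip.open(sample_path, 'wb') as f:
    f.write(b'<root><document id="1"/><document id="2"/><other/></root>')

document_count = count_elements(sample_path, 'document')
print(document_count)
```

Here the data is decompressed and parsed in chunks, so the full file never has to be held as one string.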

`import cStringIO` should be `from io import StringIO` — McM, Aug 05 '22 at 19:42
