How to efficiently detect an XML schema without having the entire file in python

Question

I have a very large feed file that is sent as an XML document (5GB). What would be the fastest way to parse the structure of the main item node without previously knowing its structure? Is there a means in Python to do so 'on-the-fly' without having the complete xml loaded in memory? For example, what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?

Update: I've included an example XML fragment here: https://hastebin.com/uyalicihow.xml. I'm looking to extract something like a dataframe (or list or whatever other data structure you want to use) similar to the following:

Items/Item/Main/Platform       Items/Item/Info/Name
iTunes                         Chuck Versus First Class
iTunes                         Chuck Versus Bo

How could this be done? I've added a bounty to encourage answers here.

Are you looking to [XML_Schema_(W3C)](https://en.wikipedia.org/wiki/XML_Schema_(W3C))? — stovfl, Dec 18 '18 at 07:56
Your question is not clear. Please specify what you are exactly expecting? — Kousic, Dec 18 '18 at 12:09
I currently building a model to parse unknown `xml` schemas using `xpath` and `lxml` specifically for this bounty, but your question lacks several details, including one that I consider vital: **What will you do with the parsed `xml`?** add to `db`? write to `file`? execute `x if something`? **What's you main goal with this**? It may help us if you disclosure a bit more of what you're trying to achieve. — Pedro Lobito, Dec 18 '18 at 22:41
@PedroLobito thanks, let me update the question in a bit today. — David542, Dec 18 '18 at 23:17
I've a version of the parser that returns the `xpath` of all elements inside a chosen node, which may be what you're looking for ( `/Items/Item/...`). Don't waste your bounty, please improve your question. — Pedro Lobito, Dec 19 '18 at 19:21
@PedroLobito the goal here would be to initialize a pandas dataframe with the 'flattened' data from the xml file. — David542, Dec 23 '18 at 19:24

score 2 · Answer 1 · answered Dec 04 '18 at 21:02

Several people have misinterpreted this question, and re-reading it, it's really not at all clear. In fact there are several questions.

How to detect an XML schema

Some people have interpreted this as saying you think there might be a schema within the file, or referenced from the file. I interpreted it as meaning that you wanted to infer a schema from the content of the instance.

What would be the fastest way to parse the structure of the main item node without previously knowing its structure?

Just put it through a parser, e.g. a SAX parser. A parser doesn't need to know the structure of an XML file in order to split it up into elements and attributes. But I don't think you actually want the fastest parse possible (in fact, I don't think performance is that high on your requirements list at all). I think you want to do something useful with the information (you haven't told us what): that is, you want to process the information, rather than just parsing the XML.

Is there a python utility that can do so 'on-the-fly' without having the complete xml loaded in memory?

Yes, according to this page which mentions 3 event-based XML parsers in the Python world: https://wiki.python.org/moin/PythonXml (I can't vouch for any of them)

what if I just saved the first 5MB of the file (by itself it would be invalid xml, as it wouldn't have ending tags) -- would there be a way to parse the schema from that?

I'm not sure you know what the verb "to parse" actually means. Your phrase certainly suggests that you expect the file to contain a schema, which you want to extract. But I'm not at all sure you really mean that. And in any case, if it did contain a schema in the first 5Mb, you could find it just be reading the file sequentially, there would be no need to "save" the first part of the file first.

score 1 · Answer 2 · answered Dec 22 '18 at 14:20

Question: way to parse the structure of the main item node without previously knowing its structure

This class TopSequenceElement parse a XML File to find all Sequence Elements.
The default is, to break at the FIRST closing </...> of the topmost Element.
Therefore, it is independend of the file size or even by truncated files.

from lxml import etree
from collections import OrderedDict

class TopSequenceElement(etree.iterparse):
    """
    Read XML File
    results: .seq == OrderedDict of Sequence Element
             .element == topmost closed </..> Element
             .xpath == XPath to top_element
    """
    class Element:
        """
        Classify a Element
        """
        SEQUENCE = (1, 'SEQUENCE')
        VALUE = (2, 'VALUE')

        def __init__(self, elem, event):
            if len(elem):
                self._type = self.SEQUENCE
            else:
                self._type = self.VALUE

            self._state = [event]
            self.count = 0
            self.parent = None
            self.element = None

        @property
        def state(self):
            return self._state

        @state.setter
        def state(self, event):
            self._state.append(event)

        @property
        def is_seq(self):
            return self._type == self.SEQUENCE

        def __str__(self):
            return "Type:{}, Count:{}, Parent:{:10} Events:{}"\
                .format(self._type[1], self.count, str(self.parent), self.state)

    def __init__(self, fh, break_early=True):
        """
        Initialize 'iterparse' only to callback at 'start'|'end' Events

        :param fh: File Handle of the XML File
        :param break_early: If True, break at FIRST closing </..> of the topmost Element
                            If False, run until EOF
        """
        super().__init__(fh, events=('start', 'end'))
        self.seq = OrderedDict()
        self.xpath = []
        self.element = None

        self.parse(break_early)

    def parse(self, break_early):
        """
        Parse the XML Tree, doing
          classify the Element, process only SEQUENCE Elements
          record, count of end </...> Events, 
                  parent from this Element
                  element Tree of this Element

        :param break_early: If True, break at FIRST closing </..> of the topmost Element
        :return: None
        """
        parent = []

        try:
            for event, elem in self:
                tag = elem.tag
                _elem = self.Element(elem, event)

                if _elem.is_seq:
                    if event == 'start':
                        parent.append(tag)

                        if tag in self.seq:
                            self.seq[tag].state = event
                        else:
                            self.seq[tag] = _elem

                    elif event == 'end':
                        parent.pop()
                        if parent:
                            self.seq[tag].parent = parent[-1]

                        self.seq[tag].count += 1
                        self.seq[tag].state = event

                        if self.seq[tag].count == 1:
                            self.seq[tag].element = elem

                        if break_early and len(parent) == 1:
                            break

        except etree.XMLSyntaxError:
            pass

        finally:
            """
            Find the topmost completed '<tag>...</tag>' Element
            Build .seq.xpath
            """
            for key in list(self.seq):
                self.xpath.append(key)
                if self.seq[key].count > 0:
                    self.element = self.seq[key].element
                    break

            self.xpath = '/'.join(self.xpath)

    def __str__(self):
        """
        String Representation of the Result 
        :return: .xpath and list of .seq
        """
        return "Top Sequence Element:{}\n{}"\
            .format( self.xpath,
                     '\n'.join(["{:10}:{}"
                               .format(key, elem) for key, elem in self.seq.items()
                                ])
                     )

if __name__ == "__main__":
    with open('../test/uyalicihow.xml', 'rb') as xml_file:
        tse = TopSequenceElement(xml_file)
        print(tse)

Output:

Top Sequence Element:Items/Item
Items     :Type:SEQUENCE, Count:0, Parent:None       Events:['start']
Item      :Type:SEQUENCE, Count:1, Parent:Items      Events:['start', 'end', 'start']
Main      :Type:SEQUENCE, Count:2, Parent:Item       Events:['start', 'end', 'start', 'end']
Info      :Type:SEQUENCE, Count:2, Parent:Item       Events:['start', 'end', 'start', 'end']
Genres    :Type:SEQUENCE, Count:2, Parent:Item       Events:['start', 'end', 'start', 'end']
Products  :Type:SEQUENCE, Count:1, Parent:Item       Events:['start', 'end']
... (omitted for brevity)

Step 2: Now, you know there is a <Main> Tag, you can do:

print(etree.tostring(tse.element.find('Main'), pretty_print=True).decode())

<Main>
      <Platform>iTunes</Platform>
      <PlatformID>353736518</PlatformID>
      <Type>TVEpisode</Type>
      <TVSeriesID>262603760</TVSeriesID>
    </Main>

Step 3: Now, you know there is a <Platform> Tag, you can do:
print(etree.tostring(tse.element.find('Main/Platform'), pretty_print=True).decode())

<Platform>iTunes</Platform>

Tested with Python:3.5.3 - lxml.etree:3.7.1

chthonicdaemon · Answer 3 · 2018-12-24T04:25:33.393

My interpretation of your needs is that you want to be able to parse the partial file and build up the structure of the document as you go. I've taken some assumptions from the file you uploaded:

Fundamentally you want to be parsing collections of things which have similar properties - I'm inferring this from the way you presented your desired output as a table with rows containing the values.
You expect these collections of things to have the same number of values.
You need to be able to parse partial files.
You don't worry about the properties of elements, just their contents.

I'm using xml.sax as this deals with arbitrarily large files and doesn't need to read the whole file into memory. Note that the strategy I'm following now doesn't actually scale that well as I'm storing all the elements in memory to build the dataframe, but you could just as well output the paths and contents.

In the sample file there is a problem with having one row per Item since there are multiples of the Genre tag and there are also multiple Product tags. I've handled the repeated Genre tags by appending them. This relies on the Genre tags appearing consecutively. It is not at all clear how the Product relationships can be handled in a single table.

import xml.sax

from collections import defaultdict

class StructureParser(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.text = ''
        self.path = []
        self.datalist = defaultdict(list)
        self.previouspath = ''

    def startElement(self, name, attrs):
        self.path.append(name)

    def endElement(self, name):
        strippedtext = self.text.strip()
        path = '/'.join(self.path)
        if strippedtext != '':
            if path == self.previouspath:
                # This handles the "Genre" tags in the sample file
                self.datalist[path][-1] += f',{strippedtext}'
            else:
                self.datalist[path].append(strippedtext)
        self.path.pop()
        self.text = ''
        self.previouspath = path

    def characters(self, content):
        self.text += content

You'd use this like this:

parser = StructureParser()

try:
    xml.sax.parse('uyalicihow.xml', parser)
except xml.sax.SAXParseException:
    print('File probably ended too soon')

This will read the example file just fine.

Once this has read and probably printed "File probably ended to soon", you have the parsed contents in parser.datalist.

You obviously want to have just the parts which read successfully, so you can figure out the shortest list and build a DataFrame with just those paths:

import pandas as pd

smallest_items = min(len(e) for e in parser.datalist.values())
df = pd.DataFrame({key: value for key, value in parser.datalist.items() if len(value) == smallest_items})

This gives something similar to your desired output:

  Items/Item/Main/Platform Items/Item/Main/PlatformID Items/Item/Main/Type 
0                   iTunes                  353736518            TVEpisode   
1                   iTunes                  495275084            TVEpisode

The columns for the test file which are matched here are

>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
       'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
       'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
       'Items/Item/Info/HighestResolution',
       'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
       'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
       'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
       'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
       'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
       'Items/Item/Products/Product/URL'],
      dtype='object')

Based on your comments, it appears as though it is more important to you to have all the elements represented, but perhaps just showing a preview, in which case you can perhaps use only the first elements from the data. Note that in this case the Products entries won't match the Item entries.

df = pd.DataFrame({key: value[:smallest_items] for key, value in parser.datalist.items()})

Now we get all the paths:

>> df.columns
Index(['Items/Item/Main/Platform', 'Items/Item/Main/PlatformID',
       'Items/Item/Main/Type', 'Items/Item/Main/TVSeriesID',
       'Items/Item/Info/BaseURL', 'Items/Item/Info/EpisodeNumber',
       'Items/Item/Info/HighestResolution',
       'Items/Item/Info/LanguageOfMetadata', 'Items/Item/Info/LastModified',
       'Items/Item/Info/Name', 'Items/Item/Info/ReleaseDate',
       'Items/Item/Info/ReleaseYear', 'Items/Item/Info/RuntimeInMinutes',
       'Items/Item/Info/SeasonNumber', 'Items/Item/Info/Studio',
       'Items/Item/Info/Synopsis', 'Items/Item/Genres/Genre',
       'Items/Item/Products/Product/URL',
       'Items/Item/Products/Product/Offers/Offer/Price',
       'Items/Item/Products/Product/Offers/Offer/Currency'],
      dtype='object')

thanks, this is on the right track but missing a few things. First, when I do `df.columns` it misses about 20% of the entries. For example, it doesn't include `/Products` or anything of its children. Second, the paths look like this for me: `'html/body/div/div/button/Items/Item/Items/Item/Genres/Genre'`. Why does it start with `html` and not `Items` ? — David542, Dec 23 '18 at 19:10
Finally, it needs to work on truncated files -- the files usually will not be well-formed, as we're just grabbing the first 5MB of the file to parse the first 100 lines to show the user a preview (the files may be 10GB). — David542, Dec 23 '18 at 19:15
@David542 1. Did you use the `parser` to parse another file before testing the XML file you uploaded? It will "remember" all the files it parsed, so you need to make a new one (with `parser = StructureParser()`) for every file. 2. My examples were all done with the truncated file you uploaded, no problem with that. — chthonicdaemon, Dec 24 '18 at 04:28

score 1 · Answer 4 · answered Dec 23 '18 at 20:07

For very big files, reading is always a problem. I would suggest a simple algorithmic behavior for the reading of the file itself. The key point is always the xml tags inside the files. I would suggest you read the xml tags and sort them inside a heap and then validate the content of the heap accordingly.

Reading the file should also happen in chunks:

import xml.etree.ElementTree as etree
for event, elem in etree.iterparse(xmL, events=('start', 'end', 'start-ns', 'end-ns')):
  store_in_heap(event, element)

This will parse the XML file in chunks at a time and give it to you at every step of the way. start will trigger when a tag is first encountered. At this point elem will be empty except for elem.attrib that contains the properties of the tag. end will trigger when the closing tag is encountered, and everything in-between has been read.

you can also benefit from the namespaces that are in start-ns and end-ns. ElementTree has provided this call to gather all the namespaces in the file. Refer to this link for more information about namespaces

thanks for the suggestion. Are you able to provide a more precise example given the input given above? For example, how to parse the actual tags and flatten it, etc.? — David542, Dec 23 '18 at 20:14

score 0 · Answer 5 · answered Dec 03 '18 at 22:19

There are a number of tools around that will generate a schema from a supplied instance document. I don't know how many of them will work on a 5Gb input file, and I don't know how many of them can be invoked from Python.

Many years ago I wrote a Java-based, fully streamable tool to generate a DTD from an instance document. It hasn't been touched in years but it should still run: https://sourceforge.net/projects/saxon/files/DTDGenerator/7.0/dtdgen7-0.zip/download?use_mirror=vorboss

There are other tools listed here: Any tools to generate an XSD schema from an XML instance document?

B Lean · Answer 6 · 2019-01-07T23:10:13.220

As I see it your question is very clear. I give it a plus one up vote for clearness. You are wanting to parse text.

Write a little text parser, we can call that EditorB, that reads in chunks of the file or at least line by line. Then edit or change it as you like and re-save that chunk or line.

It can be easy in Windows from 98SE on. It should be easy in other operating systems.

The process is (1) Adjust (manually or via program), as you currently do, we can call this EditorA, that is editing your XML document, and save it; (2) stop EditorA; (3) Run your parser or editor, EditorB, on the saved XML document either manually or automatically (started via detecting that the XML document has changed via date or time or size, etc.); (4) Using EditorB, save manually or automatically the edits from step 3; (5) Have your EditorA reload the XML document and go on from there; (6) do this as often as is necessary, making edits with EditorA and automatically adjusting them outside of EditorA by using EditorB.

Edit this way before you send the file.

It is a lot of typing to explain, but XML is just a glorified text document. It can be easily parsed and edited and saved, either character by character or by larger amounts line by line or in chunks.

As a further note, this can be applied via entire directory contained documents or system wide documents as I have done in the past.

Make certain that EditorA is stopped before EditorB is allowed to start it's changing. Then stop EditorB before restarting EditorA. If you set this up as I described, then EditorB can be run continually in the background, but put in it an automatic notifier (maybe a message box with options, or a little button that is set formost on the screen when activated) that allows you to turn off (on continue with) EditorA before using EditorB. Or, as I would do it, put in a detector to keep EditorB from executing its own edits as long as EditorA is running.

B Lean

How to efficiently detect an XML schema without having the entire file in python

6 Answers6

Linked