I am working on an XML parser. The goal is to parse a number of different XML files in which prefixes and tags remain consistent but the namespaces change.

I am therefore trying one of the following:

  • to parse the XML by <prefix:tag> alone, without resolving (replacing) the prefix with its namespace, since the prefixes remain unchanged from document to document;
  • to load the namespaces automatically, so that the identifier (<prefix:tag>) can be resolved to the proper namespace;
  • to parse the XML by tag name only.

I have tried with xml.etree.ElementTree.

I also had a look at lxml. I did not find any configuration option of its XMLParser that could help me, although I read an answer here whose author suggests that lxml should be able to collect namespaces for me automatically.

Interestingly, parsed_file = etree.XML(file) fails with the error:

lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

One example of the files I would like to parse is here

NoIdeaHowToFixThis

2 Answers

Do not care about ns prefixes, care about complete namespaces

Sometimes people care about those short prefixes and forget that they are of secondary importance. A prefix is only a short reference to a fully qualified namespace. E.g.

xmlns:trw="http://www.trw.com/20131231"

in XML means that from now on "trw:" stands for the fully qualified namespace "http://www.trw.com/20131231". Note that this prefix can be redefined to any other namespace in any following element and may then get a completely different meaning.

On the other hand, when you care about the real meaning, which here is the fully qualified namespace, you can think of "trw:row" as "{http://www.trw.com/20131231}row". This translated name is reliable and does not change when prefixes change.
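
To illustrate the "{namespace}tag" form, here is a minimal sketch using the standard library's xml.etree.ElementTree (lxml reports tags in the same notation). Both documents are made-up samples, and the second namespace URI is a hypothetical one invented for the comparison.

```python
import xml.etree.ElementTree as ET

# Two made-up documents: same prefix, different namespace URIs.
# The 2014 URI is hypothetical and only serves the comparison.
DOC_2013 = (
    '<trw:root xmlns:trw="http://www.trw.com/20131231">'
    "<trw:row>a</trw:row></trw:root>"
)
DOC_2014 = (
    '<trw:root xmlns:trw="http://www.trw.com/20141231">'
    "<trw:row>b</trw:row></trw:root>"
)

root_2013 = ET.fromstring(DOC_2013)
root_2014 = ET.fromstring(DOC_2014)

# The parser stores the resolved name, not the prefix.
tag_2013 = root_2013[0].tag
tag_2014 = root_2014[0].tag

print(tag_2013)  # {http://www.trw.com/20131231}row
print(tag_2014)  # {http://www.trw.com/20141231}row
```

The prefix "trw" is identical in both files, yet the two rows have different names as far as the parser is concerned, which is why a lookup has to be given the namespace or has to ignore it deliberately.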

Parsing referred xml

The link to http://edgar.sec.gov/Archives/edgar/data/1267097/000104746914000925/trw-20131231.xml leads to an XML file which xmlstarlet validates and which lxml is able to parse.

The error message you show refers to the very first character of the stream, so chances are that either your file starts with a BOM, or you are trying to read XML that is gzipped and must be decompressed first.
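
A rough way to rule out both causes is to look at the first bytes before parsing. The sketch below uses only the standard library; parse_xml_bytes is a hypothetical helper name, not part of lxml or ElementTree. One more possibility worth checking (an assumption, since the question does not show what `file` holds): etree.XML() parses document text, so passing it a file name would make it try to parse the name itself, whereas etree.parse() accepts a file name or file object.

```python
import codecs
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def parse_xml_bytes(raw: bytes) -> ET.Element:
    """Hypothetical helper: undo gzip and a UTF-8 BOM, then parse."""
    if raw.startswith(GZIP_MAGIC):
        raw = gzip.decompress(raw)
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return ET.fromstring(raw)


# Usage with a file on disk (the path is a placeholder):
# with open("trw-20131231.xml", "rb") as fh:
#     root = parse_xml_bytes(fh.read())
```

Reading the file in binary mode matters here: once the bytes have been decoded to text, the gzip check no longer applies and a BOM shows up as a stray character in front of the first "<".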

lxml and namespaces

lxml works well with namespaces. It allows you to use XPath expressions that refer to namespaces. Controlling the namespace prefixes on output is a bit more complex, as it depends on the xmlns attributes, which are part of the serialized document. If you want to modify the prefixes, you must somehow organize these xmlns attributes, often by moving all of them to the root element. At the same time, lxml keeps track of the fully qualified namespace of each element, so at the moment of serialization it respects this full name as well as the currently valid prefix for this namespace.

Handling these xmlns attributes takes a bit more code; refer to the lxml documentation.
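
For the reading side, the prefix-to-namespace map can be collected from the document itself and then passed to the lookup, so the code keeps using the stable prefixes while the URIs vary per file. This is a minimal sketch with the standard library; collect_namespaces is a hypothetical helper name and the sample document is made up. lxml exposes a comparable map as the nsmap attribute of an element.

```python
import io
import xml.etree.ElementTree as ET


def collect_namespaces(xml_bytes: bytes) -> dict:
    """Hypothetical helper: map each declared prefix to its namespace URI."""
    nsmap = {}
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    for _event, (prefix, uri) in events:
        # Keep the first declaration; later redefinitions are ignored here.
        nsmap.setdefault(prefix, uri)
    return nsmap


SAMPLE = (
    b'<trw:root xmlns:trw="http://www.trw.com/20131231">'
    b"<trw:row>a</trw:row><trw:row>b</trw:row></trw:root>"
)

nsmap = collect_namespaces(SAMPLE)
root = ET.fromstring(SAMPLE)

# The prefix in the path is resolved through the collected map.
rows = root.findall("trw:row", nsmap)
print([row.text for row in rows])
```

Keeping only the first declaration of a prefix is a simplification: it assumes no prefix is rebound deeper in the document, which matches the question's premise but not XML in general.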

Falko
Jan Vlcinsky
items = tree.xpath("*[local-name(.) = 'a_tag_goes_here']")

did the job. On top of that, I had to walk through the resulting list of items manually to apply my other filtering functions.
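
The same idea can be sketched without XPath by comparing only the part of each tag after the namespace. This uses the standard library instead of lxml; local_name and find_by_local_name are hypothetical helper names and the sample document is made up. Unlike the XPath above, which looks at the children of the context node, this version walks the whole tree.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a leading '{namespace}' from a tag, if present."""
    return tag.rpartition("}")[2]


def find_by_local_name(root: ET.Element, name: str) -> list:
    """Return root and descendants whose tag matches name in any namespace."""
    return [
        el
        for el in root.iter()
        if isinstance(el.tag, str) and local_name(el.tag) == name
    ]


SAMPLE = (
    '<trw:root xmlns:trw="http://www.trw.com/20131231" '
    'xmlns:x="urn:example:other">'
    "<trw:row>a</trw:row><x:row>b</x:row><trw:cell>c</trw:cell>"
    "</trw:root>"
)

matches = find_by_local_name(ET.fromstring(SAMPLE), "row")
print([el.text for el in matches])
```

Note the trade-off: matching by local name also picks up same-named elements from unrelated namespaces (the x:row element in the sample), so further filtering may still be needed.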

NoIdeaHowToFixThis