
Now that MiFID II has been introduced, I would like to analyze the LEI data from GLEIF.

The data is in XML format, but boy! It is hard to parse.

I tried the code (see below), which freezes my machine almost completely and then gives this error:

AttributeError: no such child: {http://www.gleif.org/data/schema/leidata/2016}pyval. 

The structure of the data is really simple, but the files are large. Nevertheless, I think the main culprit is the use of special characters, i.e. the colon "lei:" in the tags, see this shortened example:

<lei:LEIData xmlns:gleif="http://www.gleif.org/concatenated-file/header-extension/2.0" xmlns:lei="http://www.gleif.org/data/schema/leidata/2016">
    <lei:LEIRecords>
        <lei:LEIRecord>
            <lei:LEI>029200137F2K8AH5C573</lei:LEI>
        </lei:LEIRecord>
    </lei:LEIRecords>
</lei:LEIData>

Any help?

I posted a larger sample on pastebin: https://pastebin.com/UbrM5mVp after having eliminated the lei:LEIHeader section.

See the Python code below (borrowed from Wes McKinney's book, Section 6.1):

import pandas as pd
from lxml import objectify
path = '20180104-gleif-concatenated-file-lei2.xml'
data = []
parsed = objectify.parse(open(path))
root = parsed.getroot()

for child in root:
    print(child.tag, child.attrib)

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        el_data[child.tag] = child.pyval
    data.append(el_data)

perf = pd.DataFrame(data)
Martien Lubberink
  • Problem most likely is it's namespaced XML. You could use ElementTree https://stackoverflow.com/a/14853417/1207049 – marekful Jan 06 '18 at 01:32
  • An old and rather poorly formatted blog post about how to do it using SAX and minidom https://www.xml.com/pub/a/2003/03/10/python.html – marekful Jan 06 '18 at 01:34
  • The error has to include more information than that, such as the line number in the XML where the error occurred. What line causes the error? – Jim Garrison Jan 06 '18 at 05:08
  • Jim, that attribute error is the only error that my console shows. – Martien Lubberink Jan 06 '18 at 21:28
  • I tried some of your suggestions, but I am not there yet. Perhaps it would help to somehow include the corresponding XML schema definition, which can be downloaded separately here: https://www.gleif.org/en/about-lei/common-data-file-format/lei-cdf-format/lei-cdf-format-version-2-1 – Martien Lubberink Jan 06 '18 at 21:35

1 Answer


Had the same problem:

import xmltodict

# Read the whole file and convert it to nested dictionaries
with open('LEIStuff.xml', encoding="utf8") as datafile:
    doc = xmltodict.parse(datafile.read())

for row in doc['LEIData']['LEIRecords']['LEIRecord']:
    try:
        # LEI number
        print(row['LEI'], '-----------------------------')
    except Exception:
        pass

    try:
        # Legal name
        print(row['Entity']['LegalName']['#text'])
        print('===[LegalName]===')
    except Exception:
        pass

[...] and so on for all the fields you would like to use. The field names are listed on this website:

https://www.gleif.org/en/about-lei/common-data-file-format/lei-cdf-format/lei-cdf-format-version-2-1

EDIT:

AttributeError: no such child: the LEI XML data is not always the same. The structure of each record is a bit different; sometimes a record has an extra field or is missing one. I sorted it out using try/except in Python, but it has to be done per field, not per row (I think).
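The per-field handling described above could also be written as a small lookup helper instead of one try/except per field. This is only a sketch: `get_field` is a hypothetical helper name, and the two sample records (including the legal name and the second LEI) are made up to imitate what xmltodict returns.

```python
def get_field(record, *keys, default=None):
    """Walk nested dicts key by key; return `default` if any level is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up records shaped like xmltodict output (assumption, not real GLEIF data)
records = [
    {'LEI': '029200137F2K8AH5C573',
     'Entity': {'LegalName': {'@xml:lang': 'en', '#text': 'Example Corp'}}},
    {'LEI': 'EXAMPLE0000000000002'},  # this record has no Entity at all
]

# One flat row per record; missing fields become None instead of raising
rows = [
    {'LEI': get_field(r, 'LEI'),
     'LegalName': get_field(r, 'Entity', 'LegalName', '#text')}
    for r in records
]
```

A list of flat dictionaries like `rows` can then be passed to pd.DataFrame, with the missing fields showing up as empty values.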

EDIT ver2:

When you read the data from the concatenated LEI file, the structure is different from that in the small files.

You then have to use:

for row in doc['lei:LEIData']['lei:LEIRecords']['lei:LEIRecord']:

And when you access the fields that have #text:

LegalName = row['lei:Entity']['lei:LegalName']['#text']

The rest of the fields follow the same pattern: Legal_PostalCode = row['lei:Entity']['lei:LegalAddress']['lei:PostalCode']
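If you want the same field-access code to work for both the small files and the concatenated file, one option is to strip the "lei:" prefix from the keys after parsing. A minimal sketch, assuming the parsed document is a plain nested dict/list structure as xmltodict produces; `strip_prefix` is a hypothetical helper, and the legal name and postal code in the sample are made up.

```python
def strip_prefix(node, prefix='lei:'):
    """Recursively remove `prefix` from dict keys in xmltodict-style output."""
    if isinstance(node, dict):
        return {
            (key[len(prefix):] if key.startswith(prefix) else key):
                strip_prefix(value, prefix)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [strip_prefix(item, prefix) for item in node]
    return node  # strings and None are returned unchanged


# Made-up stand-in for the parsed concatenated file (assumption)
doc = {'lei:LEIData': {'lei:LEIRecords': {'lei:LEIRecord': [
    {'lei:LEI': '029200137F2K8AH5C573',
     'lei:Entity': {
         'lei:LegalName': {'#text': 'Example Corp'},
         'lei:LegalAddress': {'lei:PostalCode': '00000'}}},
]}}}

clean = strip_prefix(doc)
first = clean['LEIData']['LEIRecords']['LEIRecord'][0]
```

Note that this builds a second copy of the whole structure, so it makes the memory footprint larger, not smaller.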

Hope that helps.

Lemon
  • Your answer is almost only code and links. Please add some more information on how your code solves the question asked. – NOhs Mar 14 '18 at 12:11
  • It works, kind of, but leads to a memory overflow problem. As a result, I cannot use your solution. – Martien Lubberink Jun 20 '18 at 08:09
  • I know - you could use pagination to split the XML. I haven't implemented it, as I am using a machine with 80GB of memory. – Lemon Jun 21 '18 at 12:21
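For the memory problem raised in the comments, one alternative to loading the whole file is to stream it with the standard library's xml.etree.ElementTree.iterparse and discard each record once it has been read. This swaps xmltodict for a different technique and is only a sketch: `iter_leis` is a hypothetical function name, the namespace URI is the one from the sample in the question, and the in-memory sample stands in for the real file.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:lei declaration in the question's sample
LEI_NS = 'http://www.gleif.org/data/schema/leidata/2016'


def iter_leis(source):
    """Yield the LEI of each record while streaming through the file."""
    record_tag = f'{{{LEI_NS}}}LEIRecord'  # ElementTree spells tags as {uri}name
    lei_tag = f'{{{LEI_NS}}}LEI'
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == record_tag:
            yield elem.findtext(lei_tag)
            # Drop the record's children and text; the now-empty element
            # itself stays attached to its parent.
            elem.clear()


# Small in-memory stand-in for the real file (assumption)
sample = b'''<lei:LEIData xmlns:lei="http://www.gleif.org/data/schema/leidata/2016">
  <lei:LEIRecords>
    <lei:LEIRecord>
      <lei:LEI>029200137F2K8AH5C573</lei:LEI>
    </lei:LEIRecord>
  </lei:LEIRecords>
</lei:LEIData>'''

leis = list(iter_leis(io.BytesIO(sample)))
```

For the real data, a file path or an open binary file can be passed in place of the BytesIO object, and further fields can be read with findtext before the call to clear.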