1

I've been asked to look in to the possibility of converting XML to Python, so that the XML can be phased out of the current process & they'd be replaced with new python scripts.

Currently the XML files are used by python & ASP and look essentially like this;

<?xml version="1.0" encoding="UTF-8"?>
<script>
    <stage id="stage1" nextStage="stage2">
        <initialise>
            <variable id="year" value="2012" type="Integer"/>
            <executeMethod class="engineclass_date" method="getTodayAsString">
                <arguments>
                    <variable id="someVar"/>
                </arguments>
                <return>
                    <variable id="todaysDate"/>
                </return>
            </executeMethod>

Thanks to Pepr I've not got a parser/compiler in the works that produces Python code from the XML. It still needs a lot of work to handle all the possible elements, but it works, and hopefully others can learn from this like I am!

Still need to find a way to write a separate file for each stage of the script in order for the indent to function properly and handle errors properly.

Community
  • 1
  • 1
markwalker_
  • 12,078
  • 7
  • 62
  • 99

2 Answers2

3

If you have an appropriate XML Schema for these XML files, there are tools like GenerateDS which will generate python classes based on them.

That would allow you to load all of the files in memory, and have them as objects. How you then store that data elsewhere...well, you don't say what you want to do, but you could do anything you usually can with python.

Marcin
  • 48,559
  • 18
  • 128
  • 201
  • Yeah I saw that earlier, and it'd make life easier. But there aren't any. I'll have to have a look at generating or building one. – markwalker_ Apr 25 '12 at 12:14
  • I am not that good in transformations based on a XML Schema; however, the XML files contain the code that must be converted to Python. In my opinon, it is not possible with any XML Schema. But I may be wrong. – pepr Apr 25 '12 at 12:20
  • @pepr You misunderstand completely. XSD is not a transformation language, and I am not suggesting that you perform any transformations manually. GenerateDS takes XSD as input, and outputs python code. XSD describes the structure of XML files, which is what allows it to be used to generate code. You then use that code to load your xml into objects in memory. – Marcin Apr 25 '12 at 13:05
  • @marksweb If you don't already have XSD it will probably not save you much effort. – Marcin Apr 25 '12 at 13:07
  • I found a site to gen an XSD easily enough but the .py generated from the schema is overly complicated & it seems generateDS is a little overkill. I didn't look too far in to it but I saw a lot that will cause problems like defs with python module names & bits to create XML not create python from XML. – markwalker_ Apr 25 '12 at 13:27
  • @marksweb I wouldn't recommend it for general work, but I think it is a potentially viable way to get your xml into objects before you do something else with them. Of course, if it's proving fiddly, then it's not likely to be worth the time. – Marcin Apr 25 '12 at 13:29
  • 1
    @Marcin: I did not think about XML Schema as about transformation language. I am aware of that they are used to describe a structure and the rules of the class of XML documents. The problem is that knowing the structure may help to generate a rich data-structure equivalent via Python classes, but it does not contain the knowledge on how the script code should be reorganized and replaced by equivalents to generate the algorithms in the Python language. – pepr Apr 25 '12 at 13:47
  • @pepr Of course not, XSD has no algorithmic content, but then neither does XML data. No-one suggested it could generate a programme to do anything other than load data in an appropriate format. – Marcin Apr 25 '12 at 14:02
  • @Marcin: > No-one suggested it could generate a programme... Well, the question is not stated perfectly, but it can be deduced that this is the core of the question. – pepr Apr 25 '12 at 14:05
  • @pepr Really? The question contains nothing whatsoever about algorithmics. – Marcin Apr 25 '12 at 14:06
  • @pepr GenerateDS, used successfully would put OP in exactly the same position they would be if they followed your suggestions: they would have loaded some xml into objects. You don't get into any algorithmics either. – Marcin Apr 25 '12 at 14:07
  • @Marcin: OK. But for example the `import engineclass_date` is not generated from the XML. It is based on some extra knowledge. Also, have a look at the snippet published by marksweb at http://pastebin.com/vRRxfWiA There is some example of a code. – pepr Apr 25 '12 at 14:15
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/10505/discussion-between-pepr-and-marcin) – pepr Apr 25 '12 at 16:30
1

Use the standard xml.etree.Element tree to extract the information from XML to Python objects (or the more enhanced third party lxml with the same API).

I recommend to read the Mark Pilrim's Dive Into Python 3, Chapter 12. XML (http://getpython3.com/diveintopython3/xml.html).

Here is the core of how the parser/compiler could be written. The idea is to go recursively through the elements, collect the neccessary information and output the code when it is possible:

import xml.etree.ElementTree as ET

class Parser:

    def __init__(self):
        self.output_list = []  # collected output lines
        self.il = 0            # indentation level


    def __iter__(self):
        return iter(self.output_list)


    def out(self, s):
        '''Output the indented string to the output list.'''
        self.output_list.append('    ' * self.il + s)


    def indent(self, num=1):
        '''Increase the indentation level.'''
        self.il += num


    def dedent(self, num=1):
        '''Decrease the indentation level.'''
        self.il -= num


    def parse(self, elem):
        '''Call the parser of the elem.tag name.

        The tag name appended to "parse_" and then the name of that
        function is called.  If the function is not defined, then
        self.parse_undefined() is called.'''

        fn_name = 'parse_' + elem.tag
        try:
            fn = getattr(self, fn_name)
        except AttributeError:
            fn = self.parse_undefined
        return fn(elem)


    def loop(self, elem):
        '''Helper method to loop through the child elements.'''
        for e in elem:
            self.parse(e)


    def parseXMLfile(self, fname):
        '''Reads the XML file and starts parsing from the root element.'''
        tree = ET.parse(fname)
        script = tree.getroot()
        assert script.tag == 'script'
        self.parse(script)


    ###################### ELEMENT PARSERS #######################

    def parse_undefined(self, elem):
        '''Called for the element that has no parser defined.'''
        self.out('PARSING UNDEFINED for ' + elem.tag)


    def parse_script(self, elem):
        self.loop(elem)


    def parse_stage(self, elem):
        self.out('')
        self.out('Parsing the stage: ' + elem.attrib['id'])
        self.indent()
        self.loop(elem)
        self.dedent()


    def parse_initialise(self, elem):
        self.out('')
        self.out('#---------- ' + elem.tag + ' ----------')
        self.loop(elem)


    def parse_variable(self, elem):
        tt = str   # default type
        if elem.attrib['type'] == 'Integer': 
            tt = int
        # elif ... etc for other types

        # Conversion of the value to the type because of the later repr().
        value = tt(elem.attrib['value'])  

        id_ = elem.attrib['id']

        # Produce the line of the output.
        self.out('{0} = {1}'.format(id_, repr(value)))


    def parse_execute(self, elem):
        self.out('')
        self.out('#---------- ' + elem.tag + ' ----------')
        self.loop(elem)


    def parse_if(self, elem):
        assert elem[0].tag == 'condition'
        condition = self.parse(elem[0])
        self.out('if ' + condition + ':')
        self.indent()
        self.loop(elem[1:])
        self.dedent()


    def parse_condition(self, elem):
        assert len(elem) == 0
        return elem.text


    def parse_then(self, elem):
        self.loop(elem)


    def parse_else(self, elem):
        self.dedent()
        self.out('else:')
        self.indent()
        self.loop(elem)


    def parse_error(self, elem):
        assert len(elem) == 0
        errorID = elem.attrib.get('errorID', None)
        fieldID = elem.attrib.get('fieldID', None)
        self.out('error({0}, {1})'.format(errorID, fieldID))


    def parse_setNextStage(self, elem):
        assert len(elem) == 0
        self.out('setNextStage --> ' + elem.text)


if __name__ == '__main__':
    parser = Parser()
    parser.parseXMLfile('data.xml')
    for s in parser:
        print s

When used with the data pasted here http://pastebin.com/vRRxfWiA, the script produces the following output:

Parsing the stage: stage1

    #---------- initialise ----------
    taxyear = 2012
    taxyearstart = '06/04/2012'
    taxyearend = '05/04/2013'
    previousemergencytaxcode = '747L'
    emergencytaxcode = '810L'
    nextemergencytaxcode = '810L'

    ...

    maxLimitAmount = 0
    PARSING UNDEFINED for executeMethod
    if $maxLimitReached$ == True:
        employeepayrecord = 'N'
        employeepayrecordstate = '2'
    else:
        employeepayrecordstate = '1'
    gender = ''
    genderstate = '1'
    title = ''
    forename = ''
    forename2 = ''
    surname = ''
    dob = ''
    dobinvalid = ''

    #---------- execute ----------
    if $dobstring$ != "":
        validDOBCheck = 'False'
        PARSING UNDEFINED for executeMethod
        if $validDOBCheck$ == False:
            error(224, dob)
        else:
            minimumDOBDate = ''
            PARSING UNDEFINED for executeMethod
            validDOBCheck = 'False'
            PARSING UNDEFINED for executeMethod
            if $validDOBCheck$ == False:
                error(3007161, dob)
        if $dobstring$ == "01/01/1901":
            error(231, dob)
    else:
        error(231, dob)

Parsing the stage: stage2

    #---------- initialise ----------
    address1 = ''

    ...
pepr
  • 20,112
  • 15
  • 76
  • 139
  • Thank You. You have indeed understood correctly. Thats a great help. I don't yet know why the `import` is required, but this is an app of near 970k LoC so I'm just going with the info I've been given so far as this is a bit 'in at the deep end' for me. I've posted a snippet of the first few stages in an XML file; http://pastebin.com/vRRxfWiA – markwalker_ Apr 25 '12 at 11:18
  • @marksweb: It seems that it would be the best to edit your question, edit the stage4 part of the snippet (snip the repated things of the same kind, improve the indentation for better readability), and show how it should be translated. I do not know the APSX. What the *initialise* exactly means, etc. Do you know how the equivalent Python code should look like? – pepr Apr 25 '12 at 12:23
  • This doesn't have to be written in python, just create python. So go through the xml file line, by line and create in a new file the python equivalent for that line of xml. – markwalker_ Apr 25 '12 at 13:32
  • @marksweb: Actually, Python is a good language to write the transformation of the program. The problem is that the apsx-script code must be transformed to the Python equivalent code for the server. It cannot be done line-by-line, because both languages differ. You must understand what the original did (the semantics) and how to write the Python equivalent that does the same. Say, assigning the variable is a simple fragment. The executemethod is not that simple. – pepr Apr 25 '12 at 14:10
  • Right I see what you're getting at now. If the ASP side is ignored, all I'm trying to do is create the python and ignore anything else that might use the XML. – markwalker_ Apr 25 '12 at 14:28
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/10532/discussion-between-marksweb-and-pepr) – markwalker_ Apr 26 '12 at 08:31
  • I am just curious why this was downvoted. Any constructive comment? – pepr May 11 '12 at 06:32