I am new to Python and am trying to modify some XML configuration files on my local system.

Input: I have an XML file (say Test.xml) with the following content.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <JavaHost xmlns="SomeInfo/v1.1">
        <Domain>
           <MessageProcessor>
              <!-- This comment should not be removed and all formating should be untouched -->
              <SocketTimeout>500</SocketTimeout>
           </MessageProcessor>
            <!-- This comment should not be removed and all formating should be untouched -->
           <Composer>
                <SocketTimeout>5000</SocketTimeout>
                <Enabled>true</Enabled>
           </Composer> 
       </Domain>
    </JavaHost>

WHAT I WANT TO ACHIEVE: I want to achieve the two things below:

Part 1: I want to change the value of the SocketTimeout tag (only the one under the Composer tag) to 60, and also add a comment like this (e.g. Changed this value to reduce SocketTimeout). The file Test.xml should then look as below:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <JavaHost xmlns="SomeInfo/v1.1">
   <Domain>
       <MessageProcessor>
          <!-- This comment should not be removed and all formating should be untouched -->
          <SocketTimeout>500</SocketTimeout>
       </MessageProcessor>
        <!-- This comment should not be removed and all formating should be untouched -->
       <Composer>
       <!-- Changed this value to reduce SocketTimeout -->
            <SocketTimeout>60</SocketTimeout>
            <Enabled>true</Enabled>
       </Composer>
   </Domain>
</JavaHost>

Part 2: In the file Test.xml, I want to add a new tag under the Domain tag as below:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <JavaHost xmlns="SomeInfo/v1.1">
   <Domain>
       <MessageProcessor>
          <!-- This comment should not be removed and all formating should be untouched -->
          <SocketTimeout>500</SocketTimeout>
       </MessageProcessor>
       <!-- This comment should not be removed and all formating should be untouched -->
       <Composer>
       <!-- Changed this value to reduce SocketTimeout -->
            <SocketTimeout>60</SocketTimeout>
            <Enabled>true</Enabled>
       </Composer>
       <New_tag>
       <!-- New Tag -->
            <Enabled>true</Enabled>
       </New_tag>
   </Domain>
</JavaHost>

That’s all I want :)

WHAT I HAVE TRIED:

To achieve this task I considered the options below:

Minidom/ElementTree/lxml: From what I have read, these remove comments in the file and also change the formatting of the file.

Regex: Doesn’t remove comments and doesn’t disturb formatting. Hence I opted for regex, and below is what I started with, but it is not working :(

import os, re
# set the working directory 
os.chdir('C:\\Users\\Dell\\Desktop\\')

# open the source file and read it
fh = open('C:\\Users\\Dell\\Desktop\\Test.xml', 'r')
subject = fh.read()
fh.close()

pattern = re.compile(r"\[<Composer>\].*?\[/<Composer>\]")
#Replace
result = pattern.sub(lambda match: match.group(0).replace('<SocketTimeout>500</SocketTimeout>','<SocketTimeout>60</SocketTimeout>') ,subject)

# write the file
f_out = open('C:\\Users\\Dell\\Desktop\\Test.xml', 'w')
f_out.write(result)
f_out.close()

Any ideas for implementing what I want, or corrections to my mistakes, would be highly appreciated. I am new to Python, but I will try my best to work on the suggestions.

Rahul
  • Show the lxml code you tried which you claim does not preserve comments and formatting. – ekhumoro Feb 25 '18 at 18:15
  • Thanks ekhumoro for the quick reply. I haven't tried lxml, but what I gathered from various questions and answers here on Stack Overflow is that it removes comments. If this is not the case, would it be possible for you to share a code snippet to achieve this? I am new to this and learning, but I can try editing and writing the code if I get some short code like an algorithm. – Rahul Feb 25 '18 at 18:34
  • related: https://stackoverflow.com/a/1732454/2730399 – Azsgy Feb 25 '18 at 19:00
  • No probs ekhumoro – Rahul Feb 25 '18 at 19:38

2 Answers

This is not exactly what you wanted, but it's close. For one thing, avoid regex for XML, HTML and similar processing like the plague. At the same time, don't be surprised if you find occasional 'challenges' in using products like lxml.

I think, this time, I found a bug.

from lxml import etree
tree = etree.parse('shivam.xml')
element_to_change = tree.xpath('.//Composer/SocketTimeout')[0]
print(element_to_change)
element_to_change.text='60'
comment_will_follow_this = tree.xpath('.//Composer')[0]
print(comment_will_follow_this)
comment = etree.Comment('This did not work')
comment_will_follow_this.append(comment)

comment = etree.Comment('Changed this value to reduce SocketTimeout')
element_to_change.addprevious(comment)

tree.write('see_it.xml', pretty_print=True)
  • I used xpath to find the element to change, and the places in the file to receive the comments.
  • The append method adds a comment or other element to a given element as its last child. In this case the 'This did not work' comment landed immediately before the closing </Composer> tag, with no line break, rather than where I wanted it.
  • addprevious, however, did add the comment in the desired location. The fly in the ointment is that it does not place a line break between the comment and the next XML element.

Here's the resulting file. Incidentally, you will note that the original comments are intact.

<JavaHost>
    <Domain>
       <MessageProcessor>
          <!-- This comment should not be removed and all formating should be untouched -->
          <SocketTimeout>500</SocketTimeout>
       </MessageProcessor>
        <!-- This comment should not be removed and all formating should be untouched -->
       <Composer>
            <!--Changed this value to reduce SocketTimeout--><SocketTimeout>60</SocketTimeout>
            <Enabled>true</Enabled>
       <!--This did not work--></Composer> 
   </Domain>
</JavaHost>
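
For the two loose ends above (the default namespace in the edited input file, and the missing line break after the inserted comment), here is a minimal sketch of one way to handle both. It makes several assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without third-party packages, it needs Python 3.8 or later for insert_comments, and the SAMPLE string and the update_timeout function name are made up for illustration. The idea is to give the new comment a tail equal to the whitespace that already precedes the first child, so the comment ends up on its own line.

```python
import xml.etree.ElementTree as ET

NS = "SomeInfo/v1.1"  # namespace URI from the question's sample file

# Hypothetical, shortened stand-in for Test.xml
SAMPLE = """<JavaHost xmlns="SomeInfo/v1.1">
    <Domain>
        <!-- keep me -->
        <Composer>
            <SocketTimeout>5000</SocketTimeout>
            <Enabled>true</Enabled>
        </Composer>
    </Domain>
</JavaHost>"""


def update_timeout(xml_text, new_value, note):
    """Set Composer/SocketTimeout and put a comment on the line above it."""
    # insert_comments=True (Python 3.8+) keeps existing comments in the tree
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)

    # Elements in a default namespace must be addressed through a prefix map
    ns = {"c": NS}
    composer = root.find(".//c:Composer", ns)
    timeout = composer.find("c:SocketTimeout", ns)
    timeout.text = str(new_value)

    comment = ET.Comment(" " + note + " ")
    # composer.text is the newline and indentation before the first child;
    # reusing it as the comment's tail puts the next element on a new line
    comment.tail = composer.text
    composer.insert(list(composer).index(timeout), comment)

    # Serialize with xmlns="..." instead of a generated ns0: prefix
    ET.register_namespace("", NS)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(update_timeout(SAMPLE, 60, "Changed this value to reduce SocketTimeout"))
```

lxml elements also carry text and tail attributes, so the same whitespace trick should carry over to the code above. Note that this sketch does not reproduce the XML declaration of the original file.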
Bill Bell
  • Awesome, thanks Bill. This should help me even if it doesn't do everything I need. I am trying this now and it should be good to start with. I will post updates. – Rahul Feb 25 '18 at 19:43
  • Hello Bill, this code worked fine, but I am facing the issues below. Apologies for not providing the exact input file earlier; I have edited it now. 1. The input file actually contains a declaration, i.e. (xml_declaration = True, encoding='UTF-8', standalone="no"). This was not getting saved, so I modified the write call as below and it worked: tree.write('see_it.xml', pretty_print=True, xml_declaration = True, encoding='UTF-8', standalone="no") – Rahul Feb 25 '18 at 21:48
  • 2. The input file also contains namespace info as below: <JavaHost xmlns="SomeInfo/v1.1">. If I delete this and make it only <JavaHost>, the code works fine, but I don't want to delete it. Is there any other way to do this? – Rahul Feb 25 '18 at 21:49
  • 3. Comments are getting added without a new line: <!--Changed this value to reduce SocketTimeout--><SocketTimeout>60</SocketTimeout>. For this I modified the code as below: parser = etree.XMLParser(remove_blank_text=True) tree = etree.parse('C:\\Users\\Dell\\Desktop\\Test.xml',parser) But this removes all blank lines, which I don't want. – Rahul Feb 25 '18 at 21:49
  • And I wanted the comment to be added with a new line: <!--Changed this value to reduce SocketTimeout--> New_Line_Here <SocketTimeout>60</SocketTimeout> – Rahul Feb 25 '18 at 21:58

Since you used modify and XML in the same sentence, consider XSLT, the special-purpose language designed to transform XML files. Python's lxml can run XSLT 1.0 scripts, as can external processors or other languages that Python can call from the command line. So, XSLT is portable! Even more, Python can pass parameters to XSLT in case 50 needs to be dynamically adjusted, very similar to parameters in the other special-purpose language, SQL, for which Python has many APIs.

Specifically, XSLT provides the <xsl:comment> instruction and can append or rewrite nodes in a tree. Also, as the comments and the linked answer point out, it is highly ill-advised to use regex on XML or HTML documents, since they are not regular languages. Hence, DOM libraries such as Python's etree, lxml, and minidom are preferred, as is XSLT, which adheres to W3C standards.

XSLT (save as .xsl file, a special .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="node()|@*|comment()">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*|comment()"/>
     </xsl:copy>
    </xsl:template>

    <xsl:template match="Domain">
     <xsl:copy>       
       <xsl:apply-templates select="*|@*|comment()"/>
       <New_tag>
         <xsl:comment>New Tag</xsl:comment>
         <Enabled>true</Enabled>
       </New_tag>
     </xsl:copy>
    </xsl:template>

    <xsl:template match="Composer">
     <xsl:copy>
       <xsl:comment>Changed this value to reduce SocketTimeout</xsl:comment>
       <SocketTimeout>50</SocketTimeout>
       <xsl:apply-templates select="Enabled"/>
     </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Python

import lxml.etree as et

# LOAD XML AND XSLT
dom = et.parse('Input.xml') 
xslt = et.parse('XSLT_Script.xsl')

# TRANSFORM SOURCE
transform = et.XSLT(xslt)
newdom = transform(dom)

# OUTPUT TO CONSOLE
print(newdom)

# OUTPUT TO FILE
with open('Output.xml', 'wb') as f:
    f.write(newdom)

Output

<JavaHost>
  <Domain>
    <MessageProcessor>
      <!-- This comment should not be removed and all formating should be untouched -->
      <SocketTimeout>500</SocketTimeout>
    </MessageProcessor>
    <!-- This comment should not be removed and all formating should be untouched -->
    <Composer>
      <!--Changed this value to reduce SocketTimeout-->
      <SocketTimeout>50</SocketTimeout>
      <Enabled>true</Enabled>
    </Composer>
    <New_tag>
      <!--New Tag-->
      <Enabled>true</Enabled>
    </New_tag>
  </Domain>
</JavaHost>
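
The parameter passing mentioned at the top can be sketched as follows. This is only an illustration under stated assumptions: lxml must be installed, the inline XML and XSL strings, the parameter name timeout, the prefix j and the function name run_transform are all made up here, and the stylesheet is a cut-down variant of the one above that declares the question's namespace so that the match pattern still applies when the input carries xmlns="SomeInfo/v1.1".

```python
import lxml.etree as et

# Hypothetical, shortened stand-in for the namespaced input file
XML = b"""<JavaHost xmlns="SomeInfo/v1.1">
  <Domain>
    <Composer>
      <SocketTimeout>5000</SocketTimeout>
      <Enabled>true</Enabled>
    </Composer>
  </Domain>
</JavaHost>"""

# The j prefix is bound to the document's default namespace so that
# match patterns can address the namespaced elements
XSL = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:j="SomeInfo/v1.1">
  <xsl:param name="timeout" select="'60'"/>

  <!-- identity template: copy everything, comments included -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- only the SocketTimeout under Composer is rewritten -->
  <xsl:template match="j:Composer/j:SocketTimeout">
    <xsl:comment> Changed this value to reduce SocketTimeout </xsl:comment>
    <xsl:copy><xsl:value-of select="$timeout"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>"""


def run_transform(timeout):
    """Apply the stylesheet, passing the timeout value in as an XSLT parameter."""
    transform = et.XSLT(et.fromstring(XSL))
    # strparam quotes the value so it is passed as a string, not as XPath
    result = transform(et.fromstring(XML), timeout=et.XSLT.strparam(str(timeout)))
    return str(result)


if __name__ == "__main__":
    print(run_transform(60))
```

Because the identity template copies text nodes as they are, this variant leaves the existing whitespace alone instead of re-indenting the file.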
Parfait