I'm writing a script in PowerShell that modifies an XML file. I haven't really worked with XML before so I'm muddling my way through this. I figured out how to load, search, insert elements and attributes, and save changes. The problem I'm running into is when I save the changes, the formatting of the original XML file isn't preserved. The namespace lines in particular are getting butchered pretty badly. For some additional context, I'm working with the Apache Tomcat web.xml file located in the conf folder.

Below is a snippet of the original XML file with some lines omitted to give you an idea of the original formatting:

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<web-app xmlns="http://xmlns.jcp.org/xml/ns/javaee"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://xmlns.jcp.org/xml/ns/javaee
                      http://xmlns.jcp.org/xml/ns/javaee/web-app_4_0.xsd"
  version="4.0">
  


  <!-- ======================== Introduction ============================== -->
  <!-- This document defines default values for *all* web applications      -->
  <!-- loaded into this instance of Tomcat.  As each application is         -->
  <!-- deployed, this file is processed, followed by the                    -->
  <!-- "/WEB-INF/web.xml" deployment descriptor from your own               -->
  <!-- applications.                                                        -->
  <!--                                                                      -->
  <!-- WARNING:  Do not configure application-specific resources here!      -->
  <!-- They should go in the "/WEB-INF/web.xml" file in your application.   -->


  <!-- ================== Built In Servlet Definitions ==================== -->


  <!-- The default servlet for all web applications, that serves static     -->
  <!-- resources.  It processes all requests that are not mapped to other   -->
  <!-- servlets with servlet mappings (defined either here or in your own   -->
  <!-- web.xml file).  This servlet supports the following initialization   -->
  <!-- parameters (default values are in square brackets):                  -->

    <servlet>
        <servlet-name>default</servlet-name>
        <servlet-class>org.apache.catalina.servlets.DefaultServlet</servlet-class>
        <init-param>
            <param-name>debug</param-name>
            <param-value>0</param-value>
        </init-param>
        <init-param>
            <param-name>listings</param-name>
            <param-value>false</param-value>
        </init-param>
        <load-on-startup>1</load-on-startup>
    </servlet>

</web-app>

I tried a bunch of things with different results, none of which are satisfactory. I want to insert some elements underneath the <web-app> element and save the changes, keeping the original formatting as it appears in the snippet above. The problem isn't related to the edits I'm making: even when I load the XML file and immediately save it, the formatting gets butchered in one way or another, depending on what I tried.

I've been using the .NET System.Xml.XmlDocument class to load and save the XML file. I also tried using the XmlWriter and XmlWriterSettings classes.

Here are the things I've tried and the results.

Code:

$webXml = New-Object System.Xml.XmlDocument
$xmlPath = "C:\path\to\web.xml"
$webXml.Load($xmlPath)
$webXml.Save($xmlPath)

Result:

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<web-app xmlns="http://xmlns.jcp.org/xml/ns/javaee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://xmlns.jcp.org/xml/ns/javaee&#xD;&#xA;                      http://xmlns.jcp.org/xml/ns/javaee/web-app_4_0.xsd" version="4.0">
  <!-- ======================== Introduction ============================== -->
  <!-- This document defines default values for *all* web applications      -->
  <!-- loaded into this instance of Tomcat.  As each application is         -->
  <!-- deployed, this file is processed, followed by the                    -->
  <!-- "/WEB-INF/web.xml" deployment descriptor from your own               -->
  <!-- applications.                                                        -->
  <!--                                                                      -->
  <!-- WARNING:  Do not configure application-specific resources here!      -->
  <!-- They should go in the "/WEB-INF/web.xml" file in your application.   -->
  <!-- ================== Built In Servlet Definitions ==================== -->
  <!-- The default servlet for all web applications, that serves static     -->
  <!-- resources.  It processes all requests that are not mapped to other   -->
  <!-- servlets with servlet mappings (defined either here or in your own   -->
  <!-- web.xml file).  This servlet supports the following initialization   -->
  <!-- parameters (default values are in square brackets):                  -->
  <servlet>
    <servlet-name>default</servlet-name>
    <servlet-class>org.apache.catalina.servlets.DefaultServlet</servlet-class>
    <init-param>
      <param-name>debug</param-name>
      <param-value>0</param-value>
    </init-param>
    <init-param>
      <param-name>listings</param-name>
      <param-value>false</param-value>
    </init-param>
    <load-on-startup>1</load-on-startup>
  </servlet>
</web-app>

Code:

$webXml = New-Object System.Xml.XmlDocument
$webXml.PreserveWhitespace = $true
$xmlPath = "C:\path\to\web.xml"
$webXml.Load($xmlPath)
$webXml.Save($xmlPath)

Result:

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<web-app xmlns="http://xmlns.jcp.org/xml/ns/javaee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://xmlns.jcp.org/xml/ns/javaee&#xD;&#xA;                      http://xmlns.jcp.org/xml/ns/javaee/web-app_4_0.xsd" version="4.0">


  <!-- ======================== Introduction ============================== -->
  <!-- This document defines default values for *all* web applications      -->
  <!-- loaded into this instance of Tomcat.  As each application is         -->
  <!-- deployed, this file is processed, followed by the                    -->
  <!-- "/WEB-INF/web.xml" deployment descriptor from your own               -->
  <!-- applications.                                                        -->
  <!--                                                                      -->
  <!-- WARNING:  Do not configure application-specific resources here!      -->
  <!-- They should go in the "/WEB-INF/web.xml" file in your application.   -->


  <!-- ================== Built In Servlet Definitions ==================== -->


  <!-- The default servlet for all web applications, that serves static     -->
  <!-- resources.  It processes all requests that are not mapped to other   -->
  <!-- servlets with servlet mappings (defined either here or in your own   -->
  <!-- web.xml file).  This servlet supports the following initialization   -->
  <!-- parameters (default values are in square brackets):                  -->
    <servlet>
        <servlet-name>default</servlet-name>
        <servlet-class>org.apache.catalina.servlets.DefaultServlet</servlet-class>
        <init-param>
            <param-name>debug</param-name>
            <param-value>0</param-value>
        </init-param>
        <init-param>
            <param-name>listings</param-name>
            <param-value>false</param-value>
        </init-param>
        <load-on-startup>1</load-on-startup>
    </servlet>

</web-app>

Code:

$xmlDoc = New-Object System.Xml.XmlDocument
$xmlPath = "C:\path\to\web.xml"
$xmlDoc.Load($xmlPath)

# Create a new instance of XmlWriterSettings and set the properties
$settings = New-Object System.Xml.XmlWriterSettings
$settings.Indent = $true
$settings.IndentChars = "`t"
$settings.NewLineChars = "`r`n"
$settings.NewLineHandling = [System.Xml.NewLineHandling]::Replace
$settings.Encoding = [System.Text.Encoding]::UTF8

# Create a new instance of XmlWriter and save the document
$writer = [System.Xml.XmlWriter]::Create($xmlPath, $settings)
$xmlDoc.Save($writer)
$writer.Flush()
$writer.Close()

Result:

<?xml version="1.0" encoding="utf-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<web-app xmlns="http://xmlns.jcp.org/xml/ns/javaee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://xmlns.jcp.org/xml/ns/javaee&#xD;&#xA;                      http://xmlns.jcp.org/xml/ns/javaee/web-app_4_0.xsd" version="4.0">
    <!-- ======================== Introduction ============================== -->
    <!-- This document defines default values for *all* web applications      -->
    <!-- loaded into this instance of Tomcat.  As each application is         -->
    <!-- deployed, this file is processed, followed by the                    -->
    <!-- "/WEB-INF/web.xml" deployment descriptor from your own               -->
    <!-- applications.                                                        -->
    <!--                                                                      -->
    <!-- WARNING:  Do not configure application-specific resources here!      -->
    <!-- They should go in the "/WEB-INF/web.xml" file in your application.   -->
    <!-- ================== Built In Servlet Definitions ==================== -->
    <!-- The default servlet for all web applications, that serves static     -->
    <!-- resources.  It processes all requests that are not mapped to other   -->
    <!-- servlets with servlet mappings (defined either here or in your own   -->
    <!-- web.xml file).  This servlet supports the following initialization   -->
    <!-- parameters (default values are in square brackets):                  -->
    <servlet>
        <servlet-name>default</servlet-name>
        <servlet-class>org.apache.catalina.servlets.DefaultServlet</servlet-class>
        <init-param>
            <param-name>debug</param-name>
            <param-value>0</param-value>
        </init-param>
        <init-param>
            <param-name>listings</param-name>
            <param-value>false</param-value>
        </init-param>
        <load-on-startup>1</load-on-startup>
    </servlet>
</web-app>

Code:

# Load the XML document
$xmlDoc = New-Object System.Xml.XmlDocument
$xmlPath = "C:\path\to\web.xml"
$xmlDoc.Load($xmlPath)

# Create an XmlWriterSettings object with specified settings
$settings = New-Object System.Xml.XmlWriterSettings
$settings.Indent = $true
$settings.IndentChars = " "
$settings.NewLineChars = [Environment]::NewLine
$settings.NewLineHandling = [System.Xml.NewLineHandling]::Replace
$settings.OmitXmlDeclaration = $true
$settings.Encoding = New-Object System.Text.UTF8Encoding($false)

# Save the XML document with the specified settings
$writer = [System.Xml.XmlWriter]::Create($xmlPath, $settings)
$xmlDoc.Save($writer)
$writer.Close()

Result:

<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<web-app xmlns="http://xmlns.jcp.org/xml/ns/javaee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://xmlns.jcp.org/xml/ns/javaee&#xD;&#xA;                      http://xmlns.jcp.org/xml/ns/javaee/web-app_4_0.xsd" version="4.0">
 <!-- ======================== Introduction ============================== -->
 <!-- This document defines default values for *all* web applications      -->
 <!-- loaded into this instance of Tomcat.  As each application is         -->
 <!-- deployed, this file is processed, followed by the                    -->
 <!-- "/WEB-INF/web.xml" deployment descriptor from your own               -->
 <!-- applications.                                                        -->
 <!--                                                                      -->
 <!-- WARNING:  Do not configure application-specific resources here!      -->
 <!-- They should go in the "/WEB-INF/web.xml" file in your application.   -->
 <!-- ================== Built In Servlet Definitions ==================== -->
 <!-- The default servlet for all web applications, that serves static     -->
 <!-- resources.  It processes all requests that are not mapped to other   -->
 <!-- servlets with servlet mappings (defined either here or in your own   -->
 <!-- web.xml file).  This servlet supports the following initialization   -->
 <!-- parameters (default values are in square brackets):                  -->
 <servlet>
  <servlet-name>default</servlet-name>
  <servlet-class>org.apache.catalina.servlets.DefaultServlet</servlet-class>
  <init-param>
   <param-name>debug</param-name>
   <param-value>0</param-value>
  </init-param>
  <init-param>
   <param-name>listings</param-name>
   <param-value>false</param-value>
  </init-param>
  <load-on-startup>1</load-on-startup>
 </servlet>
</web-app>

I'm stumped. Any help would be appreciated!

Ken White
SyncErr0r
  • Better question: why do you care? What process do you have that depends on insignificant whitespace? – Charlieface Mar 19 '23 at 01:31
  • Can you clarify *got butchered in one way or another*? Is it just the line breaks in comments? – Parfait Mar 19 '23 at 20:46
  • @Charlieface I care because humans will read it, and when the formatting is not preserved it may cause confusion, plus it looks ugly. – SyncErr0r Mar 20 '23 at 04:45
  • @Parfait see result examples. I'm looking to keep the original formatting for the reasons mentioned above. – SyncErr0r Mar 20 '23 at 04:45

3 Answers


The - opt-in - insignificant-whitespace-preservation features of both System.Xml.XmlDocument ([xml] in PowerShell) and System.Xml.Linq.XDocument:

  • DO apply to whitespace between and inside elements.

  • Do NOT apply to the whitespace inside an element's opening tag, i.e. do not apply to whitespace between attributes.

Therefore, a multi-line opening tag such as:

<web-app xmlns="http://xmlns.jcp.org/xml/ns/javaee"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://xmlns.jcp.org/xml/ns/javaee
                      http://xmlns.jcp.org/xml/ns/javaee/web-app_4_0.xsd"
  version="4.0">

invariably becomes a single-line opening tag with:

  • attributes separated with a single space
  • newlines in attribute values escaped as &#xD;&#xA; (if the input file has Windows-format CRLF newlines) or &#xA; (if it has Unix-format LF newlines):

<web-app xmlns="http://xmlns.jcp.org/xml/ns/javaee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://xmlns.jcp.org/xml/ns/javaee&#xD;&#xA;                      http://xmlns.jcp.org/xml/ns/javaee/web-app_4_0.xsd" version="4.0">
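
The two rules above can be checked on a small in-memory document. This is only a sketch: the sample markup (a root element with attributes a and b and one child) is made up for illustration and is not taken from web.xml.

```powershell
# Hypothetical sample: a two-line opening tag followed by an indented child element.
$sample = "<root a=`"1`"`n      b=`"2`">`n    <child>x</child>`n</root>"

$doc = [xml]::new()
$doc.PreserveWhitespace = $true   # must be set before loading
$doc.LoadXml($sample)

# Serialize the document again and inspect what survived the round trip.
$roundTripped = $doc.OuterXml
$roundTripped
```

The whitespace in front of the child element should survive, while the line break between the two attributes should come back as a single space.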

In a pinch, you can perform your own plain-text post-processing, which needless to say, is brittle, likely document-specific, and merely tries to recreate the original whitespace - which presumes that its format is known.

That said, with a specific document format such as yours it may work (using [xml] == System.Xml.XmlDocument):

# Note: Be sure to use a *full path*, because .NET's working dir.
#       usually differs from PowerShell's.
$xmlPath = "C:\path\to\web.xml"

# Load the document, with insignificant whitespace preserved.
($webXml = [xml]::new()).PreserveWhitespace = $true
$webXml.Load($xmlPath)

# ... modify it

# ... and save it.
$webXml.Save($xmlPath)

# Post-processing:
# "Re-pretty" the <web-app> element.
# Note: Be sure to match the actual encoding of the file.
$nl = [Environment]::NewLine
(Get-Content -Encoding utf8 $xmlPath) |
  ForEach-Object {
    if ($_ -match '^<web-app ') {
      $_ -replace '(?<=" )', "$nl  " -replace '(&#xD;)?&#xA;', $nl
    } else {
      $_
    }
  } |
  Set-Content -Encoding utf8 $xmlPath

Note:

  • The post-processing assumes that the input file uses the platform-native newline format, and the resulting file will use that format.

    • Even if that assumption doesn't hold, that should usually not present a problem; ensuring that the input file's original newline format is preserved is possible, but requires more work.
  • In Windows PowerShell (unlike in PowerShell (Core) 7+), Set-Content -Encoding utf8 invariably creates a UTF-8 file with BOM.

    • This shouldn't be a problem for standards-compliant XML processors, but if it is, see this answer for how to create BOM-less UTF-8 in Windows PowerShell.
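
As a rough sketch of the BOM-less route (not necessarily what the linked answer does), .NET can write the file directly with a UTF8Encoding instance constructed with $false, which works in Windows PowerShell as well. The temporary path and the two sample lines below are placeholders for illustration only.

```powershell
# Placeholder target file and content, for illustration only.
$path  = [System.IO.Path]::GetTempFileName()
$lines = '<?xml version="1.0" encoding="UTF-8"?>', '<root />'

# $false = do not emit a byte-order mark.
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)
[System.IO.File]::WriteAllLines($path, [string[]]$lines, $utf8NoBom)

# Inspect the first bytes: a BOM would show up as EF BB BF.
$firstBytes = [System.IO.File]::ReadAllBytes($path)[0..2]
($firstBytes | ForEach-Object { $_.ToString('X2') }) -join ' '
```

In the post-processing pipeline, the lines produced by ForEach-Object could be collected into a variable and passed to WriteAllLines in place of the Set-Content call. As with the rest of the script, the path should be a full path.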
mklement0
  • I think the question is actually a duplicate with [Powershell saving XML and preserving format](https://stackoverflow.com/q/8160613/1701026), I was just about to close it... – iRon Mar 18 '23 at 08:47
  • Thanks, @iRon - I hadn't seen the other one. They're clearly closely related, and one of the answers mentions the specific problem as an aside, but given the distinct focus of this question - preserve _all_ insignificant whitespace, including _between attributes_ - I think it deserves its own answers - not least to see if there are _solutions_. Also, I've improved the answer. – mklement0 Mar 18 '23 at 13:16
  • Yes, I want to preserve all formatting. People will be reading and editing the file by hand potentially. The original file is formatted in an easy to read way because it's intended to be edited by hand. I have to make the same edits to a bunch of machines, so that is why I'm being particular. – SyncErr0r Mar 20 '23 at 04:51
  • @mklement0 Thank you for your input. I will try what you suggested tomorrow. – SyncErr0r Mar 20 '23 at 04:52

The setting you want is

$settings.NewLineOnAttributes = $true

See this dotnetfiddle which does the same thing in C#.
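
A minimal PowerShell sketch of that setting, using a made-up in-memory document rather than web.xml. Two things to keep in mind: NewLineOnAttributes only takes effect when Indent is also $true, and it puts every attribute (including the first) on its own line, so it may not reproduce the original layout exactly.

```powershell
# Hypothetical sample document, for illustration only.
$doc = [xml]'<root a="1" b="2"><child>x</child></root>'

$settings = New-Object System.Xml.XmlWriterSettings
$settings.Indent              = $true   # required for NewLineOnAttributes to have any effect
$settings.NewLineOnAttributes = $true
$settings.OmitXmlDeclaration  = $true   # keeps the sample output short

# Write to a StringBuilder instead of a file so the result can be inspected.
$sb     = New-Object System.Text.StringBuilder
$writer = [System.Xml.XmlWriter]::Create($sb, $settings)
try     { $doc.Save($writer) }
finally { $writer.Close() }

$result = $sb.ToString()
$result
```

Newlines inside attribute values, such as the one in xsi:schemaLocation, are still written as character references with this approach.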

Charlieface

After more experimentation, I settled on the solution mklement0 provided and made some tweaks:

$xmlPath = "C:\path\to\web.xml"
$newLine = [Environment]::NewLine
$xmlnsPattern = '\s+xmlns\s*=\s*""\s*'

(Get-Content -Path $xmlPath -Encoding utf8NoBOM) | ForEach-Object {
    if ($_ -match '^<web-app ') {
        $_ -replace '\s(?<=" )', "$newLine  " `
           -replace '(&#xD;)?&#xA;', $newLine `
           -replace 'version="4.0">', "version=`"4.0`">$newLine"
    } elseif ($_ -match $xmlnsPattern) {
        $_ -replace $xmlnsPattern, ''
    } else {
        $_
    }
} | Set-Content -Path $xmlPath -Encoding utf8NoBOM -Force

Aside from some minor cosmetic changes, the tweaks I made remove the single whitespace character left over after post-processing. A new line is inserted after the opening <web-app> tag to maintain legibility after I insert a new element node. The last change removes the annoying xmlns="" that appears in the opening tag of the new element. It appears because I use the ImportNode method to add my new element node from an XML fragment. AFAIK, there is no way to include the document namespace on the imported fragment. (If there is, I'd like to know!) I could have included the namespace info in the fragment, but it's ugly and takes more lines of code.
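
One possible alternative to ImportNode, sketched here under assumptions: if the new elements are built with the CreateElement overload that takes a namespace URI, they belong to the document's default namespace and xmlns="" should not be emitted. The element names filter and filter-name and the value exampleFilter are hypothetical stand-ins for whatever is actually being inserted.

```powershell
# Minimal stand-in for web.xml with the same default namespace.
$doc = [xml]'<web-app xmlns="http://xmlns.jcp.org/xml/ns/javaee"></web-app>'

# Reuse the root element's namespace for every element that gets created.
$ns = $doc.DocumentElement.NamespaceURI

# Hypothetical elements; substitute the real ones.
$filter = $doc.CreateElement('filter', $ns)
$name   = $doc.CreateElement('filter-name', $ns)
$name.InnerText = 'exampleFilter'

[void]$filter.AppendChild($name)
[void]$doc.DocumentElement.AppendChild($filter)

$serialized = $doc.OuterXml
$serialized
```

The trade-off is that the structure is built node by node instead of being parsed from a fragment string, which takes more code for larger fragments.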

A quick note: utf8NoBOM for Get-Content and Set-Content encoding is not available in PS 5.1.

SyncErr0r