Removing nodes from a large XML file using PowerShell

Question

I need to take an XML file and remove a bunch of unnecessary nodes. Because the file I've been supplied with is around 1.6 GB, it's not really feasible to use something like XmlDocument.Load, as it would be very resource heavy.

Given this, I have been trying to solve my issue using both $reader = [System.Xml.XmlReader]::Create($path) and $writer = [System.Xml.XmlWriter]::Create("C:\test\123.xml")

In order to try and remove unnecessary items I tried the following:

# Set the path to your XML file
$path = "C:\test\test.xml"

# Create an XmlReader object to read the file
$reader = [System.Xml.XmlReader]::Create($path)

# Create an XmlWriter object to write the modified XML
$writer = [System.Xml.XmlWriter]::Create("C:\test\123.xml")

# Create a namespace manager and add the namespace prefix and URI
$nsManager = New-Object System.Xml.XmlNamespaceManager($reader.NameTable)
$nsManager.AddNamespace("g", "http://base.google.com/ns/1.0")

# Loop through the XML and remove unwanted nodes
while ($reader.Read()) {
    if ($reader.NodeType -eq "Element") {
      if ($reader.LocalName -eq "Item") {
         # Enter the Item element and loop through its child nodes
            $itemDepth = $reader.Depth
            while ($reader.Read() -and $reader.Depth -gt $itemDepth) {
                Write-Output $reader.LocalName
                # Remove unwanted child nodes of Item element
                if ($reader.NodeType -eq "Element" -and $reader.LocalName -eq "description") {
                    Write-Output Skip
                    $reader.Skip()
                } else {
                    # Write the node to the output file
                    $writer.WriteNode($reader, $false)
                }
            }
      } else{
        $writer.WriteNode($reader, $false)
      }
    }
}

# Clean up
$reader.Close()
$writer.Close()

This approach was maybe 50% of the way there, but the issue I have is that when the parent node is written, it also writes all the children. The inner logic does work but if I remove the outer else it does not create the root of the document so I get an error about invalid XML.

As you'll see below it essentially gets to <channel> and copies everything in between.

For reference I have included a scaled down version of the XML file I've been using.

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0" xmlns:c="http://base.google.com/cns/1.0">
    <channel>
        <title>Title</title>
        <link>https://site.test</link>
        <description date="2023-03-07 12:15:08">Some description of my feed.</description>
        <item>
            <g:id>1234-5678-9876</g:id>
            <title>Title</title>
            <description>Description</description>
            <link></link>
            <g:price>146.00 GBP</g:price>
            <g:sale_price>48.70 GBP</g:sale_price>
            <g:google_product_category>Clothing</g:google_product_category>
            <g:product_type>Clothing</g:product_type>
            <g:brand>Jayley</g:brand>
            <g:condition>new</g:condition>
            <g:age_group>Adult</g:age_group>
            <g:color>Lilac</g:color>
            <g:gender>Female</g:gender>
            <g:pattern>Striped</g:pattern>
            <g:size>One Size</g:size>
            <g:item_group_id>5f5a22dbb7c91</g:item_group_id>
            <g:custom_label_0>Womens</g:custom_label_0>
            <g:shipping>
                <g:country>GB</g:country>
                <g:service>Standard Delivery</g:service>
                <g:price>1.99 GBP</g:price>
            </g:shipping>
            <c:count type="string">1</c:count>
        </item>
        <item>
            ...
        </item>
    </channel>
</rss>

Also, for reference, if I remove the outer else you can see it does loop through the children but then the XML is invalid.

So, based on your example you just want `Description` be removed and everything else should be copied as-is? — zett42, Mar 07 '23 at 23:18
Yes, I literally just want to prune some redundant information. — Jesse Luke Orange, Mar 08 '23 at 09:33

score 3 · Answer 1 · answered Mar 07 '23 at 22:26

I can't help with the technology you're using, but you could do it with a streaming XSLT 3.0 transformation like this:

<xsl:transform version="3.0"
               xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:mode streamable="yes" on-no-match="shallow-copy"/>
  <xsl:template match="item/description"/>
</xsl:transform>

Michael Kay


Nice, but AFAIK .NET (and PowerShell) only support XSLT 1.0 out-of-the-box. This [answer](https://stackoverflow.com/a/1533114/7571258) lists two external libraries that support XSLT 3.0. – zett42 Mar 08 '23 at 01:32
Microsoft are 20 years behind the curve with their XML technology. If you're doing anything serious with XML on Microsoft platforms, you need to think out-of-the-box. – Michael Kay Mar 08 '23 at 11:16

zett42 · Accepted Answer · 2023-03-09T14:33:54.173

The problem is that $writer.WriteNode($reader, $false) processes the current element of the reader recursively. It advances the reader position past the current element.

So WriteNode() is not suitable for elements that should only be partially copied from the input to the output XML. Instead, use the more specific XmlWriter methods WriteStartElement, WriteStartAttribute, WriteString and WriteEndAttribute to build output elements piece by piece.
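The recursive behaviour of WriteNode() can be observed on a tiny in-memory document. The following is only a minimal sketch: the sample XML and all variable names are made up for illustration and are not part of the original script.

```powershell
# Made-up sample document, small enough to inspect by eye
$xml = '<root><item><a>1</a><b>2</b></item><tail/></root>'

$sb = [Text.StringBuilder]::new()
$settings = [Xml.XmlWriterSettings]::new()
$settings.OmitXmlDeclaration = $true   # keep the output free of an XML declaration

$reader = [Xml.XmlReader]::Create( [IO.StringReader]::new( $xml ) )
$writer = [Xml.XmlWriter]::Create( $sb, $settings )

try {
    # Position the reader on the <item> start tag
    $null = $reader.ReadToFollowing( 'item' )

    # A single call copies <item> together with all of its children
    $writer.WriteNode( $reader, $false )
    $writer.Flush()

    $copied        = $sb.ToString()
    $positionAfter = $reader.Name   # the reader has moved past </item>
}
finally {
    $reader.Dispose()
    $writer.Dispose()
}

$copied          # <item><a>1</a><b>2</b></item>
$positionAfter   # tail
```

Because the children are already written by the time WriteNode() returns, there is no point at which an individual child could still be filtered out.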

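This example removes description elements that are children of item. The names in the code are capitalized, but they still match the lowercase names in the sample because PowerShell's -eq operator is case-insensitive for strings.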

$inputPath  = 'input.xml'
$outputPath = 'output.xml'

# Create absolute, native paths for .NET API (which doesn't respect PowerShell's current directory)
$fullInputPath = Convert-Path -LiteralPath $inputPath
$fullOutputPath = (New-Item $outputPath -ItemType File -Force).FullName

$reader = $writer = $null

# Hashtable that stores the path segments that lead to the current element
$elementPath = @{}

try {
    # Create an XmlReader for the input file
    $reader = [Xml.XmlReader]::Create( $fullInputPath )

    # Create an XmlWriter for the output file
    $writer = [Xml.XmlWriter]::Create( $fullOutputPath )

    # Read first node (XML declaration)
    $null = $reader.Read()

    while( -not $reader.EOF ) {

        if( $reader.NodeType -eq [Xml.XmlNodeType]::Element ) {

            # Keep track of where we are in the element tree
            $elementPath[ $reader.Depth ] = $reader.Name

            # If current element is 'Description' and its parent is 'Item', skip it
            if( $reader.Name -eq 'Description' -and $elementPath[ $reader.Depth - 1 ] -eq 'Item' ) {
                # Skip current element
                $reader.Skip()

                # Skip any whitespace after element to avoid empty line in output
                while( -not $reader.EOF -and $reader.NodeType -eq [Xml.XmlNodeType]::Whitespace ) {
                    $reader.Skip()
                }   

                continue
            }

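            # Remember whether this is a self-closing element such as <link/>.
            # This must be checked before moving to the attributes.
            $isEmptyElement = $reader.IsEmptyElement

            # Write the start tag of current element
            $writer.WriteStartElement( $reader.Prefix, $reader.LocalName, $reader.NamespaceUri )
            
            if( $reader.HasAttributes ) {
                # Write the attributes of current element
                while( $reader.MoveToNextAttribute() ) {
                    $writer.WriteStartAttribute( $reader.Prefix, $reader.LocalName, $reader.NamespaceUri )
                    $writer.WriteString( $reader.Value )
                    $writer.WriteEndAttribute()
                }                
            }

            # The reader reports no EndElement node for a self-closing element, so close it here
            if( $isEmptyElement ) {
                $writer.WriteEndElement()
            }

            # Read next node
            $null = $reader.Read()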
        }
        else {
            # If NodeType is EndElement, it writes the end tag.
            # Otherwise it copies any non-element node. 
            # Advances reader position as well!
            $writer.WriteNode( $reader, $false )
        }
    }    
}
finally {
    # Cleanup
    $reader, $writer | ForEach-Object Dispose
}

Thanks very much for this, it didn't occur to me that you can build out the top and bottom and just filter the middle. Using your example I made a version that exactly fits my needs. I'll post it as an answer. — Jesse Luke Orange, Mar 09 '23 at 23:32

Jesse Luke Orange · Answer 3 · 2023-03-10T10:04:00.033

Using the answer provided by @zett42 I made a version that deals with multiple unwanted nodes. Obviously I'm assuming in this example that everything is three levels deep but this could be adjusted.

Anyway, this is what I ended up with.

$timer = Measure-Command {
    $inputPath  = 'C:\test\test.xml'
    $outputPath = 'C:\test\test99.xml'

    $fullInputPath = Convert-Path -LiteralPath $inputPath
    $fullOutputPath = (New-Item $outputPath -ItemType File -Force).FullName

    $reader = $writer = $null

    try {
        # Create an XmlReader for the input file
        $reader = [Xml.XmlReader]::Create( $fullInputPath )

        # Create an XmlWriter for the output file
        $writer = [Xml.XmlWriter]::Create( $fullOutputPath )

        # Read first node (XML declaration)
        $null = $reader.Read()

        while( -not $reader.EOF ) {

            if( $reader.NodeType -eq [Xml.XmlNodeType]::Element ) {

                # Define an array of node names to skip
                $skipNodes = @(
                    'g:condition',
                    'g:material',
                    'g:age_group',
                    'g:gender',
                    'g:pattern',
                    'g:mpn',
                    'g:shipping',
                    'g:custom_label_0',
                    'g:custom_label_1',
                    'g:custom_label_2',
                    'g:custom_label_3',
                    'g:custom_label_4',
                    'g:custom_label_5',
                    'g:country',
                    'g:service',
                    'g:promotion_id',
                    'g:product_highlight',
                    'c:shopping_spend',
                    'c:fs_data_opti',
                    'c:fs_date_of_birth',
                    'c:fs_data_original_id',
                    'c:fs_data_original_title',
                    'c:sales_feature',
                    'c:stock',
                    'c:google_product_name',
                    'c:count',
                    'c:order_number'
                )

                # Should the current element be removed from the output?
                if ($reader.Depth -eq 3 -and $skipNodes.Contains($reader.Name)) {
                    $reader.Skip()

                    # Skip any whitespace after the removed element to avoid an empty line in the output
                    while( -not $reader.EOF -and $reader.NodeType -eq [Xml.XmlNodeType]::Whitespace ) {
                        $reader.Skip()
                    }

                    continue
                }

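                # Remember whether this is a self-closing element (check before moving to the attributes)
                $isEmptyElement = $reader.IsEmptyElement

                $writer.WriteStartElement( $reader.Prefix, $reader.LocalName, $reader.NamespaceUri )
                
                if( $reader.HasAttributes ) {
                    # Copy attributes
                    while( $reader.MoveToNextAttribute() ) {
                        $writer.WriteStartAttribute( $reader.Prefix, $reader.LocalName, $reader.NamespaceUri )
                        $writer.WriteString( $reader.Value )
                        $writer.WriteEndAttribute()
                    }                
                }

                # A self-closing element has no EndElement node, so close it here
                if( $isEmptyElement ) {
                    $writer.WriteEndElement()
                }

                # Read next node
                $null = $reader.Read()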
            }
            else {
                # Copy any non-element node. Advances reader position as well!
                $writer.WriteNode( $reader, $false )
            }
        }    
    }
    finally {
        # Cleanup
        $reader, $writer | ForEach-Object Dispose
    }
}

"Elapsed time: $($timer.TotalSeconds) seconds"

If anyone has any feedback I'm all ears.

For reference this managed to reduce the file I had originally by about 60% and took around 120 seconds to run.
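The fixed depth of 3 could be relaxed by remembering the element name at each depth, as the accepted answer does, and comparing the parent's name instead. The following is a minimal sketch under stated assumptions: the function name Remove-XmlChildElement and its parameters are hypothetical, it works on a string and is therefore only meant for small test inputs without an XML declaration, and for a large file the reader and writer would be created from file paths as in the scripts above.

```powershell
# Hypothetical helper for illustration only; not part of the original answers.
function Remove-XmlChildElement {
    param(
        [Parameter(Mandatory)] [string]   $Xml,         # small XML sample as a string
        [Parameter(Mandatory)] [string]   $ParentName,  # qualified name of the parent, e.g. 'item'
        [Parameter(Mandatory)] [string[]] $SkipNames    # qualified names of the children to drop
    )

    $skip = [Collections.Generic.HashSet[string]] $SkipNames
    $elementPath = @{}   # maps depth -> name of the most recent element at that depth
    $reader = $writer = $null
    $sb = [Text.StringBuilder]::new()
    $settings = [Xml.XmlWriterSettings]::new()
    $settings.OmitXmlDeclaration = $true

    try {
        $reader = [Xml.XmlReader]::Create( [IO.StringReader]::new( $Xml ) )
        $writer = [Xml.XmlWriter]::Create( $sb, $settings )
        $null = $reader.Read()

        while( -not $reader.EOF ) {
            if( $reader.NodeType -eq [Xml.XmlNodeType]::Element ) {
                $elementPath[ $reader.Depth ] = $reader.Name
                $parent = $elementPath[ $reader.Depth - 1 ]

                # -ceq compares case-sensitively, as XML names are case-sensitive
                if( $parent -ceq $ParentName -and $skip.Contains( $reader.Name ) ) {
                    $reader.Skip()
                    continue
                }

                # Must be read before moving to the attributes
                $isEmptyElement = $reader.IsEmptyElement

                $writer.WriteStartElement( $reader.Prefix, $reader.LocalName, $reader.NamespaceURI )
                while( $reader.MoveToNextAttribute() ) {
                    $writer.WriteAttributeString( $reader.Prefix, $reader.LocalName, $reader.NamespaceURI, $reader.Value )
                }

                # Self-closing elements have no EndElement node, so close them here
                if( $isEmptyElement ) { $writer.WriteEndElement() }

                $null = $reader.Read()
            }
            else {
                # Copies end tags and any non-element node, and advances the reader
                $writer.WriteNode( $reader, $false )
            }
        }

        $writer.Flush()
        $sb.ToString()
    }
    finally {
        if( $reader ) { $reader.Dispose() }
        if( $writer ) { $writer.Dispose() }
    }
}

# Usage: drops <description> only where its parent is <item>
$sample = '<rss><channel><description>feed</description><item id="1"><title>T</title><description>D</description><link/></item></channel></rss>'
$result = Remove-XmlChildElement -Xml $sample -ParentName 'item' -SkipNames 'description'
```

Matching on the parent name keeps the channel-level description intact, regardless of how deep the item elements sit in the tree.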

Version 2 with HashSet

$timer = Measure-Command {
    $inputPath  = 'C:\test\test.xml'
    $outputPath = 'C:\test\test99.xml'

    $fullInputPath = Convert-Path -LiteralPath $inputPath
    $fullOutputPath = (New-Item $outputPath -ItemType File -Force).FullName

    $reader = $writer = $null

    # Define an array of node names to skip
    $skipNodes = @(
        'g:condition',
        'g:material',
        'g:age_group',
        'g:gender',
        'g:pattern',
        'g:mpn',
        'g:shipping',
        'g:custom_label_0',
        'g:custom_label_1',
        'g:custom_label_2',
        'g:custom_label_3',
        'g:custom_label_4',
        'g:custom_label_5',
        'g:country',
        'g:service',
        'g:promotion_id',
        'g:product_highlight',
        'c:shopping_spend',
        'c:fs_data_opti',
        'c:fs_date_of_birth',
        'c:fs_data_original_id',
        'c:fs_data_original_title',
        'c:sales_feature',
        'c:stock',
        'c:google_product_name',
        'c:count',
        'c:order_number'
    )

    # HashSet is faster for lookup.
    $skipNodesHash = [Collections.Generic.HashSet[string]] $skipNodes

    try {
        # Create an XmlReader for the input file
        $reader = [Xml.XmlReader]::Create( $fullInputPath )

        # Create an XmlWriter for the output file
        $writer = [Xml.XmlWriter]::Create( $fullOutputPath )

        # Read first node (XML declaration)
        $null = $reader.Read()

        while( -not $reader.EOF ) {

            if( $reader.NodeType -eq [Xml.XmlNodeType]::Element ) {

                # Should the current element be removed from the output?
                if ($reader.Depth -eq 3 -and $skipNodesHash.Contains($reader.Name)) {
                    $reader.Skip()

                    # Skip any whitespace after the removed element to avoid an empty line in the output
                    while( -not $reader.EOF -and $reader.NodeType -eq [Xml.XmlNodeType]::Whitespace ) {
                        $reader.Skip()
                    }

                    continue
                }

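                # Remember whether this is a self-closing element (check before moving to the attributes)
                $isEmptyElement = $reader.IsEmptyElement

                $writer.WriteStartElement( $reader.Prefix, $reader.LocalName, $reader.NamespaceUri )
                
                if( $reader.HasAttributes ) {
                    # Copy attributes
                    while( $reader.MoveToNextAttribute() ) {
                        $writer.WriteStartAttribute( $reader.Prefix, $reader.LocalName, $reader.NamespaceUri )
                        $writer.WriteString( $reader.Value )
                        $writer.WriteEndAttribute()
                    }                
                }

                # A self-closing element has no EndElement node, so close it here
                if( $isEmptyElement ) {
                    $writer.WriteEndElement()
                }

                # Read next node
                $null = $reader.Read()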
            }
            else {
                # Copy any non-element node. Advances reader position as well!
                $writer.WriteNode( $reader, $false )
            }
        }    
    }
    finally {
        # Cleanup
        $reader, $writer | ForEach-Object Dispose
    }
}

"Elapsed time: $($timer.TotalSeconds) seconds"

Great that my answer enabled you to build a solution on your own. Just one thing to try for performance improvement: Move the definition of `$skipNodes` before the loop and make it a [`HashSet`](https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.hashset-1?view=net-7.0) for faster lookup: `$skipNodes = [Collections.Generic.HashSet[string]] @('g:condition', 'g:material', …)`. — zett42, Mar 09 '23 at 23:46
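The lookup behaviour of the HashSet can be checked in isolation. This is a small sketch using an illustrative subset of the skip list. The second set, with a case-insensitive comparer, is an assumption about what one might want and is not something the original answers use.

```powershell
# Illustrative subset of the skip list
$skipNodes = @( 'g:condition', 'g:gender', 'c:count' )

# Casting the array builds the set once, before the read loop
$skipNodesHash = [Collections.Generic.HashSet[string]] $skipNodes

$hitExact     = $skipNodesHash.Contains( 'g:gender' )   # True: exact name
$hitOtherCase = $skipNodesHash.Contains( 'G:Gender' )   # False: the default comparer is case-sensitive
$hitMissing   = $skipNodesHash.Contains( 'title' )      # False: not in the list

# Optional variant (assumption): ignore case when matching names
$ignoreCase = [Collections.Generic.HashSet[string]]::new(
    [string[]] $skipNodes, [StringComparer]::OrdinalIgnoreCase )
$hitIgnoreCase = $ignoreCase.Contains( 'G:Gender' )     # True
```

The default case-sensitive comparison matches XML semantics, where element names that differ only in case are different names.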
