I need to take an XML file and remove a number of unnecessary nodes. The file I've been supplied with is around 1.6GB, so it's not really feasible to use something like XmlDocument.Load, as it would be very resource heavy.
Given this, I have been trying to solve the problem by streaming with $reader = [System.Xml.XmlReader]::Create($path) and $writer = [System.Xml.XmlWriter]::Create("C:\test\123.xml").
To remove the unnecessary items I tried the following:
# Set the path to your XML file
$path = "C:\test\test.xml"
# Create an XmlReader object to read the file
$reader = [System.Xml.XmlReader]::Create($path)
# Create an XmlWriter object to write the modified XML
$writer = [System.Xml.XmlWriter]::Create("C:\test\123.xml")
# Create a namespace manager and add the namespace prefix and URI
$nsManager = New-Object System.Xml.XmlNamespaceManager($reader.NameTable)
$nsManager.AddNamespace("g", "http://base.google.com/ns/1.0")
# Loop through the XML and remove unwanted nodes
while ($reader.Read()) {
    if ($reader.NodeType -eq "Element") {
        if ($reader.LocalName -eq "Item") {
            # Enter the Item element and loop through its child nodes
            $itemDepth = $reader.Depth
            while ($reader.Read() -and $reader.Depth -gt $itemDepth) {
                Write-Output $reader.LocalName
                # Remove unwanted child nodes of Item element
                if ($reader.NodeType -eq "Element" -and $reader.LocalName -eq "description") {
                    Write-Output Skip
                    $reader.Skip()
                } else {
                    # Write the node to the output file
                    $writer.WriteNode($reader, $false)
                }
            }
        } else {
            $writer.WriteNode($reader, $false)
        }
    }
}
# Clean up
$reader.Close()
$writer.Close()
This approach got me maybe 50% of the way there, but the issue is that when a parent node is written with WriteNode, all of its children are written along with it. The inner logic does work, but if I remove the outer else, the root of the document is never created, so I get an error about invalid XML.
As you'll see below, it essentially gets to <channel> and copies everything inside it.
For reference, here is a scaled-down version of the XML file I've been using.
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0" xmlns:c="http://base.google.com/cns/1.0">
  <channel>
    <title>Title</title>
    <link>https://site.test</link>
    <description date="2023-03-07 12:15:08">Some description of my feed.</description>
    <item>
      <g:id>1234-5678-9876</g:id>
      <title>Title</title>
      <description>Description</description>
      <link></link>
      <g:price>146.00 GBP</g:price>
      <g:sale_price>48.70 GBP</g:sale_price>
      <g:google_product_category>Clothing</g:google_product_category>
      <g:product_type>Clothing</g:product_type>
      <g:brand>Jayley</g:brand>
      <g:condition>new</g:condition>
      <g:age_group>Adult</g:age_group>
      <g:color>Lilac</g:color>
      <g:gender>Female</g:gender>
      <g:pattern>Striped</g:pattern>
      <g:size>One Size</g:size>
      <g:item_group_id>5f5a22dbb7c91</g:item_group_id>
      <g:custom_label_0>Womens</g:custom_label_0>
      <g:shipping>
        <g:country>GB</g:country>
        <g:service>Standard Delivery</g:service>
        <g:price>1.99 GBP</g:price>
      </g:shipping>
      <c:count type="string">1</c:count>
    </item>
    <item>
      ...
    </item>
  </channel>
</rss>
Also, for reference, if I remove the outer else, it does loop through the children, but the resulting XML is invalid.
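One direction that might avoid the "parent drags all its children along" behaviour is to never call WriteNode on an element at all, and instead copy each node shallowly: write the start tag and its attributes, then let the loop reach the children one by one. The following is only a minimal sketch under some assumptions, not a definitive implementation. The function name Remove-ItemDescription and its parameters are hypothetical, it assumes that only un-namespaced description elements directly under item should go, and it assumes layout whitespace is not significant (the reader drops it and the writer re-indents).

```powershell
# Hypothetical helper: streams $InPath to $OutPath, dropping every un-namespaced
# <description> that is a direct child of an <item>. Nothing is loaded into a DOM.
function Remove-ItemDescription {
    param(
        [Parameter(Mandatory)] [string] $InPath,
        [Parameter(Mandatory)] [string] $OutPath
    )

    $readerSettings = New-Object System.Xml.XmlReaderSettings
    $readerSettings.IgnoreWhitespace = $true   # assumption: layout whitespace is not significant
    $writerSettings = New-Object System.Xml.XmlWriterSettings
    $writerSettings.Indent = $true             # re-indent, since whitespace was dropped

    $reader = [System.Xml.XmlReader]::Create($InPath, $readerSettings)
    $writer = $null
    try {
        $writer = [System.Xml.XmlWriter]::Create($OutPath, $writerSettings)
        $itemDepth = -1                        # depth of the enclosing <item>, or -1 outside one
        $null = $reader.Read()
        while (-not $reader.EOF) {
            $type = $reader.NodeType
            if ($type -eq [System.Xml.XmlNodeType]::Element) {
                if ($itemDepth -ge 0 -and $reader.Depth -eq ($itemDepth + 1) -and
                    $reader.LocalName -ceq 'description' -and $reader.NamespaceURI -eq '') {
                    # Skip() consumes the whole subtree and leaves the reader on the
                    # node that follows it, so Read() must not be called again here.
                    $reader.Skip()
                    continue
                }
                if ($itemDepth -lt 0 -and $reader.LocalName -ceq 'item' -and $reader.NamespaceURI -eq '') {
                    $itemDepth = $reader.Depth
                }
                # Capture these before WriteAttributes moves the reader around.
                $depth   = $reader.Depth
                $isEmpty = $reader.IsEmptyElement
                # Shallow copy: start tag and attributes only, children come later.
                $writer.WriteStartElement($reader.Prefix, $reader.LocalName, $reader.NamespaceURI)
                $writer.WriteAttributes($reader, $true)
                if ($isEmpty) {
                    # <foo/> produces no EndElement node, so close it now.
                    $writer.WriteEndElement()
                    if ($depth -eq $itemDepth) { $itemDepth = -1 }
                }
            }
            elseif ($type -eq [System.Xml.XmlNodeType]::EndElement) {
                $writer.WriteFullEndElement()
                if ($reader.Depth -eq $itemDepth) { $itemDepth = -1 }
            }
            elseif ($type -eq [System.Xml.XmlNodeType]::Text)    { $writer.WriteString($reader.Value) }
            elseif ($type -eq [System.Xml.XmlNodeType]::CDATA)   { $writer.WriteCData($reader.Value) }
            elseif ($type -eq [System.Xml.XmlNodeType]::Comment) { $writer.WriteComment($reader.Value) }
            # Other node types (the XML declaration, processing instructions, ...) are
            # not copied in this sketch; the declaration is left to the writer.
            $null = $reader.Read()
        }
    }
    finally {
        if ($writer) { $writer.Close() }
        $reader.Close()
    }
}
```

With the paths from the question, it would be called as Remove-ItemDescription -InPath "C:\test\test.xml" -OutPath "C:\test\123.xml". The single flat loop is deliberate: because every Read() happens in one place, a Skip() can never be followed by an extra Read() that silently drops the next node, and the channel-level description is kept because the depth check only matches direct children of item.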