I am trying to remove a large amount of data from a huge XML file. I am using PowerShell to do this, and I was wondering whether it's even possible in an acceptable amount of time. The file contains 2.5m records and I want to remove every record whose category attribute is 'COMPANY'. Here is my current code:

$xml = [xml]'' 
$xml.Load("C:\New folder\untrimmed.xml")


$node = $xml.SelectSingleNode("//record[@category='COMPANY']")
while ($node -ne $null) {
    $node.ParentNode.RemoveChild($node)
    $node = $xml.SelectSingleNode("//record[@category='COMPANY']")

$xml.save("C:\New folder\trimmed.xml")

When this finishes, after an hour and a half, the trimmed file is BIGGER than the original. How can I do this in a better way? Is PowerShell not the right tool for the job here?

MR JACKPOT
  • Is the trimmed file using UTF-16 encoding? – vonPryz Apr 22 '20 at 13:06
  • The `While` loop isn't closed. Can you supply a [MCVE] with an `XML` example? – iRon Apr 22 '20 at 13:11
  • I would suggest using System.Xml.XmlReader and XmlWriter to read the elements and stream them out, except the ones that you want to filter. If I'm not mistaken, [xml] reads the whole file into memory. See this SO question for reference on how to do it: https://stackoverflow.com/questions/48102318/very-large-xml-files-in-powershell – Dmitry Apr 22 '20 at 13:12
  • @vonPryz it's UTF-8 – MR JACKPOT Apr 22 '20 at 13:56

1 Answer

Try the new Gizmo tool in Saxon 10.0.

java net.sf.saxon.Gizmo -s:"C:\New folder\untrimmed.xml"
/>delete //record[@category='COMPANY']
/>save C:\New folder\untrimmed.xml
/>quit

Caveat: I haven't tried it with filenames containing spaces.

Gizmo doesn't currently use streaming, unfortunately (we had a prototype that does, but it's not released), so you'll need a fair bit of memory for this to run.

If streaming is essential, you can do it with a streaming XSLT 3.0 stylesheet:

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
               version="3.0">
  <xsl:mode streamable="yes" on-no-match="shallow-copy"/>
  <xsl:template match="record[@category='COMPANY']"/>
</xsl:transform>
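
If you would rather stay in PowerShell, the XmlReader/XmlWriter idea suggested in the comments could be sketched roughly like this. It is an untested sketch and has nothing to do with Saxon. It assumes the records are elements named record with a category attribute (as in the XPath in the question), that the input is UTF-8 and has no DTD, and it simply reuses the paths from the question.

```powershell
# Paths reused from the question; adjust as needed.
$inPath  = 'C:\New folder\untrimmed.xml'
$outPath = 'C:\New folder\trimmed.xml'

$settings = New-Object System.Xml.XmlWriterSettings
$settings.Encoding = New-Object System.Text.UTF8Encoding($false)   # UTF-8 without BOM

$reader = [System.Xml.XmlReader]::Create($inPath)
$writer = [System.Xml.XmlWriter]::Create($outPath, $settings)
try {
    $null = $reader.Read()                     # move to the first node
    while (-not $reader.EOF) {
        $type = $reader.NodeType
        if ($type -eq [System.Xml.XmlNodeType]::Element) {
            # -ceq keeps the comparison case-sensitive, like the XPath test.
            if ($reader.LocalName -ceq 'record' -and
                $reader.GetAttribute('category') -ceq 'COMPANY') {
                # Skip() jumps past the record's end tag and leaves the
                # reader on the following node, so no Read() is needed.
                $reader.Skip()
                continue
            }
            # Shallow-copy the start tag and its attributes only.
            $isEmpty = $reader.IsEmptyElement
            $writer.WriteStartElement($reader.Prefix, $reader.LocalName, $reader.NamespaceURI)
            $writer.WriteAttributes($reader, $true)
            if ($isEmpty) { $writer.WriteEndElement() }
            $null = $reader.Read()
        }
        elseif ($type -eq [System.Xml.XmlNodeType]::EndElement) {
            $writer.WriteFullEndElement()
            $null = $reader.Read()
        }
        else {
            # Text, whitespace, comments, the XML declaration, etc.
            # WriteNode copies the node and advances the reader itself.
            $writer.WriteNode($reader, $true)
        }
    }
}
finally {
    if ($writer) { $writer.Close() }
    if ($reader) { $reader.Close() }
}
```

Only the current node is held in memory at any time, so memory use should not grow with the file size. The whitespace that followed each removed record is copied through unchanged, so the output may contain blank lines where records used to be.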
Michael Kay