
The script below works as expected and produces the desired output, but it takes a long time to process large XML files (2 GB and above). I'm calling on experts for suggestions on how to make it faster, whether by multithreading or some other technique, in a PowerShell script.

Reference post (for more on the logic of the script below): Parse XML to extract data with grouping in PowerShell

# Create XML object to load data into
$xml = New-Object -TypeName System.Xml.XmlDocument

# Load in XML file
$xml.Load("test.xml")

# Group XML child nodes by Priority
$groups = $xml.'ABC-FOF-PROCESS'.ChildNodes | Group-Object -Property PRIORITY

# Iterate groups and create PSCustomObject for each grouping
& {
    foreach ($group in $groups)
    {
        [PSCustomObject]@{
            PRIORITY = [int]$group.Name
            KEY = ($group.Group.KEY | Select-Object -Unique).Count
            HITS = $group.Count
        }
    }
} | Sort-Object -Property PRIORITY -Descending | Out-File -FilePath output.txt # pipe sorted output to a file

Output:

PRIORITY KEY HITS
-------- --- ----
       1   1    1
      -3   2    2
     -14   2    3

Sample XML (test.xml):

<ABC-FOF-PROCESS>
<H>
 <PRIORITY>-14</PRIORITY>
 <KEY>F637A146-3437AB82-BA659D4A-17AC7FBF</KEY>
</H>
<H>
 <PRIORITY>-14</PRIORITY>
 <KEY>F637A146-3437AB82-BA659D4A-17AC7FBF</KEY>
</H>
<H>
 <PRIORITY>-3</PRIORITY>
 <KEY>D6306210-CF424F11-8E2D3496-E6CE1CA7</KEY>
</H>
<H>
 <PRIORITY>1</PRIORITY>
 <KEY>D6306210-CF424F11-8E2D3496-E6CE1CA7</KEY>
</H>
<H>
 <PRIORITY>-3</PRIORITY>
 <KEY>4EFR02B4-ADFDAF12-3C123II2-ADAFADFD</KEY>
</H>
<H>
 <PRIORITY>-14</PRIORITY>
 <KEY>5D2702B2-ECE8F1FB-3CEC3229-5FE4C4BC</KEY>
</H>
</ABC-FOF-PROCESS>
Naga
  • How long does it take now? What kind of time would be acceptable? – vonPryz Apr 22 '20 at 06:50
  • It takes around 9 hours to process a 3 GB file using this script. The main objective is to get these statistics quickly, ideally within 15 minutes, rather than loading the XML into a DB and then using SQL to compute them (loading the file into the DB takes around 12 hours). – Naga Apr 22 '20 at 07:08
  • Convert the whole XML to a CSV with columns for priority and key. Load it into a database and do the summing in SQL. Don't use the XML data type within the database; it's seldom efficient. – vonPryz Apr 22 '20 at 07:47 (a PowerShell sketch of this route follows these comments)
  • @vonPryz I do not want to load millions of records into the DB just to get this statistic; I need the stats on the fly, which also reduces DB housekeeping. Until now we have been loading the XML into the DB, but unfortunately we can't convert XML to CSV due to software restrictions in place for this solution, even though that would be more efficient, as you said. – Naga Apr 22 '20 at 18:28
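
For reference, the CSV route vonPryz describes could be approximated entirely in PowerShell. A minimal sketch, assuming a hypothetical data.csv with priority and key columns already exists; Group-Object stands in for the SQL summing step:

# Hypothetical: summarise a priority,key CSV without a database
Import-Csv -Path data.csv |
    Group-Object -Property priority |
    ForEach-Object {
        [PSCustomObject]@{
            PRIORITY = [int]$_.Name
            KEY      = ($_.Group.key | Select-Object -Unique).Count
            HITS     = $_.Count
        }
    } | Sort-Object -Property PRIORITY -Descending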

2 Answers


If your XML has a fixed format, you can read the file line by line and adjust the results as you go.

It's not parallel, it's not as robust as using the XML parsing capabilities, and it will not win any beauty awards, but it should be pretty fast.

$hits = @{} # Hashtable containing number of hits per priority
$keys = @{} # Hashtable containing unique keys per priority
switch -Regex -File $env:temp\test.xml
{
    # e.g. " <PRIORITY>-14" - remember the current priority and count a hit for it
    '^\s+<PRIORITY>(?<priority>-?\d+)'
    {
        $currentPriority = $matches.Priority
        $hits[$currentPriority] = $hits[$currentPriority] + 1
        continue
    }
    # e.g. " <KEY>F637A146-..." - record the key under the current priority
    '^\s+<KEY>(?<key>[\w-]+)'
    {
        $currentKey = $matches.Key
        if ($null -eq $keys[$currentPriority]) { $keys[$currentPriority] = @{} }
        $keys[$currentPriority][$currentKey] = $null # value unused; inner hashtable acts as a set
    }
}

$hits.GetEnumerator() | ForEach-Object {
    [PSCustomObject]@{
        PRIORITY = [int]$_.Key
        KEY = $keys[$_.Key].Count
        HITS = [int]$_.Value
    }
} | Sort-Object -Property PRIORITY -Descending

Tested on a 500 MB XML file:

PRIORITY KEY    HITS
-------- ---    ----
       1   1 1000000
      -3   2 2000000
     -14   2 3000000
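
The $timer shown below is presumably a Stopwatch wrapped around the script; the answer doesn't show that part, but a minimal sketch could look like this:

$timer = [System.Diagnostics.Stopwatch]::StartNew()
# ... run the switch -Regex script above ...
$timer.Stop()
$timer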

$timer

IsRunning Elapsed          ElapsedMilliseconds ElapsedTicks
--------- -------          ------------------- ------------
    False 00:02:25.7186698              145718    413249113
Lieven Keersmaekers
  • Thanks for your answer; I will test this shortly and share my feedback. – Naga Apr 22 '20 at 18:29
  • I was able to get the script up and running with a small tweak to the code. The script you shared also helped with unexpected issues where the input XML contained control characters such as STX; because the solution is regex-based, those were handled as well. And again, the script was pretty fast in getting the job done. – Naga Apr 26 '20 at 19:17
  • Earlier, I had a backup plan to rely on a Python script, as it is traditionally known to be efficient, but in my case your script worked for me. Thanks! – Naga Apr 26 '20 at 19:20
  • 1
    Tested with a large XML file (2.4 GB) and it gets completed by 16 mins which is amazing. – Naga Apr 28 '20 at 03:06
  • @Naga - Thanks for the feedback. Can you share what the issue was so I can improve the answer? – Lieven Keersmaekers Apr 28 '20 at 06:12
  • 1
    one of my keytest case which is focused on logic of the script was failing - details of the test case is updated in my other post https://stackoverflow.com/questions/61339705/parse-xml-to-extract-data-with-grouping-in-powershell/61340653?noredirect=1#comment108545075_61340653. To solve this I pushed KEY and priority to a hash-table, then again KEY and hits to another hash-table and i had loop thru both the tables to group and get the maximum priority and sum of the hits. – Naga Apr 28 '20 at 06:58
  • Looping through hashtables indeed took a toll on performance; after finalizing the logic, the script now takes around 35 minutes for a 2.4 GB file. If you can help improve my logic, that would be great. – Naga Apr 28 '20 at 07:00
  • Is it possible to write and append output to a .txt file while the script is in progress? – Naga Apr 30 '20 at 14:53
  • Yes, but that would get pretty convoluted. – Lieven Keersmaekers Apr 30 '20 at 17:17
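
A hypothetical reconstruction of the two-hashtable variant Naga describes in the comments above. The variable names and the exact grouping rule (each key is counted only under the highest priority it was seen with, and all of its hits are summed there) are assumptions based on the comments, not Naga's actual code:

$maxPriority = @{} # KEY -> highest PRIORITY seen for that key
$keyHits     = @{} # KEY -> total number of occurrences of that key
switch -Regex -File $env:temp\test.xml
{
    '^\s+<PRIORITY>(?<priority>-?\d+)'
    {
        # In the fixed format, PRIORITY always precedes KEY within an <H> element
        $currentPriority = [int]$matches.Priority
        continue
    }
    '^\s+<KEY>(?<key>[\w-]+)'
    {
        $key = $matches.Key
        $keyHits[$key] = $keyHits[$key] + 1
        if (-not $maxPriority.ContainsKey($key) -or $maxPriority[$key] -lt $currentPriority)
        {
            $maxPriority[$key] = $currentPriority
        }
    }
}

# Group keys by their maximum priority, counting distinct keys and summing their hits
$maxPriority.GetEnumerator() | Group-Object -Property Value | ForEach-Object {
    $hits = 0
    foreach ($entry in $_.Group) { $hits += $keyHits[$entry.Key] }
    [PSCustomObject]@{
        PRIORITY = [int]$_.Name
        KEY      = $_.Count
        HITS     = $hits
    }
} | Sort-Object -Property PRIORITY -Descending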

I guess this is one of those examples where the focus has been on a single command (Runtime of Foreach-Object vs Foreach loop) rather than on the complete solution.

In general, I would recommend looking at your whole solution and not just at single statements, as the performance of a complete (PowerShell) solution is supposed to be better than the sum of its parts.

In your case, if you wrap the code in a script block and invoke it with the call operator & just because you want to use the foreach statement, you are probably overshooting the goal:

For the small file you supplied, this (using the pipeline with ForEach-Object):

$groups | ForEach-Object {
    [PSCustomObject]@{
        PRIORITY = [int]$_.Name
        KEY = ($_.Group.KEY | Select-Object -Unique).Count
        HITS = $_.Count
    }
} | Sort-Object -Property PRIORITY -Descending # | Out-File -FilePath output.txt

appears generally faster than this (using the foreach statement and the call operator):

& {
    foreach ($group in $groups)
    {
        [PSCustomObject]@{
            PRIORITY = [int]$group.Name
            KEY = ($group.Group.KEY | Select-Object -Unique).Count
            HITS = $group.Count
        }
    }
} | Sort-Object -Property PRIORITY -Descending | Out-File -FilePath output.txt

Due to the nature of the Sort-Object cmdlet (it needs all objects before it can sort them), it has to stall the pipeline to reorder them; for the same reason, a multithreading approach will probably not make much sense here.
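
If you want to check this on your own data, both variants can be timed with Measure-Command. A minimal sketch, assuming $groups has already been populated as in the question:

# Time both variants; $groups must already exist
$pipeline = Measure-Command {
    $groups | ForEach-Object {
        [PSCustomObject]@{ PRIORITY = [int]$_.Name; KEY = ($_.Group.KEY | Select-Object -Unique).Count; HITS = $_.Count }
    } | Sort-Object -Property PRIORITY -Descending
}
$statement = Measure-Command {
    & { foreach ($group in $groups) {
            [PSCustomObject]@{ PRIORITY = [int]$group.Name; KEY = ($group.Group.KEY | Select-Object -Unique).Count; HITS = $group.Count }
    } } | Sort-Object -Property PRIORITY -Descending
}
'ForEach-Object: {0:n0} ms, foreach statement: {1:n0} ms' -f $pipeline.TotalMilliseconds, $statement.TotalMilliseconds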

iRon