Hello dear fellow PowerShell users,

I'm trying to parse XML files that can differ in structure. I therefore want to access node values using a node path that is supplied in a variable.

Example

#XML file
$xml = [xml] @'
<node1>
    <node2>
        <node3>
            <node4>test1</node4>
        </node3>
    </node2>
</node1>
'@

Accessing the values directly works.

#access XML node directly -works-
$xml.node1.node2.node3.node4        # working <OK>

Accessing the values via a node path stored in a variable does not work.

#access XML node via path from variable -does not work-
$testnodepath = 'node1.node2.node3.node4'

$xml.$testnodepath                  # NOT working
$xml.$($testnodepath)               # NOT working

Is there a way to access the XML node values directly when the node path comes from a variable?

PS: I am aware that there is a way via SelectNodes, but I assume that it is inefficient, since it is basically searching for keywords.

#Working - but inefficient
$testnodepath = 'node1/node2/node3/node4'
$xml.SelectNodes($testnodepath)
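
For reference, a minimal sketch of how the dotted path from the variable could be translated into an XPath expression, assuming the element names contain no dots and the document uses no XML namespaces. `SelectSingleNode` and `InnerText` are standard `System.Xml` members; the `$xpath` variable name is only illustrative.

```powershell
# Sample document from the question
$xml = [xml] '<node1><node2><node3><node4>test1</node4></node3></node2></node1>'

# Dotted path as received from a variable
$testnodepath = 'node1.node2.node3.node4'

# Translate 'a.b.c' into the absolute XPath '/a/b/c'
$xpath = '/' + ($testnodepath -replace '\.', '/')

# SelectSingleNode returns the first matching node, or $null if nothing matches
$node = $xml.SelectSingleNode($xpath)
$node.InnerText   # test1
```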

I need a very efficient way of parsing the XML file since I will need to parse huge XML files. Is there a way to directly access the node values in the form $xml.node1.node2.node3.node4 by receiving the node structure from a variable?

  • *"I assume that is inefficient since it basically searching for keywords."* - Don't assume things. XPath is extremely efficient (compared to anything PowerShell does "natively" with the XML, at the very least.) – Tomalak Feb 24 '22 at 08:29
  • *"Is there a way to directly access the node values in the form $xml.node1.node2.node3.node4 by receiving the node structure from a variable?"* - This is unclear. What does "directly access" mean? What does "receiving the node structure" mean? What is "huge", exactly? How fast is "very efficient" (and compared to what?). Show samples of your input, talk file sizes, describe your desired output, show what you have tried, take measurements to give meaning to the words "efficient"/"inefficient". – Tomalak Feb 24 '22 at 08:33
  • Well, if it was unclear, then sorry. The basic question is how I can make this work: `$testnodepath = 'node1.node2.node3.node4'` `$xml.$testnodepath # NOT working` – stout.johnson Feb 24 '22 at 08:48
  • Answer the questions I have asked. There are several. – Tomalak Feb 24 '22 at 09:10
  • Well, "directly access" means via node1.node2.node3.node4 but from a variable. With "receiving the node structure" I mean that the "nodeA.nodeB...." comes from a variable. "Huge" means that I have to process XML files with a size of up to 100 MB, possibly even more in the future. "Very efficient" refers to the most efficient way possible, especially compared to SelectNodes. – stout.johnson Feb 24 '22 at 10:25
  • If you need max. performance with big XML files, you might want to take a step back and drop all (DOM) API that require you to read the whole document into memory. Have a look at [`XmlReader`](https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlreader#examples) which provides only basic parsing abilities but is one of the most performant ways to process XML. You need to build up a path of where you currently are in the document, so you can match your search path "nodeA.nodeB..." to the current location. – zett42 Feb 24 '22 at 11:14
  • Thank you zett42 for that proposal, I will look into XmlReader. Regards, Stout – stout.johnson Feb 24 '22 at 13:21

4 Answers

You might use the ExecutionContext ExpandString for this:

$ExecutionContext.InvokeCommand.ExpandString("`$(`$xml.$testnodepath)")
test1

If the node path ($testnodepath) comes from outside (e.g. a parameter), you might want to prevent malicious code injection by stripping out any character that is not a word character or a dot (.):

$securenodepath = $testnodepath -Replace '[^\w\.]'
$ExecutionContext.InvokeCommand.ExpandString("`$(`$xml.$securenodepath)")
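
As a variation on the sanitizing step above, here is a sketch that rejects an unexpected path instead of silently stripping characters from it. The pattern and the error message are assumptions for illustration, not part of the original answer; the `ExpandString` call is the same as above.

```powershell
$xml = [xml] '<node1><node2><node3><node4>test1</node4></node3></node2></node1>'
$testnodepath = 'node1.node2.node3.node4'

# Hypothetical guard: accept only dot-separated runs of word characters.
# \A and \z anchor the match to the whole string.
if ($testnodepath -notmatch '\A\w+(\.\w+)*\z') {
    throw "Unexpected characters in node path: $testnodepath"
}

# Same expansion as above, now only reached for a validated path
$value = $ExecutionContext.InvokeCommand.ExpandString("`$(`$xml.$testnodepath)")
$value   # test1
```

Failing loudly makes a malformed or hostile path visible to the caller, whereas stripping characters can quietly turn it into a different, valid-looking path.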
iRon

You can split the string containing the property path into individual names and then dereference them 1 by 1:

# define path
$testnodepath = 'node1.node2.node3.node4'

# create a new variable, this will be our intermediary for keeping track of each node/level we've resolved so far
$target = $xml

# now we just loop through each node name in the path
foreach($nodeName in $testnodepath.Split('.')){
  # keep advancing down through the path, 1 node name at a time
  $target = $target.$nodeName
}

# this now resolves to the same value as `$xml.node1.node2.node3.node4`
$target
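
The loop above could be wrapped in a small helper and timed against the hard-coded form. This is only a sketch: `Get-XmlValueByPath` and the iteration count are hypothetical names and values chosen for illustration, and the timings will vary by machine and PowerShell version.

```powershell
# Hypothetical helper wrapping the loop from above
function Get-XmlValueByPath {
    param(
        [Parameter(Mandatory)] $Root,
        [Parameter(Mandatory)] [string] $Path
    )
    $target = $Root
    foreach ($name in $Path.Split('.')) {
        # Dereference one path segment per iteration
        $target = $target.$name
    }
    $target
}

$xml = [xml] '<node1><node2><node3><node4>test1</node4></node3></node2></node1>'
$testnodepath = 'node1.node2.node3.node4'

Get-XmlValueByPath -Root $xml -Path $testnodepath   # test1

# Rough timing comparison; take the numbers as indicative only
$iterations = 10000
$direct = Measure-Command {
    foreach ($i in 1..$iterations) { $null = $xml.node1.node2.node3.node4 }
}
$looped = Measure-Command {
    foreach ($i in 1..$iterations) { $null = Get-XmlValueByPath -Root $xml -Path $testnodepath }
}
'direct: {0:N0} ms, looped: {1:N0} ms' -f $direct.TotalMilliseconds, $looped.TotalMilliseconds
```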
Mathias R. Jessen
  • Thank you for this advice. It seems to work as well, but it is not a definitive way to resolve the path, more a sort of search for the node, so I think it will not be the most efficient way. But thank you very much; it might be very helpful in other situations. Regards, Stout – stout.johnson Feb 24 '22 at 13:13
  • @stout.johnson It _is_ the most efficient way. What do you mean by "not definitive"? – Mathias R. Jessen Feb 24 '22 at 13:17
  • From my understanding, it is not the most efficient way. Your solution would be the best if I did not have the exact path to the node I want. I do have the exact path to the node; it is stored in a variable. I just need to resolve the path from the variable to access the node. Looping through all nodes is not necessary in my case. I hope that is clear. – stout.johnson Feb 24 '22 at 13:31
  • Looping through all the nodes _is exactly what `$xml.node1.node2.node3.node4` does in the first place_ :) – Mathias R. Jessen Feb 25 '22 at 12:30
  • And is it equally fast? I will test it. I was under the assumption that the loop must take longer than accessing via "$xml.node1.node2.node3.node4". If my testing shows otherwise, your solution might be really good. – stout.johnson Feb 28 '22 at 10:49

> I will need to parse huge XML files

The following presents a memory-friendly streaming approach that doesn't require loading the whole XML document (DOM) into memory, so you can parse really huge XML files even if they don't fit into memory. It should also improve parsing speed, as we can simply skip elements that we are not interested in. To accomplish this, we use System.Xml.XmlReader to process XML elements on the fly, while they are read from the file.

I've wrapped the code in a reusable function:

Function Import-XmlElementText( [String] $FilePath, [String[]] $ElementPath ) {

    $stream = $reader = $null

    try {
        $stream = [IO.File]::OpenRead(( Convert-Path -LiteralPath $FilePath )) 
        $reader = [System.Xml.XmlReader]::Create( $stream )

        $curElemPath = ''  # The current location in the XML document

        # While XML nodes are read from the file
        while( $reader.Read() ) {
            switch( $reader.NodeType ) {
                ([System.Xml.XmlNodeType]::Element) {
                    if( -not $reader.IsEmptyElement ) {
                        # Start of a non-empty element -> add to current path
                        $curElemPath += '/' + $reader.Name
                    }
                }
                ([System.Xml.XmlNodeType]::Text) {
                    # Element text -> collect if path matches
                    if( $curElemPath -in $ElementPath ) {
                        [PSCustomObject]@{
                            Path  = $curElemPath
                            Value = $reader.Value
                        }
                    }
                }
                ([System.Xml.XmlNodeType]::EndElement) {
                    # End of element - remove current element from the path
                    $curElemPath = $curElemPath.Substring( 0, $curElemPath.LastIndexOf('/') ) 
                }
            }
        }
    }
    finally {
        if( $reader ) { $reader.Close() }
        if( $stream ) { $stream.Close() }
    }
}

Call it like this:

Import-XmlElementText -FilePath test.xml -ElementPath '/node1/node2a/node3a', '/node1/node2b'

Given this input XML:

<node1>
    <node2a>
        <node3a>test1</node3a>
        <node3b/>
        <node3c a='b'/>
        <node3d></node3d>
    </node2a>
    <node2b>test2</node2b>
</node1>

This output is produced:

Path                 Value
----                 -----
/node1/node2a/node3a test1
/node1/node2b        test2

The function actually outputs objects, which can be processed by pipeline commands as usual or stored in an array:

$foundElems = Import-XmlElementText -FilePath test.xml -ElementPath '/node1/node2a/node3a', '/node1/node2b'

$foundElems[1].Value  # Prints 'test2'

Notes:

  • Convert-Path is used to convert a PowerShell path (aka PSPath), which might be relative, to an absolute path that can be used by .NET functions. This is required because .NET uses a different current directory than PowerShell, and a PowerShell path can be in a form that .NET doesn't even understand (e.g. Microsoft.PowerShell.Core\FileSystem::C:\something.txt).
  • When encountering the start of an element, we have to skip empty elements such as <node/>. For such elements the reader never enters the EndElement case branch, so adding them would leave the current path ($curElemPath) invalid (the element would never be removed from the current path again).
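
The second note can be checked with a small sketch that lists the nodes XmlReader reports for a self-closing element versus an element with separate start and end tags. The sample XML string and the `$events` variable are made up for illustration.

```powershell
# <a/> is self-closing; <b></b> has separate start and end tags
$text = '<r><a/><b></b></r>'
$reader = [System.Xml.XmlReader]::Create([System.IO.StringReader]::new($text))

$events = @(
    while ($reader.Read()) {
        # Record node type, element name and the IsEmptyElement flag
        '{0}:{1}:{2}' -f $reader.NodeType, $reader.Name, $reader.IsEmptyElement
    }
)
$reader.Close()

$events
# Element:r:False
# Element:a:True
# Element:b:False
# EndElement:b:False
# EndElement:r:False
```

No EndElement is reported for `a`, which is why the function must not append such elements to the current path.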
zett42
  • zett42, thank you very much for this detailed answer and function. It looks very promising. I will definitely test it and compare with the solution I have for large files. Your help is very much appreciated. – stout.johnson Feb 24 '22 at 14:24
  • @stout.johnson I just did a quick test with a big XML file. Actually my code is slower than using `[xml]`. Then I rewrote the function using C# and it finally went faster than `[xml]` by some margin (roughly 4x faster, but I didn't measure precisely)! I will update this answer with the C# code later. – zett42 Feb 24 '22 at 17:04

I have a similar requirement; however, mine is to set values on nodes that are referenced through a variable. We need this ability so that one script can reference different psd1 files and set the information correctly. Hard-coding paths means we need multiple scripts to do the same thing. As you can imagine, this is a nightmare.

... The following works.

[XML]$doc = Get-Content $my_xml_file
$xml_cfg = Import-LocalizedData -FileName xml_information.psd1
$xml_path = "FinData.Header.Hdrinfo.From.CpnyId.Id.StoreId.Report.Id"
$doc.FinData.Header.Hdrinfo.From.CpnyId.Id.StoreId.Report.Id = $xml_cfg.from_id

However, this fails: $doc.$xml_path = $xml_cfg.from_id

ERROR: "The property 'FinData.Header.Hdrinfo.From.CpnyId.Id.StoreId.Report.Id' cannot be found on this object. Verify that the property exists and can be set."

...

It is a real shame PowerShell cannot handle variable references to objects. Referencing objects using variables works fine in Perl, and limitations of this sort prevent us from migrating all our code to PowerShell.
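
One possible workaround, sketched here under assumptions: the loop approach from the answer above can be adapted for assignment by walking to the parent of the last path segment and setting the final property dynamically. `Set-XmlValueByPath` and the shortened sample document are hypothetical and only illustrate the idea; the sketch assumes the target is an existing element that contains only text.

```powershell
# Hypothetical helper: resolve all but the last segment, then assign
function Set-XmlValueByPath {
    param(
        [Parameter(Mandatory)] $Root,
        [Parameter(Mandatory)] [string] $Path,
        [string] $Value
    )
    $names  = $Path.Split('.')
    $parent = $Root
    # Walk down to the parent of the final segment
    for ($i = 0; $i -lt $names.Count - 1; $i++) {
        $parent = $parent.($names[$i])
        if ($null -eq $parent) {
            throw "Path segment '$($names[$i])' not found"
        }
    }
    # Dynamic property assignment on the parent element
    $leaf = $names[-1]
    $parent.$leaf = $Value
}

# Shortened stand-in for the real document
$doc = [xml] '<FinData><Header><Id>old</Id></Header></FinData>'
$xml_path = 'FinData.Header.Id'

Set-XmlValueByPath -Root $doc -Path $xml_path -Value 'new'
$doc.FinData.Header.Id   # new
```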

Tesh