2

I have two XML files: A.xml and B.xml.
Each XML contains number of events like that:

  <Event>
    <EventData Name="Time">09/10/2017 12:54:16</EventData>
    <EventData Name="WorkstationName">USER2-PC</EventData>
    <EventData Name="UserName">user2</EventData>
  </Event>  

The values are examples.

I want to search if there is an event in A.xml that has the same "WorkstationName" and "UserName" values in B.xml.

For example, this is the XMLs:
A.xml

<?xml version="1.0" encoding="UTF-8"?>
<Events>
  <Event>
    <EventData Name="Time">09/10/2017 12:54:16</EventData>
    <EventData Name="WorkstationName">USER2-PC</EventData>
    <EventData Name="UserName">user2</EventData>
  </Event>  
</Events>

B.xml

<?xml version="1.0" encoding="UTF-8"?>
<Events>
   <Event>
    <EventData Name="Time">09/10/2017 14:54:16</EventData>
    <EventData Name="WorkstationName">USER1-PC</EventData>
    <EventData Name="UserName">user1</EventData>
  </Event>
  <Event>
    <EventData Name="Time">09/10/2017 13:54:16</EventData>
    <EventData Name="WorkstationName">USER2-PC</EventData>
    <EventData Name="UserName">user2</EventData>
  </Event>    
</Events>

Expected result:

  <Event>
    <EventData Name="Time">09/10/2017 13:54:16</EventData>
    <EventData Name="WorkstationName">USER2-PC</EventData>
    <EventData Name="UserName">user2</EventData>
  </Event> 

I wrote a code that does it:
The algorithm:

$fileA = "C:\tmp\A.xml"
$fileB = "C:\tmp\B.xml"

$a = New-Object Xml.XmlDocument
$a.Load($fileA)

$b = New-Object Xml.XmlDocument
$b.Load($fileB)

$pc = ($event.EventData | Where-Object {$_.Name -eq "WorkstationName"})."#text"
$username = ($event.EventData | Where-Object {$_.Name -eq "UserName"})."#text"

$result = $b.Events.Event | Where-Object {

(($_.EventData | where-object {$_.Name -eq "WorkstationName"})."#text" -eq $pc) -and 
(($_.EventData | where-object {$_.Name -eq "UserName"})."#text" -eq $username)

}

$result.EventData

The problem is when I work with large B.XML file (~250,000 lines).
I wrote a code that will create both XML examples (small A.XML file and large B.XML file):

function createXMLFiles($numberOfLinesToCreateInB){
    $legitXmlPrefix = @(0xef, 0xbb, 0xbf, 0x3c, 0x3f, 0x78, 0x6d, 0x6c, 0x20, 0x76, 0x65, 0x72, 0x73, 0x69, 0x6f, 0x6e, 0x3d, 0x22, 0x31, 0x2e, 0x30, 0x22, 0x20, 0x65, 0x6e, 0x63, 0x6f, 0x64, 0x69, 0x6e, 0x67, 0x3d, 0x22, 0x75, 0x74, 0x66, 0x2d, 0x38, 0x22, 0x3f, 0x3e, 0x0d, 0x0a, 0x3c, 0x45, 0x76, 0x65, 0x6e, 0x74, 0x73, 0x3e, 0x0d, 0x0a)
    $XMLEnd = @(0x0d, 0x0a, 0x3c, 0x2f, 0x45, 0x76, 0x65, 0x6e, 0x74, 0x73, 0x3e)
    $enc = [system.Text.Encoding]::UTF8

    $aXML = @"
  <Event>
    <EventData Name="Time">09/10/2017 12:54:16</EventData>
    <EventData Name="WorkstationName">USER2-PC</EventData>
    <EventData Name="UserName">user2</EventData>
  </Event>  
"@

    $data1 = $enc.GetBytes($aXML) 
    $newXmlFile = "c:\tmp\A.xml"
    $newArr = $legitXmlPrefix + $data1 + $XMLEnd
    [io.file]::WriteAllBytes($newXmlFile, $newArr)

    $bXML = @"
   <Event>
    <EventData Name="Time">09/10/2017 14:54:16</EventData>
    <EventData Name="WorkstationName">USER1-PC</EventData>
    <EventData Name="UserName">user1</EventData>
  </Event>
  <Event>
    <EventData Name="Time">09/10/2017 13:54:16</EventData>
    <EventData Name="WorkstationName">USER2-PC</EventData>
    <EventData Name="UserName">user2</EventData>
  </Event>    
"@


    $newXmlFile = "c:\tmp\B.xml"
    $data1 = $enc.GetBytes($bXML) 
    $newArr = $legitXmlPrefix

    $additionals = @"
   <Event>
    <EventData Name="Time">09/10/2017 14:54:16</EventData>
    <EventData Name="WorkstationName">USER1-PC</EventData>
    <EventData Name="UserName">user1</EventData>
  </Event>    `n
"@


    $data2 = $enc.GetBytes($additionals) 
    if($numberOfLinesToCreateInB -gt 0){
        $data2 = $data2 * $numberOfLinesToCreateInB
        $newArr += $data2 
    }
    $newArr += $data1
    $newArr += $XMLEnd
    [io.file]::WriteAllBytes($newXmlFile, $newArr)
}

createXMLFiles 50000

If you will run the algorithm I wrote you will see that it takes long time to find the event from A.XML in B.XML.
This is because the event in A.XML is the last event in B.XML so only when it will meet the last node at B.XML it will finish.

Is there an option to make it more efficient ?
I though maybe to use multi-thread by dividing the sections: One thread will search between event 0..1000, second will search 1001..2000, etc.
But maybe you have better solution.

References:
How can i use XmlReader in PowerShell to stream big/huge XML files?
How to create a new System.Xml.Linq.XElement with PowerShell

EDIT (FAST):
I tried with XPATH. It still work slow when passing over all the events:

Select-Xml -Path $fileB -XPath "/Events/Event" | Where-Object {

   (($events[0].Node.EventData | where-object {$_.Name -eq "WorkstationName"})."#text" -eq $pc) -and 
   (($events[0].Node.EventData | where-object {$_.Name -eq "UserName"})."#text" -eq $username)

}

Following @Tomalak suggestion, I removed almost all the Where-Object pipelines and things started to work much faster.

Measure-Command {
    Select-Xml -Path $fileB -XPath "/Events/Event" | Where-Object {

       ($_.Node.EventData[1]."#text" -eq $pc) -and 
       ($_.Node.EventData[2]."#text" -eq $username)

    }
}

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 6
Milliseconds      : 253
Ticks             : 62535333
TotalDays         : 7.23788576388889E-05
TotalHours        : 0.00173709258333333
TotalMinutes      : 0.104225555
TotalSeconds      : 6.2535333
TotalMilliseconds : 6253.5333

Measure-Command {
    $result = $b.Events.Event | Where-Object {
       ($_.EventData[1]."#text" -eq $pc) -and 
       ($_.EventData[2]."#text" -eq $username)
    }
}


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 17
Milliseconds      : 700
Ticks             : 177006124
TotalDays         : 0.000204868199074074
TotalHours        : 0.00491683677777778
TotalMinutes      : 0.295010206666667
TotalSeconds      : 17.7006124
TotalMilliseconds : 17700.6124

It works much faster, and also the use of -XPath is faster (6 seconds) than the regular passing way (17 seconds).
The test was on 50,000 events, 250,000 XML lines.

EDIT (VERY FAST):
The finale is like this:

Measure-Command {
    Select-Xml -Path $fileB -XPath "/Events/Event[EventData[@Name = 'WorkstationName'] = '$pc' and EventData[@Name = 'UserName'] = '$username']" 
}

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 609
Ticks             : 6099484
TotalDays         : 7.05958796296296E-06
TotalHours        : 0.000169430111111111
TotalMinutes      : 0.0101658066666667
TotalSeconds      : 0.6099484
TotalMilliseconds : 609.9484

Which make it very fast (less than 1 second !).

E235
  • 11,560
  • 24
  • 91
  • 141
  • First, measure how long loading the XML file takes. Just the call to `$b.Load($fileB)`. If that is acceptable, switch to using XPath for finding your node, instead of a Powershell loop and see if that improves things. If even the loading time itself is unacceptable, switch to [XMLReader](https://msdn.microsoft.com/en-us/library/system.xml.xmlreader(v=vs.110).aspx) for processing the file, like you've already suggested ([C# examples](https://stackoverflow.com/q/2441673/18771), [Powershell example](https://stackoverflow.com/q/26820590/18771)). – Tomalak Sep 21 '17 at 00:14
  • @Tomalak, Loading the large XML file with `Load()` is fast (1 second, even less). – E235 Sep 21 '17 at 07:18
  • @Tomalak I tried with XPath, it still work slow when passing all the events: `Select-Xml -Path $fileB -XPath "/Events/Event" | Where-Object { (($events[0].Node.EventData | where-object {$_.Name -eq "WorkstationName"})."#text" -eq $pc) -and (($events[0].Node.EventData | where-object {$_.Name -eq "UserName"})."#text" -eq $username)` – E235 Sep 21 '17 at 08:46
  • 1
    No. Avoid all `where-object` pipeline. Work with XPath alone. – Tomalak Sep 21 '17 at 08:56
  • @Tomalak I can't avoid **all** the `Where-Object` pipelines because I need to compare between fields. But after I removed the couple and make it like this: `Select-Xml -Path $fileB -XPath "/Events/Event" | Where-Object { ($_.Node.EventData[1]."#text" -eq $pc) -and ($_.Node.EventData[2]."#text" -eq $username) }` It worked much faster, took it 5-6 seconds to finish ! Btw, if I used `$b.Events.Event | Where-Object {..}` It took 17 seconds. – E235 Sep 21 '17 at 09:07
  • 1
    You can collapse all of that into a single XPath: `"/Events/Event[EventData[@Name = 'WorkstationName'] = '$pc' and EventData[@Name = 'UserName'] = '$username']"` - this selects any `` matching the `$pc` and `$username`, respectively. When you append `/EventData[@Name = 'Time']/text()` to that path it would only select the time values of such events. – Tomalak Sep 21 '17 at 09:12
  • @Tomalak Wow, it took less than a second ! Please submit it as an answer and I will mark it. Btw, is it possible to use regex ? Like, if I want `EventData[@Name = 'UserName'] -like '$username']` – E235 Sep 21 '17 at 09:23
  • 1
    No, actual regex are not possible, because .NET only supports XPath 1.0 natively, and regex is not a part of that standard. They have been introduced in XPath 2.0 though, you might be able to find stand-alone command line tools that support this version. But it depends on what you want to do, because there *are* basic string functions in XPath 1.0 (`substring()`, `substring-before()`, `substring-after()`, `starts-with()`, `contains()`) that might be sufficient for what you want. (Careful, [the docs](https://www.w3.org/TR/xpath/#section-String-Functions) also mention XPath 2.0 functions.) – Tomalak Sep 21 '17 at 09:28
  • @Tomalak Great. I will check them. You really help me with this. It is super fast :) – E235 Sep 21 '17 at 09:30
  • You can also select "a little too many" XML nodes with a less specific XPath and use a final Powershell `Where-Object` to filter them down, but performance-wise it will best if you can manage pick the right nodes with XPath entirely. – Tomalak Sep 21 '17 at 09:33
  • @Tomalak Yes, this is what I am doing now. It also work very fast so I don't see any reason to use the `-XPath` base string functions. – E235 Sep 21 '17 at 09:46
  • Thanks for posting the performance measurements. :) – Tomalak Sep 21 '17 at 10:04

1 Answers1

1

The slow part of this approach is the complex PowerShell Where-Object pipeline. This is not very efficient with big input documents because it involves the creation of a whole truckload of Powershell-specific, temporary wrapper objects that Where-Object can actually use.

The tool best-suited for efficiently picking specific nodes from an XML-document is XPath. The Select-Xml cmdlet lets you run an XPath filter against an XML document.

Your code:

$result = $b.Events.Event | Where-Object {
    (($_.EventData | where-object {$_.Name -eq "WorkstationName"})."#text" -eq $pc) -and 
    (($_.EventData | where-object {$_.Name -eq "UserName"})."#text" -eq $username)    
}

translates to English as follows:

  • of all the <Event> nodes in <Events>, pick the ones that
    • have an <EventData> with a @Name of WorkstationName and a value of $pc AND
    • have an <EventData> with a @Name of UserName and a value of $username

which translates to XPath straight-forwardly as follows:

$events_by_user_and_pc = "
    /Events/Event[
        EventData[@Name = 'WorkstationName'] = '$pc' and
        EventData[@Name = 'UserName'] = '$username'
    ]
"

$result = Select-Xml -Path $fileB -XPath $events_by_user_and_pc

...which looks pretty much exactly like the Powershell code. XPath doesn't care about newlines and whitespace, so you can format it nicely.


String functions can be used to make partial comparisons. A path that selects all events for a certain user on any PC whose @WorkstationName starts with USER2 would look like this:

/Events/Event[
    starts-with(EventData[@Name = 'WorkstationName'], 'USER2') and
    EventData[@Name = 'UserName'] = '$username'
]
Tomalak
  • 332,285
  • 67
  • 532
  • 628