I have two XML files: A.xml and B.xml.
Each XML contains number of events like that:
<Event>
<EventData Name="Time">09/10/2017 12:54:16</EventData>
<EventData Name="WorkstationName">USER2-PC</EventData>
<EventData Name="UserName">user2</EventData>
</Event>
The values are examples.
I want to search if there is an event in A.xml that has the same "WorkstationName
" and "UserName"
values in B.xml.
For example, this is the XMLs:
A.xml
<?xml version="1.0" encoding="UTF-8"?>
<Events>
<Event>
<EventData Name="Time">09/10/2017 12:54:16</EventData>
<EventData Name="WorkstationName">USER2-PC</EventData>
<EventData Name="UserName">user2</EventData>
</Event>
</Events>
B.xml
<?xml version="1.0" encoding="UTF-8"?>
<Events>
<Event>
<EventData Name="Time">09/10/2017 14:54:16</EventData>
<EventData Name="WorkstationName">USER1-PC</EventData>
<EventData Name="UserName">user1</EventData>
</Event>
<Event>
<EventData Name="Time">09/10/2017 13:54:16</EventData>
<EventData Name="WorkstationName">USER2-PC</EventData>
<EventData Name="UserName">user2</EventData>
</Event>
</Events>
Expected result:
<Event>
<EventData Name="Time">09/10/2017 13:54:16</EventData>
<EventData Name="WorkstationName">USER2-PC</EventData>
<EventData Name="UserName">user2</EventData>
</Event>
I wrote a code that does it:
The algorithm:
$fileA = "C:\tmp\A.xml"
$fileB = "C:\tmp\B.xml"
$a = New-Object Xml.XmlDocument
$a.Load($fileA)
$b = New-Object Xml.XmlDocument
$b.Load($fileB)
$pc = ($event.EventData | Where-Object {$_.Name -eq "WorkstationName"})."#text"
$username = ($event.EventData | Where-Object {$_.Name -eq "UserName"})."#text"
$result = $b.Events.Event | Where-Object {
(($_.EventData | where-object {$_.Name -eq "WorkstationName"})."#text" -eq $pc) -and
(($_.EventData | where-object {$_.Name -eq "UserName"})."#text" -eq $username)
}
$result.EventData
The problem is when I work with large B.XML file (~250,000 lines).
I wrote a code that will create both XML examples (small A.XML file and large B.XML file):
function createXMLFiles($numberOfLinesToCreateInB){
$legitXmlPrefix = @(0xef, 0xbb, 0xbf, 0x3c, 0x3f, 0x78, 0x6d, 0x6c, 0x20, 0x76, 0x65, 0x72, 0x73, 0x69, 0x6f, 0x6e, 0x3d, 0x22, 0x31, 0x2e, 0x30, 0x22, 0x20, 0x65, 0x6e, 0x63, 0x6f, 0x64, 0x69, 0x6e, 0x67, 0x3d, 0x22, 0x75, 0x74, 0x66, 0x2d, 0x38, 0x22, 0x3f, 0x3e, 0x0d, 0x0a, 0x3c, 0x45, 0x76, 0x65, 0x6e, 0x74, 0x73, 0x3e, 0x0d, 0x0a)
$XMLEnd = @(0x0d, 0x0a, 0x3c, 0x2f, 0x45, 0x76, 0x65, 0x6e, 0x74, 0x73, 0x3e)
$enc = [system.Text.Encoding]::UTF8
$aXML = @"
<Event>
<EventData Name="Time">09/10/2017 12:54:16</EventData>
<EventData Name="WorkstationName">USER2-PC</EventData>
<EventData Name="UserName">user2</EventData>
</Event>
"@
$data1 = $enc.GetBytes($aXML)
$newXmlFile = "c:\tmp\A.xml"
$newArr = $legitXmlPrefix + $data1 + $XMLEnd
[io.file]::WriteAllBytes($newXmlFile, $newArr)
$bXML = @"
<Event>
<EventData Name="Time">09/10/2017 14:54:16</EventData>
<EventData Name="WorkstationName">USER1-PC</EventData>
<EventData Name="UserName">user1</EventData>
</Event>
<Event>
<EventData Name="Time">09/10/2017 13:54:16</EventData>
<EventData Name="WorkstationName">USER2-PC</EventData>
<EventData Name="UserName">user2</EventData>
</Event>
"@
$newXmlFile = "c:\tmp\B.xml"
$data1 = $enc.GetBytes($bXML)
$newArr = $legitXmlPrefix
$additionals = @"
<Event>
<EventData Name="Time">09/10/2017 14:54:16</EventData>
<EventData Name="WorkstationName">USER1-PC</EventData>
<EventData Name="UserName">user1</EventData>
</Event> `n
"@
$data2 = $enc.GetBytes($additionals)
if($numberOfLinesToCreateInB -gt 0){
$data2 = $data2 * $numberOfLinesToCreateInB
$newArr += $data2
}
$newArr += $data1
$newArr += $XMLEnd
[io.file]::WriteAllBytes($newXmlFile, $newArr)
}
createXMLFiles 50000
If you will run the algorithm I wrote you will see that it takes long time to find the event from A.XML in B.XML.
This is because the event in A.XML is the last event in B.XML so only when it will meet the last node at B.XML it will finish.
Is there an option to make it more efficient ?
I though maybe to use multi-thread by dividing the sections: One thread will search between event 0..1000, second will search 1001..2000, etc.
But maybe you have better solution.
References:
How can i use XmlReader in PowerShell to stream big/huge XML files?
How to create a new System.Xml.Linq.XElement with PowerShell
EDIT (FAST):
I tried with XPATH. It still work slow when passing over all the events:
Select-Xml -Path $fileB -XPath "/Events/Event" | Where-Object {
(($events[0].Node.EventData | where-object {$_.Name -eq "WorkstationName"})."#text" -eq $pc) -and
(($events[0].Node.EventData | where-object {$_.Name -eq "UserName"})."#text" -eq $username)
}
Following @Tomalak suggestion, I removed almost all the Where-Object
pipelines and things started to work much faster.
Measure-Command {
Select-Xml -Path $fileB -XPath "/Events/Event" | Where-Object {
($_.Node.EventData[1]."#text" -eq $pc) -and
($_.Node.EventData[2]."#text" -eq $username)
}
}
Days : 0
Hours : 0
Minutes : 0
Seconds : 6
Milliseconds : 253
Ticks : 62535333
TotalDays : 7.23788576388889E-05
TotalHours : 0.00173709258333333
TotalMinutes : 0.104225555
TotalSeconds : 6.2535333
TotalMilliseconds : 6253.5333
Measure-Command {
$result = $b.Events.Event | Where-Object {
($_.EventData[1]."#text" -eq $pc) -and
($_.EventData[2]."#text" -eq $username)
}
}
Days : 0
Hours : 0
Minutes : 0
Seconds : 17
Milliseconds : 700
Ticks : 177006124
TotalDays : 0.000204868199074074
TotalHours : 0.00491683677777778
TotalMinutes : 0.295010206666667
TotalSeconds : 17.7006124
TotalMilliseconds : 17700.6124
It works much faster, and also the use of -XPath
is faster (6 seconds) than the regular passing way (17 seconds).
The test was on 50,000 events, 250,000 XML lines.
EDIT (VERY FAST):
The finale is like this:
Measure-Command {
Select-Xml -Path $fileB -XPath "/Events/Event[EventData[@Name = 'WorkstationName'] = '$pc' and EventData[@Name = 'UserName'] = '$username']"
}
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 609
Ticks : 6099484
TotalDays : 7.05958796296296E-06
TotalHours : 0.000169430111111111
TotalMinutes : 0.0101658066666667
TotalSeconds : 0.6099484
TotalMilliseconds : 609.9484
Which make it very fast (less than 1 second !).