I currently have to parse around 35,000 to 50,000 log files to extract lines of interest. Due to limitations and policies, I have to do it in PowerShell without any external libraries.
Each log is between 100 kB and 1000 kB in size.
I write the results into a single file, which ends up with about 5 million to 7 million lines.
The performance is gruesome... it takes around 1 hour and 15 minutes to parse 50000 logs and write the results into the output file.
I only know that this is bad for performance:
if (($result[1..8] -join "").Trim() -ne "") {
The nested loop also has a complexity of O(V*E), with V files and E lines per file (correct me if I'm wrong):
foreach ($file in $fileList) {
...
while (($line = $reader.ReadLine()) -ne $null) {
...
The $search variable holds the string "Custom Log Entry: ".
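Because $search is a literal string rather than a pattern, one option is to skip the regex-based -Match and let a single IndexOf call serve as both the test and the position lookup. A minimal sketch, assuming an ordinal (case-sensitive) comparison is acceptable; -Match is case-insensitive, so `OrdinalIgnoreCase` would be the closer equivalent if case varies in the real logs. The variable `$payload` is a hypothetical name, and the sample line follows the log format of this question.

```powershell
$search       = "Custom Log Entry: "
$searchLength = $search.Length
$line = "Sat Oct 2 00:20:12 2021 Info: XZY 000000000 Custom Log Entry: IMPORTANT LINE"

# One ordinal IndexOf call both tests for the marker and locates it,
# so no regex engine is involved and the line is scanned only once.
$idx = $line.IndexOf($search, [System.StringComparison]::Ordinal)
if ($idx -ge 0) {
    # Everything after the marker is the relevant data.
    $payload = $line.Substring($idx + $searchLength).Trim()
}
```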
As requested, here is an example of the log files' content:
Sat Oct 2 00:20:12 2021 Info: A String with some Info: mail.address@domain.com
Sat Oct 2 00:20:12 2021 Info: Second string with some info
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 some information about the current line
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 some action has happened: action
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 something was used: used object
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 some information about the current line
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 some information about the current line
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 Some Data: verdict negative
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 Custom Log Entry: IMPORTANT LINE
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 some information
Sat Oct 2 00:20:12 2021 Custom Log Entry: IMPORTANT LINE
I looked at foreach -parallel (...), but the limitations of Workflows are just terrible...
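Since Workflows are off the table, one built-in alternative for parallelism is a runspace pool from System.Management.Automation, which needs no external libraries. The following is a minimal sketch under assumptions, not a tuned implementation: `Invoke-ParallelSearch`, its parameters, and the one-job-per-file layout are hypothetical, and the worker only returns matching lines instead of building the CSV rows.

```powershell
function Invoke-ParallelSearch {
    param([string[]] $Paths, [string] $Search, [int] $MaxThreads = 4)

    # Hypothetical worker: emits every line of one file that contains $search.
    $worker = {
        param([string] $path, [string] $search)
        foreach ($line in [System.IO.File]::ReadLines($path)) {
            if ($line.IndexOf($search, [System.StringComparison]::Ordinal) -ge 0) {
                $line
            }
        }
    }

    $pool = [runspacefactory]::CreateRunspacePool(1, $MaxThreads)
    $pool.Open()
    $jobs = New-Object System.Collections.Generic.List[object]
    try {
        # Queue one asynchronous job per file on the shared pool.
        foreach ($path in $Paths) {
            $ps = [powershell]::Create()
            $ps.RunspacePool = $pool
            [void] $ps.AddScript($worker.ToString()).AddArgument($path).AddArgument($Search)
            $jobs.Add([pscustomobject]@{ Shell = $ps; Handle = $ps.BeginInvoke() })
        }
        # Collect results in submission order; EndInvoke blocks until a job is done.
        foreach ($job in $jobs) {
            $job.Shell.EndInvoke($job.Handle)
            $job.Shell.Dispose()
        }
    }
    finally {
        $pool.Close()
        $pool.Dispose()
    }
}
```

With tens of thousands of files, handing each job a batch of paths instead of a single file would likely reduce scheduling overhead. Results come back in the order of `$Paths`, so a single writer can append them to the output file.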
Maybe just open each file, write it to a MemoryStream and then process all (RAM isn't an issue)?
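The read-everything-first idea does not strictly need a MemoryStream: `[System.IO.File]::ReadAllText` loads a whole log in one call, and a single `[regex]::Matches` pass can then pull out every entry without a per-line PowerShell loop. A rough sketch, assuming the files fit comfortably in memory; `Get-EntryPayloads` is a hypothetical name, and it returns only the text after the marker, so the time stamp at the start of each line would still have to be captured separately. Whether this beats the StreamReader loop would have to be measured.

```powershell
function Get-EntryPayloads {
    param([string] $Path, [string] $Search)

    # Load the entire file into one string.
    $text = [System.IO.File]::ReadAllText($Path)

    # Escape the literal marker, then capture the rest of the line.
    # [^\r\n]* stops at the line break, so CRLF files leave no stray \r.
    $pattern = [regex]::Escape($Search) + '([^\r\n]*)'

    foreach ($m in [regex]::Matches($text, $pattern)) {
        $m.Groups[1].Value.Trim()
    }
}
```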
Can you guys give me any advice on how to speed this up?
Here's a more thorough look at the code:
try {
    # Output stream in which we write.
    $outStream = New-Object System.IO.FileStream( `
        "C:\Users\anon\outfile.csv", `
        [System.IO.FileMode]::Create, `
        [System.IO.FileAccess]::Write, `
        [System.IO.FileShare]::Read)
    # Writer object which is used to write to stream.
    $outWrite = New-Object System.IO.StreamWriter($outStream)
    # Iterate through files.
    foreach ($file in $fileList) {
        try {
            # Create reader stream for log.
            $reader = New-Object System.IO.StreamReader($file)
            # Length of the date stamp in the file name.
            $fileNameDateLen = $fileNameDateFormat.Length
            $fileNameDate = $file.Substring($file.Length - 2 - $fileNameDateLen, $fileNameDateLen)
            # Convert to usable DateTime object.
            $fileNameDateConverted = ([System.DateTime]::ParseExact(`
                $fileNameDate, `
                $fileNameDateFormat, `
                [System.Globalization.CultureInfo]::InvariantCulture))
            # Change format of extracted file name date.
            $fileNameDateConverted = $fileNameDateConverted.Date.ToString("yyyy-MM-dd")
            # StringBuilder for storing row values.
            $rowBuffer = New-Object System.Text.StringBuilder
            # Iterate through lines.
            while (($line = $reader.ReadLine()) -ne $null) {
                # Validate line.
                if ($line -Match $search) {
                    # Calc position of relevant data.
                    $pos = $line.IndexOf($search) + $searchLength
                    # Actual length of relevant data.
                    $relLength = $line.Length - $pos
                    # Extract relevant data.
                    $result = $line.Substring($pos, $relLength).Trim().Split(';')
                    # Check if line is empty.
                    if (($result[1..8] -join "").Trim() -ne "") {
                        # Get timestamp from line.
                        #$timeValue = $timeRegex.Match($line).Value
                        $timeValue = $line.Substring(12, 8)
                        # Combine date from file name with time.
                        $dateString = "$fileNameDateConverted $timeValue"
                        # Format timestamp.
                        $timeStamp = Get-Date $dateString -Format "yyyy-MM-dd HH:mm:ss"
                        # Format last result.
                        $result[8] = $result[8] -Replace "^""|""$"
                        # Create CSV row.
                        [void] $rowBuffer.AppendLine("$timeStamp;$($result[1]);$($result[2]);" `
                            + "$($result[3]);$($result[4]);$($result[5]);" `
                            + "$($result[6]);$($result[7]);$($result[8])")
                    }
                }
            }
            # Write results to file.
            $outWrite.Write($rowBuffer.ToString())
            # Clear buffer.
            [void] $rowBuffer.Clear()
            # Close input.
            [void] $reader.Close()
            # Free input memory.
            [void] $reader.Dispose()
        }
        catch {
            if ($rowBuffer -ne $null) {
                [void] $rowBuffer.Clear()
            }
            if ($reader -ne $null) {
                [void] $reader.Close()
                [void] $reader.Dispose()
            }
        }
    }
    # $sp is presumably a Stopwatch started before this excerpt (not shown).
    $sp.Stop()
    Write-Host "Finished after $($sp.Elapsed)"
}
catch {
    if ($outWrite -ne $null) {
        [void] $outWrite.Dispose()
    }
    if ($outStream -ne $null) {
        [void] $outStream.Dispose()
    }
}
finally {
    # Close and free output.
    [void] $outWrite.Close()
    [void] $outStream.Close()
    [void] $outWrite.Dispose()
    [void] $outStream.Dispose()
}
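For comparison, here is a hedged sketch of how the per-line work in the loop above might be slimmed down. `Convert-LogLine` and its parameters are hypothetical names. The sketch assumes that the file date is already formatted as yyyy-MM-dd and the time slice as HH:mm:ss, so plain string concatenation can stand in for the per-line Get-Date call. It also skips entries with fewer than nine fields, which is stricter than the code above. The time offset is a parameter because it depends on the real log layout; the code above uses 12.

```powershell
function Convert-LogLine {
    param(
        [string] $Line,
        [string] $FileDate,        # assumed to be formatted as yyyy-MM-dd already
        [string] $Search,
        [int]    $TimeOffset = 12  # position of HH:mm:ss in the line
    )

    # One ordinal IndexOf replaces the -Match test plus the second IndexOf.
    $idx = $Line.IndexOf($Search, [System.StringComparison]::Ordinal)
    if ($idx -lt 0) { return $null }

    $fields = $Line.Substring($idx + $Search.Length).Trim().Split(';')
    # Assumption: entries with fewer than nine fields are not wanted.
    if ($fields.Length -lt 9) { return $null }

    # Emptiness check that stops at the first non-blank field
    # instead of joining fields 1..8 into a new string.
    $hasData = $false
    for ($i = 1; $i -le 8; $i++) {
        if ($fields[$i].Trim().Length -gt 0) { $hasData = $true; break }
    }
    if (-not $hasData) { return $null }

    $time = $Line.Substring($TimeOffset, 8)
    # Strip one leading and one trailing quote from the last field, as above.
    $fields[8] = $fields[8] -replace '^"|"$'

    # Date and time are concatenated as-is; fields 1..8 are joined in one call.
    "$FileDate $time;" + [string]::Join(';', $fields, 1, 8)
}
```

Called once per line, the function returns either a finished CSV row or `$null`, so the caller can pass non-null results straight to `$outWrite.WriteLine(...)` without the intermediate StringBuilder. For the sample lines as printed in this question, the time starts at offset 10, so `-TimeOffset 10` would apply there.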