
I currently have to parse around 35,000 to 50,000 log files to extract lines of interest. Due to limitations and policies I have to do it in PowerShell without any external libraries.

The size of each log is between 100 kB and 1000 kB.

I write the results into a single file which has about 5 million to 7 million lines at the end.

The performance is gruesome... it takes around 1 hour and 15 minutes to parse 50000 logs and write the results into the output file.

I only know that this is bad for performance:

if (($result[1..8] -join "").Trim() -ne "") {

As well as having a nested loop with a complexity of O(F*L) (F files, L lines per file), correct me if I'm wrong:

foreach ($file in $fileList) {
   ...
   while (($line = $reader.ReadLine()) -ne $null) {
   ... 

The $search variable holds the string "Custom Log Entry: "

As requested, here is an example of the log files' content:

Sat Oct 2 00:20:12 2021 Info: A String with some Info: mail.address@domain.com
Sat Oct 2 00:20:12 2021 Info: Second string with some info
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 some information about the current line
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 some action has happened: action
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 something was used: used object
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 some information about the current line
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 some information about the current line
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 Some Data: verdict negative
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 Custom Log Entry: IMPORTANT LINE
Sat Oct 2 00:20:12 2021 Info: XZY 000000000 some information
Sat Oct 2 00:20:12 2021 Custom Log Entry: IMPORTANT LINE

I looked at foreach -parallel (...) but the limitations of Workflows are just terrible...

Maybe just open each file, write it to a MemoryStream and then process all (RAM isn't an issue)?
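Roughly what I have in mind (just a sketch, variable names are illustrative):

```powershell
# Slurp the whole file into memory first, then parse the in-memory copy,
# so all the ReadLine() calls hit RAM instead of disk.
$bytes  = [System.IO.File]::ReadAllBytes($file)
# Note the leading comma: it wraps the byte array so it is passed as a
# single constructor argument instead of being splatted.
$memory = New-Object System.IO.MemoryStream(,$bytes)
$reader = New-Object System.IO.StreamReader($memory)
while ($null -ne ($line = $reader.ReadLine())) {
    # ...same parsing as before...
}
$reader.Dispose()
$memory.Dispose()
```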

Can you guys give me any advice on how to speed this up?

Here's a more thorough look at the code:

try {
    # Output stream in which we write.
    $outStream = New-Object System.IO.FileStream( `
        "C:\Users\anon\outfile.csv", `
        [System.IO.FileMode]::Create, `
        [System.IO.FileAccess]::Write, `
        [System.IO.FileShare]::Read)
    # Writer object which is used to write to stream.
    $outWrite = New-Object System.IO.StreamWriter($outStream)
    # Iterate through files.
    foreach ($file in $fileList) {       

        try {
            # Create reader stream for log.
            $reader = New-Object System.IO.StreamReader($file)    
            # Length of time stamp
            $fileNameDateLen = $fileNameDateFormat.Length 
            $fileNameDate = $file.Substring($file.Length - 2 - $fileNameDateLen, $fileNameDateLen)
            # Convert to usable DateTime object.
            $fileNameDateConverted = ([System.DateTime]::ParseExact(`
                $fileNameDate, `
                $fileNameDateFormat, `
                [System.Globalization.CultureInfo]::InvariantCulture))
            # Change format of extracted file name date.
            $fileNameDateConverted = $fileNameDateConverted.Date.ToString("yyyy-MM-dd")                        
            # StringBuilder for storing row values.
            $rowBuffer = New-Object System.Text.StringBuilder
            # Iterate through files.
            while ($null -ne ($line = $reader.ReadLine())) {
                # Validate line.
                if ($line -Match $search) {       
                    # Calc position of relevant data.
                    $pos = $line.IndexOf($search) + $searchLength
                    # Actual length of relevant data.
                    $relLength = $line.Length - $pos
                    # Extract relevant data.
                    $result = $line.Substring($pos, $relLength).Trim().Split(';')      
                    # Check if line is empty.
                    if (($result[1..8] -join "").Trim() -ne "") {
                        # Get timestamp from line.
                        #$timeValue = $timeRegex.Match($line).Value
                        $timeValue = $line.Substring(12, 8)
                        # Combine date from file name with time.
                        $dateString = "$fileNameDateConverted $timeValue"
                        # Format timestamp.
                        $timeStamp = Get-Date $dateString -Format "yyyy-MM-dd HH:mm:ss"
                        # Format last result.
                        $result[8] = $result[8] -Replace "^""|""$"
                        # Create CSV row.
                        [void] $rowBuffer.AppendLine("$timeStamp;$($result[1]);$($result[2]);" `
                            + "$($result[3]);$($result[4]);$($result[5]);" `
                            + "$($result[6]);$($result[7]);$($result[8])")          
                    }
                }
            }
            # Write results to file.
            $outWrite.Write($rowBuffer.ToString())
            # Clear buffer.
            [void] $rowBuffer.Clear()
            # Close input.
            [void] $reader.Close()
            # Free input memory.
            [void] $reader.Dispose()
        }   
        catch {
            if ($rowBuffer -ne $null) {
                [void] $rowBuffer.Clear()
            }
            if ($reader -ne $null) {
                [void] $reader.Close()
                [void] $reader.Dispose()
            }
        }
    }
    $sp.Stop()
    Write-Host "Finished after $($sp.Elapsed)"
} 
catch {
    if ($outWrite -ne $null) { 
        [void] $outWrite.Dispose()
    }
    if ($outStream -ne $null) {
        [void] $outStream.Dispose()
    }
}
finally {
    # Close and free output.
    [void] $outWrite.Close()
    [void] $outStream.Close()
    [void] $outWrite.Dispose()
    [void] $outStream.Dispose()
}
  • While you may very well get some excellent advice here on StackOverflow, you might want to consider posting questions like this to https://codereview.stackexchange.com/ instead. That said, I'll go take a look at what you have and see if I spot anything. There's a few users here that are amazing at that kind of thing, but I may come across something that could help. – TheMadTechnician Oct 07 '21 at 19:56
  • What is stored in `$search` and what does the input look like? – Mathias R. Jessen Oct 07 '21 at 19:58
  • @TheMadTechnician Thank you, I didn't know about the code review domain :) –  Oct 07 '21 at 20:09
  • You could try pre-compiling your regex - something like ```$myRegex = [Regex]::new($search, "Compiled, IgnoreCase, CultureInvariant")``` and then ```$match = $myRegex.Match($line); if( $match.Success ) { ... }```. The ```$match``` object gives you the start, length and string that matches so you can maybe get to ```$result``` quicker (```$result = $match.Value.Trim().Split(";")```). You'd need to measure performance to see if it actually helps though... – mclayton Oct 07 '21 at 20:10
  • @MathiasR.Jessen $search = "Custom Log Entry: " and the lines that I want look like this "Sat Oct 2 00:08:26 2021 Some: Text I don't care about Custom Log Entry: data;data;data;data;data" and the other log lines only share the date time value at the beginning –  Oct 07 '21 at 20:10
  • @Max Please add a few sample entries [to your question](https://stackoverflow.com/posts/69487053/edit) (scrubbed of anything sensitive if necessary) – Mathias R. Jessen Oct 07 '21 at 20:12
  • @mclayton I will do a performance test on that, thank you! Does the -match operator actually use regex? –  Oct 07 '21 at 20:13
  • @MathiasR.Jessen The other lines only share the time stamp that has a fixed length everything else is completely different and I only look for lines that contain the string "Custom Log Entry: ". I can provide a few examples tomorrow but as I said the only qualifying portion is the "Custom Log Entry: " string which position may vary. –  Oct 07 '21 at 20:17
  • It does use regex under the covers, but I've got no idea whether it optimises repeated calls that use the same pattern, or blindly re-parses your pattern every time. If you create your own regex object you *know* it's pre-compiled, and you get all the positional properties of the ```System.Text.RegularExpressions.Match``` object as well. – mclayton Oct 07 '21 at 20:18
  • @mclayton: PowerShell [performs its own regex caching](https://github.com/PowerShell/PowerShell/blob/e98a8c8601b8f15f479e894323beafb9f6a2300c/src/System.Management.Automation/engine/lang/parserutils.cs#L1208), up to 1000 regexes per distinct set of regex options. [.NET itself performs caching too](https://docs.microsoft.com/en-us/dotnet/standard/base-types/compilation-and-reuse-in-regular-expressions), but only for regexes passed to _static_ method calls. Such caching uses "high-level codes" as the compilation target, whereas only _explicit_ compilation yields MSIL code that can be JITted. – mklement0 Oct 08 '21 at 01:58
  • Your biggest perf gains will be to use multi-threading. Using RunSpaceFactories, move everything inside your loop in a script block. Create the script block so it takes the filepath as an argument. Create a runspace using your script block and pass in the filenames and add them all into your runspacefactory. Because each runspace will run into locking issues writing to a single output file, you can have each thread write to its own log file and then after they're all done processing, you can synchronously consolidate all of the output files. – thepip3r Oct 08 '21 at 03:32
  • The fact that you even have a selfmade CSV solution (rather than using [`ConvertTo-Csv`](https://docs.microsoft.com/powershell/module/microsoft.powershell.utility/convertto-csv)) suggest that you have an unfounded prejudiced look on the PowerShell performance. If you state the PowerShell performance as gruesome you should at least try to use **native PowerShell** commands as the `Switch -File` and cmdlets as `Select-String` and the pipeline. Anyways, to be able adequately answer you and test/show you anything better, we need a [mcve] including the `$search` contents and input examples. – iRon Oct 08 '21 at 15:18
  • @Max Still waiting for you to post an actual log sample... – Mathias R. Jessen Oct 09 '21 at 15:13
  • @MathiasR.Jessen added a few log lines to the original question. Issue is mostly resolved after implementing multi-threading as thepip3r suggested –  Oct 11 '21 at 09:16

1 Answer


I followed thepip3r's recommendation and went with multi-threading, which was by far the most relevant thing to do.

I didn't write the parse results into individual files, as memory just isn't an issue for me in this case. All of my return values are stored in a ConcurrentBag, which is a thread-safe collection.
For everyone with limited memory, you might want to use an atomic append like this (it will slow you down a bit):

    # Atomic write stream object for thread-safe file append.
    $stream = New-Object System.IO.FileStream( `
        $destFile, `
        [System.IO.FileMode]::OpenOrCreate, `
        [System.Security.AccessControl.FileSystemRights]::AppendData, `
        [System.IO.FileShare]::Write, `
        0x2000, `
        [System.IO.FileOptions]::None)

    $writer = New-Object System.IO.StreamWriter($stream)
    # Enable AutoFlush or FileStream will keep on buffering.
    $writer.AutoFlush = $true
    # Get thread results and dispose them.
    foreach ($thread in $parserThreads) {        
        $writer.WriteLine(($thread.Thread.EndInvoke($thread.Status) | Out-String))
        $thread.Thread.Dispose()
    }
    $parserThreads = $null

After utilizing multi-threading, the time went down from 70-83 minutes to 20-33 minutes (depending on how many larger input files were loaded).

Not using Export-Csv or pipes is simply quicker. Pipes are rather slow and, I think, unnecessary most of the time. Using cmdlets like ConvertTo-Csv or Select-String just adds overhead that I neither want nor need for a simple task like this. Collecting the input files was done with pipes in the beginning, similar to this post. The two nested foreach loops with a single if condition are about 200 times faster than the piped version (50 seconds for the piped version, <250 ms for the nested loops)... so yes, I do have a prejudice against PowerShell's performance and pipes.
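For anyone who wants to check the pipeline overhead themselves, a rough micro-benchmark along these lines (illustrative data, not my production code) shows the difference on their own machine:

```powershell
# Compare a piped Where-Object filter against a plain foreach loop
# over the same in-memory string array. Absolute numbers will vary;
# the relative gap is what matters.
$data = 1..100000 | ForEach-Object { "line $_" }

$piped = Measure-Command {
    $hits = $data | Where-Object { $_ -like "*999*" }
}
$looped = Measure-Command {
    $hits = foreach ($d in $data) { if ($d -like "*999*") { $d } }
}
"Pipeline: $($piped.TotalMilliseconds) ms"
"Loop:     $($looped.TotalMilliseconds) ms"
```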

Thank you all for your comments, help, recommendations.

Here is the result code:

Clear-Host

try {
    
    $ParseLog = {

        Param (
            # File to parse
            [string]$targetFile,
            # Search string
            [string]$searchValue
        )

        try {
            # Length of search string. 
            $svLength = $searchValue.Length            
            # Date time format found within file name.
            $fileNameDateFormat = "yyyyMMddTHHmmss"
            # Init reader object.
            $reader = New-Object System.IO.StreamReader($targetFile)    
            # Get date from file name.
            $fileNameTimeLen = $fileNameDateFormat.Length
            # Extract the date time value from the file name. 
            $fileNameTime = $targetFile.Substring($targetFile.Length - 2 - $fileNameTimeLen, $fileNameTimeLen)
            # Convert to usable DateTime object.
            $fileDateConverted = ([System.DateTime]::ParseExact(`
                $fileNameTime, `
                $fileNameDateFormat, `
                [System.Globalization.CultureInfo]::InvariantCulture))
            # Change format of extracted file name date.
            $fileDateConverted = $fileDateConverted.Date.ToString("yyyy-MM-dd")
            # StringBuilder for storing row values.
            $rowBuffer = [System.Text.StringBuilder]::new()
            # Iterate through files.
            while ($null -ne ($line = $reader.ReadLine())) {
                # Validate line.
                if ($line -Match $searchValue) {       
                    # Calc position of relevant data.
                    $pos = $line.IndexOf($searchValue) + $svLength
                    # Actual length of relevant data.
                    $relLength = $line.Length - $pos
                    # Extract relevant data.
                    $result = $line.Substring($pos, $relLength).Trim().Split(';')  
                    # Check if line is empty.
                    if (($result[1..8] -join "").Trim() -ne "") {
                        # Get timestamp from line (position and length is fixed).
                        $timeValue = $line.Substring(12, 8)
                        # Combine date from file name with time.
                        $dateString = "$fileDateConverted $timeValue"
                        # Format timestamp.
                        $timeStamp = Get-Date $dateString -Format "yyyy-MM-dd HH:mm:ss"
                        # Format last result.
                        $result[8] = $result[8] -Replace "^""|""$" -Replace "\\xc3\\xbc", "ue"
                        # Create CSV row.
                        [void] $rowBuffer.AppendLine("$timeStamp;$($result[1]);$($result[2]);" `
                            + "$($result[3]);$($result[4]);$($result[5]);" `
                            + "$($result[6]);$($result[7]);$($result[8])")          
                    }
                }
            }
            # Close input.
            [void] $reader.Close()
            # Free input memory.
            [void] $reader.Dispose()
            # Return the parser result.
            return $rowBuffer.ToString()
        }   
        catch {
            # Free memory
            if ($rowBuffer -ne $null) {
                [void] $rowBuffer.Clear()
            }
            # Free StreamReader memory.
            if ($reader -ne $null) {
                [void] $reader.Close()
                [void] $reader.Dispose()
            }
            # Return error for debugging.
            #return $Error[0].Exception
        }
    }

    # StopWatch object to measure performance.
    $sp = [System.Diagnostics.Stopwatch]::StartNew()
    # Root path of log directories
    $basePath = "\\some\network\path"
    # Test directories
    $directories = @("dir0", "dir1")
    # Text we need to look for within the log files.
    $search = "Custom Log Entry: "
    # Cache length of search as we need it quite often.
    $searchLength = $search.Length
    # Test start date
    $Start = Get-Date -Date "01.07.2021" 
    # Test end date
    $End = Get-Date -Date "11.10.2021"
    # List object storing file paths.
    $fileList = New-Object System.Collections.Generic.List[string]
    # List object storing directory paths. 
    $dirList = New-Object System.Collections.Generic.List[string]

    # Collect all directories and sub-directories.
    foreach ($dir in $directories) {    
        # Combine directories
        $path = "$basePath\$dir\subdir"
        # Get directory info.
        $dirInfo = New-Object System.IO.DirectoryInfo($path) 
        # iterate through sub-directories.
        foreach ($subDir in $dirInfo.GetDirectories()) {
            # Cache creation time.
            $ct = $subDir.CreationTime.Date
            # Check if folder meets requirements.
            if ($ct -ge $Start.Date -and $ct -le $End.Date) {
                $dirList.Add($subDir.FullName)
            }        
        }
    }
    $sp.Stop()
    Write-Host "Received all directories after: $($sp.Elapsed)"

    $sp.Restart()
    # Temp array for files.
    $files = @()
    # Iterate through files in folder list.
    foreach ($dir in $dirList) {
        # Get directory info.
        $dirInfo = New-Object System.IO.DirectoryInfo($dir)
        # Get all log files.
        $files = $dirInfo.GetFiles("*.log", [System.IO.SearchOption]::AllDirectories)
        # Iterate through files.
        foreach ($file in $files) {
            # Cache last write time.
            $lwt = $file.LastWriteTime
            # Check if file is applicable.
            if ($lwt -ge $Start.Date -and $lwt -le $End.Date) {
                $fileList.Add($file.FullName)
            }
        }    
    }
    $sp.Stop()
    Write-Host "Received all files after: $($sp.Elapsed)"

    # CSV header string.
    $header = "h0;h1;h2;h3;h4;h5;h6;h7;h8"
    
    # Use a threadsafe collection object to store thread results. 
    $threadSafeBuffer = [System.Collections.Concurrent.ConcurrentBag[string]]::new()
    # Create new thread pool.
    $parserThreadPool = [RunspaceFactory]::CreateRunspacePool(1, [int]$env:NUMBER_OF_PROCESSORS)
    # Set multithreaded apartment state.
    $parserThreadPool.ApartmentState = "MTA"
    $parserThreadPool.Open()
    $parserThreads = @()

    $sp.Restart()

    # Iterate through files.
    foreach ($file in $fileList) {     
        # Create thread for file.
        $parserThread = [PowerShell]::Create()
        # Load parser script.
        [void] $parserThread.AddScript($ParseLog)
        # Pass parameters.
        [void] $parserThread.AddArgument($file) 
        [void] $parserThread.AddArgument($search) 
        # Add to thread pool
        $parserThread.RunspacePool = $parserThreadPool
        # Thread callback
        $parserThreads += [PSCustomObject]@{ `
            Thread = $parserThread; `
            Status = $parserThread.BeginInvoke() `
        }                
    }

    # Wait for threads to finish.
    while ($parserThreads.Status.IsCompleted -contains $false) {}
    # Get thread results and dispose them.
    foreach ($thread in $parserThreads) {                
        [void] $threadSafeBuffer.TryAdd($thread.Thread.EndInvoke($thread.Status))
        $thread.Thread.Dispose()
    }
    $sp.Stop()
    Write-Host "Finished parsing after: $($sp.Elapsed)"

    $sp.Restart()
    # Create stream and writer for output.
    $stream = New-Object System.IO.FileStream( `
        "c:\users\anon\documents\parser_result\result.csv", `
        [System.IO.FileMode]::Create, `
        [System.IO.FileAccess]::Write, `
        [System.IO.FileShare]::Read)
    $writer = New-Object System.IO.StreamWriter($stream)
    
    # Write results into file.         
    foreach ($item in $threadSafeBuffer) {
        $writer.WriteLine($item)
    }

    # Free memory.
    $threadSafeBuffer = $null
    $writer.Close()
    $writer.Dispose()
    $stream.Close()
    $stream.Dispose()
    $parserThreadPool.Close()
    $parserThreadPool.Dispose()

    $sp.Stop()
    Write-Host "Finished writing parser results after $($sp.Elapsed)"
} 
catch {

    $threadSafeBuffer = $null

    if ($parserThreadPool -ne $null) {
        $parserThreadPool.Close() 
        $parserThreadPool.Dispose() 
    }

    if ($writer -ne $null) {
        $writer.Close()
        $writer.Dispose()
    }   
    
    if ($stream -ne $null) {     
        $stream.Close()
        $stream.Dispose()
    }
}