
I have a data set of 36 .log files that I need to preprocess in order to load them into a pandas DataFrame for data visualization with Python frameworks.

To give an example of a single line from one of the .log files:

[16:24:42]: Downloaded 0 Z_SYSTEM_FM traces from DEH, clients (282) from 00:00:00,000 to 00:00:00,000 

From several sources and posts on here, I figured the following code to be the best-performing one:

foreach ($f in $files){

    $date = $f.BaseName.Substring(22,8)

    ((Get-Content $f) -match "^.*\bDownloaded\b.*$") -replace "[[]", "" -replace "]:\s", " " `
        -replace "Downloaded " -replace "Traces from " -replace ",.*" -replace "$", " $date" |
        Add-Content CleanedLogs.txt

}

The variable $date contains the date that the respective .log file covers.

I am not able to change the input text data. I tried to read in the 1.55 GB using -Raw, but I couldn't manage to split up the resulting single string after applying all the operations. Additionally, I tried to use more regex expressions, but that brought no reduction in total runtime. Maybe there is a way to use grep for these operations?
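(For what it's worth, a single multi-line string read with -Raw can be split back into lines with the -split operator; this is only a minimal sketch, and `$f` is assumed to be a file from the loop above:)

```powershell
# Read the whole file as one string, then split it back into lines.
# The regex '\r?\n' matches both CRLF and LF line endings.
$text  = Get-Content $f -Raw
$lines = $text -split '\r?\n'
```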

Maybe someone has a genius tweak to speed up this operation. At the moment it takes close to 20 minutes to run. Thank you very much!

Mike_H
  • Is there a solid single regex that includes all -match and -replace operations? I tried it for 2 hours, but couldn't figure out how to do it. I will try your suggestions for read and write! – Mike_H Apr 02 '19 at 07:35
  • Re regexes: probably not; you need at least 1 `-match` to select only lines of interest, and then start replacing (`-replace` doesn't filter, it passes lines that don't match through). You can at least consolidate all those `-replace` operations that _remove_ strings into one. – mklement0 Apr 02 '19 at 07:43
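(For illustration, the consolidation suggested in the comment above might look like the sketch below: the three removal-type replacements collapse into one alternation, and the trailing `,.*` replacement appends the date in the same pass. The exact patterns are an assumption based on the sample line shown in the question.)

```powershell
# One -match to select lines of interest, then a single -replace whose
# alternation removes '[', 'Downloaded ' and 'Traces from ' in one pass.
((Get-Content $f) -match '\bDownloaded\b') `
    -replace '\[|Downloaded |Traces from ', '' `
    -replace '\]:\s', ' ' `
    -replace ',.*', " $date" |
    Add-Content CleanedLogs.txt
```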

3 Answers


The key to better performance is:

  • Avoid use of the pipeline and cmdlets, in particular for file I/O (Get-Content, Add-Content)
  • Avoid looping in PowerShell code.
    • Instead, chain array-aware operators such as -match and -replace - which you're already doing.
    • Consolidate your regexes to make fewer -replace calls.
    • Use precompiled regexes.

To put it all together:

# Create precompiled regexes.
# Note: As written, they make the matching that -replace performs
#       case-*sensitive* (and culture-sensitive), 
#       which speeds things up slightly.
#       If you need case-*insensitive* matching, use option argument
#       'Compiled, IgnoreCase' instead.
$reMatch    = New-Object regex '\bDownloaded\b', 'Compiled'
$reReplace1 = New-Object regex 'Downloaded |Traces from |\[', 'Compiled'
$reReplace2 = New-Object regex '\]:\s', 'Compiled'
$reReplace3 = New-Object regex ',.*', 'Compiled'

# The platform-appropriate newline sequence.
$nl = [Environment]::NewLine

foreach ($f in $files) {

  $date = $f.BaseName.Substring(22,8)

  # Read all lines into an array, filter and replace, then join the
  # resulting lines with newlines and append the resulting single string
  # to the log file.
  [IO.File]::AppendAllText($PWD.ProviderPath + '/CleanedLogs.txt',
    ([IO.File]::ReadAllLines($f.FullName) -match
      $reMatch -replace 
        $reReplace1 -replace 
          $reReplace2, ' ' -replace 
            $reReplace3, " $date" -join 
              $nl) + $nl
  )

}

Note that each file must fit into memory as a whole, as an array of lines, plus a portion of it (both as an array and as a single multi-line string) whose size depends on how many lines pass the filter.

mklement0
  • I've noticed massive improvements with system.IO.file methods. Does precompiling regex make similar gains? If the case sensitivity is causing it to slow, could a -creplace substitute for precompiling? – Blaisem Oct 26 '21 at 15:38
  • 1
    @Blaisem, I'm not sure how much you gain by precompiling; a certain number regexes are cached by default, both by PowerShell (1000, fixed) and .NET (15, configurable), but such automatically cached ones use a _higher-level_ compiled representation than explicit precompiling, which creates MSIL that the JITTer can compile to native code. With letter-heavy regexes, using `-creplace` probably does help - again, not sure how much. – mklement0 Oct 26 '21 at 17:34

I had a similar problem in the past. Long story short, using .NET directly is much faster when working with large files. You can learn more by reading about performance considerations.

The fastest way probably would be by using IO.FileStream. For example:

$File = "C:\Path_To_File\Logs.txt"
$FileToSave = "C:\Path_To_File\result.txt"
$Stream = New-Object -TypeName IO.FileStream -ArgumentList ($File), ([System.IO.FileMode]::Open), ([System.IO.FileAccess]::Read), ([System.IO.FileShare]::ReadWrite)
$Reader = New-Object -TypeName System.IO.StreamReader -ArgumentList ($Stream, [System.Text.Encoding]::ASCII, $true)
$Writer = New-Object -TypeName System.IO.StreamWriter -ArgumentList ($FileToSave)
while (!$Reader.EndOfStream)
{
    $Box = $Reader.ReadLine()
    if($Box -match "^.*\bDownloaded\b.*$")
    {
        # Example replacements only - adapt the patterns to your own data.
        $ReplaceLine = $Box -replace "1", "1234" -replace "[[]", ""
        $Writer.WriteLine($ReplaceLine)
    }
}
$Reader.Close()
$Writer.Close()
$Stream.Close()

You should be able to edit the code above for your needs pretty easily. For getting the list of files, you can use Get-ChildItem.
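(A minimal sketch of collecting the input files; the root folder path is an assumption, so adjust it to your layout:)

```powershell
# Recursively collect all .log files under the (assumed) root folder;
# -File excludes directories from the result.
$files = Get-ChildItem -Path 'C:\Path_To_Logs' -Filter '*.log' -File -Recurse
```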

I also advise you to read this Stack Overflow post.

nemze
  • I get an error on ```$Box = $Reader.ReadLine()```: ```It is not possible to call a method for an expression that is NULL In C:\Users\jmoecke\PycharmProjects\TraceDashboardV3\Stackoverflow2.ps1:9 Characters:9 + $Box = $Reader.ReadLine() + ~~~~~~~~~~~~~~~~~~~~~~~~~ + CategoryInfo : InvalidOperation: (:) [], RuntimeException + FullyQualifiedErrorId : InvokeMethodOnNull```. Do you have an idea what could be wrong? – Mike_H Apr 02 '19 at 11:27
  • Code is working fine for me. Try `$Reader` without `Encoding` parameter, like: `$Reader = New-Object -TypeName System.IO.StreamReader -ArgumentList $Stream`. What changes did you make in your code? – nemze Apr 02 '19 at 12:03

Perhaps this will speed up things for you:

$outFile = Join-Path -Path $PSScriptRoot -ChildPath 'CleanedLogs.txt'
$files   = Get-ChildItem -Path '<YOUR ROOTFOLDER>' -Filter '*.txt' -File
foreach ($f in $files){
    $date = $f.BaseName.Substring(22,8)
    [string[]]$lines = ([System.IO.File]::ReadAllLines($f.FullName) | Where-Object {$_ -match '^.*\bDownloaded\b.*$'} | ForEach-Object {
        ($_ -replace '\[|Downloaded|Traces from|,.*', '' -replace ']:\s', ' ' -replace '\s+', ' ') + " $date"
    })
    [System.IO.File]::AppendAllLines($outFile, $lines)
}
Theo