0

I am trying to parse a very large log file that consists of space delimited text across about 16 fields. Unfortunately the app logs a blank line in between each legitimate one (arbitrarily doubling the lines I must process). It also causes fields to shift because it uses space as both a delineator as well as for empty fields. I couldn't get around this in LogParser. Fortunately Powershell affords me the option to reference fields from the end as well making it easier to get later fields affected by shift.

After a bit of testing with smaller sample files, I've determined that processing line by line as the file is streaming with Get-Content natively is slower than just reading the file completely using Get-Content -ReadCount 0 and then processing from memory. This part is relatively fast (<1min).

The problem comes when processing each line, even though it's in memory. It is taking hours for a 75MB file with 561178 legitimate lines of data (minus all the blank lines).

I'm not doing much in the code itself. I'm doing the following:

  • Splitting line via space as delimiter
  • One of the fields is an IP address that I am reverse DNS resolving, which is obviously going to be slow. So I have wrapped this into more code to create an in-memory arraylist cache of previously resolved IPs and pulling from it when possible. The IPs are largely the same so after a few hundred lines, resolution shouldn't be an issue any longer.
  • Saving the needed array elements into my pscustomobject
  • Adding pscustomobject to arraylist to be used later.
  • During the loop I'm tracking how many lines I've processed and outputting that info in a progress bar (I know this adds extra time but not sure how much). I really want to know progress.

All in all, it's processing some 30-40 lines per second, but obviously this is not very fast.

Can someone offer alternative methods/objectTypes to accomplish my goals and speed this up tremendously?

Below are some samples of the log with the field shift (Note this is a Windows DNS Debug log) as well as the code below that.

10/31/2022 12:38:45 PM 2D00 PACKET  000000B25A583FE0 UDP Snd 127.0.0.1      6c94 R Q [8385 A DR NXDOMAIN] AAAA   (4)pool(3)ntp(3)org(0)

10/31/2022 12:38:45 PM 2D00 PACKET  000000B25A582050 UDP Snd 127.0.0.1    3d9d R Q [8081   DR  NOERROR] A      (4)pool(3)ntp(3)org(0)

NOTE:  the issue in this case being [8385 A DR NXDOMAIN] (4 fields) vs [8081   DR  NOERROR] (3 fields)

Other examples would be the "R Q" where sometimes it's "  Q".

$Logfile = "C:\Temp\log.txt"

[System.Collections.ArrayList]$LogEntries = @()
[System.Collections.ArrayList]$DNSCache = @()

# Initialize log iteration counter
$i = 1

# Get Log data.  Read entire log into memory and save only lines that begin with a date (ignoring blank lines)
$LogData = Get-Content $Logfile -ReadCount 0 | % {$_ | ? {$_ -match "^\d+\/"}}
$LogDataTotalLines = $LogData.Length

# Process each log entry
$LogData | ForEach-Object {
        
        $PercentComplete =  [math]::Round(($i/$LogDataTotalLines * 100))
        Write-Progress -Activity "Processing log file . . ." -Status "Processed $i of $LogDataTotalLines entries ($PercentComplete%)" -PercentComplete $PercentComplete
        
        # Split line using space, including sequential spaces, as delimiter.  
        # NOTE:  Due to how app logs events, some fields may be blank leading split yielding different number of columns.  Fortunately the fields we desire
        #          are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
        $temp = $_ -Split '\s+'
        
        # Resolve DNS name of IP address for later use and cache into arraylist to avoid DNS lookup for same IP as we loop through log
        If ($DNSCache.IP -notcontains $temp[8]) {
            $DNSEntry = [PSCustomObject]@{
                IP = $temp[8]
                DNSName = Resolve-DNSName $temp[8] -QuickTimeout -DNSOnly -ErrorAction SilentlyContinue | Select -ExpandProperty NameHost
            }
            
            # Add DNSEntry to DNSCache collection
            $DNSCache.Add($DNSEntry) | Out-Null
            
            # Set resolved DNS name to that which came back from Resolve-DNSName cmdlet. NOTE:  value could be blank.
            $ResolvedDNSName = $DNSEntry.DNSName
        } Else {
            # DNSCache contains resolved IP already.  Find and Use it.
            $ResolvedDNSName = ($DNSCache | ? {$_.IP -eq $temp[8]}).DNSName
        }

        $LogEntry = [PSCustomObject]@{
            Datetime = $temp[0] + " " + $temp[1] + " " + $temp[2]          # Combines first 3 fields Date, Time, AM/PM
            ClientIP = $temp[8]
            ClientDNSName = $ResolvedDNSName
            QueryType = $temp[-2]       # Second to last entry of array
            QueryName = ($temp[-1] -Replace "\(\d+\)",".") -Replace "^\.",""        # Last entry of array.  Replace any "(#)" characters with period and remove first period for friendly name
        }
        
        # Add LogEntry to LogEntries collection
        $LogEntries.Add($LogEntry) | Out-Null
       $i++
    }
  • 1
    Would it be possible for you to include part of such a file (sanitized) in your question so we have something to work with? – Theo Nov 03 '22 at 21:40
  • One suggestion would be to preprocess the file to remove the blank lines using PowerShell and write out a "massaged" version of the file that would be more suitable for use with Log Parser. – Bill_Stewart Nov 03 '22 at 22:03
  • I also would like to see some lines that have the problems you describe such as fields shifting. But I have a question, is this log file mostly the same with lines being added? Some time ago I figured out how to grab the end of a file and strip out the last 3 lines. Why couldn't you record the size of the file with each run of the script and then on the next run read only the new data? – Darin Nov 03 '22 at 22:04
  • Without an example of your log is hard to help – Santiago Squarzon Nov 03 '22 at 22:17
  • Perhaps the benchmarks in the bottom section of [this answer](https://stackoverflow.com/a/48915223/45375) provide pointers. – mklement0 Nov 04 '22 at 02:45
  • I usually use StreamReader. I usually skip lines with zero length, but in following code there were no blank lines. : https://stackoverflow.com/questions/74299947/how-to-assign-lines-from-a-text-file-to-different-kind-of-variables/74301075#74301075 – jdweng Nov 04 '22 at 06:37
  • See edits in OP for log sample and explanation of shift. @Bill_Stewart, reading the file and excluding the blank lines is already fast in <1min. The loop is only processing the lines that matter. – Matthew McDonald Nov 04 '22 at 16:16
  • @Darin yes, this is an ever appending log file, that eventually purges older lines as it's set with a max size – Matthew McDonald Nov 04 '22 at 16:16
  • Yep, that's exactly my point. – Bill_Stewart Nov 04 '22 at 18:59
  • @Bill_Stewart I think I misunderstood your original post. You're suggesting changing the actual file so that log parser can parse it correctly. The problem here is not the blank/empty lines. It's the ever shifting lines due to delineation. Me re-writing the file to account for that (re-delineate) would take much longer, unless I'm still misunderstanding you. – Matthew McDonald Nov 04 '22 at 19:26

1 Answers1

2

Here is a more optimized version you can try.

What changed?:

  • Removed Write-Progress, especially because it's not known if Windows PowerShell is used. PowerShell versions below 6 have a big performance impact with Write-Progress
  • Changed $DNSCache to Generic Dictionary for fast lookups
  • Changed $LogEntries to Generic List
  • Switched from Get-Content to switch -Regex -File
$Logfile = 'C:\Temp\log.txt'

$LogEntries = [System.Collections.Generic.List[psobject]]::new()
$DNSCache   = [System.Collections.Generic.Dictionary[string, psobject]]::new([System.StringComparer]::OrdinalIgnoreCase)

# Process each log entry
switch -Regex -File ($Logfile) {
    '^\d+\/' {
        # Split line using space, including sequential spaces, as delimiter.  
        # NOTE:  Due to how app logs events, some fields may be blank leading split yielding different number of columns.  Fortunately the fields we desire
        #          are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
        $temp = $_ -Split '\s+'
        $ip   = [string] $temp[8]

        $resolvedDNSRecord = $DNSCache[$ip]

        if ($null -eq $resolvedDNSRecord) {
            $resolvedDNSRecord = [PSCustomObject]@{
                IP      = $ip 
                DNSName = Resolve-DnsName $ip -QuickTimeout -DnsOnly -ErrorAction Ignore | select -ExpandProperty NameHost
            }
            $DNSCache[$ip] = $resolvedDNSRecord
        }

        $LogEntry = [PSCustomObject]@{
            Datetime      = $temp[0] + ' ' + $temp[1] + ' ' + $temp[2]          # Combines first 3 fields Date, Time, AM/PM
            ClientIP      = $ip
            ClientDNSName = $resolvedDNSRecord.DNSName
            QueryType     = $temp[-2]       # Second to last entry of array
            QueryName     = ($temp[-1] -Replace '\(\d+\)', '.') -Replace '^\.', ''        # Last entry of array.  Replace any "(#)" characters with period and remove first period for friendly name
        }

        # Add LogEntry to LogEntries collection
        $LogEntries.Add($LogEntry)
    }
}

If it's still slow, there is still the option to use Start-ThreadJob as a multithreading approach with chunked lines (like 10000 per job).

swbbl
  • 814
  • 1
  • 4
  • 10
  • I'll give this a try. Thanks. I was thinking about a multithreaded approach as a possible solution if I could not speed the single threaded operation. FWIW, I am using WinPosh ISE for the last run. Are you saying Powershell 6 works much better with Write-Progress? – Matthew McDonald Nov 04 '22 at 16:06
  • 1
    @MatthewMcDonald Yes. Please install the most recent version of PowerShell (currently 7.2.7) if possible and try it again. Windows PowerShell (5-) is deprecated and only receives security updates. – swbbl Nov 04 '22 at 17:09
  • Ah. I do have 7.2.6 but wasn't using that as I was using the ISE so I could debug my code along the way. I don't think 7 has an ISE, no? – Matthew McDonald Nov 04 '22 at 19:28
  • OK wow. I tried your code witn Pwsh7 and it takes only 1-2 minutes from start to finish. That's an astronomical improvement! With that said, as I did not understand how there could be such an improvement, I decided to test my original code, just without Write-Progress and within Pwsh7, and it is just as fast as yours. So I guess my entire issue was Write-Progress??? – Matthew McDonald Nov 04 '22 at 20:38
  • FWIW, I added the write-progress back with yours, and it only added <30 seconds additional time. Crazy. I noticed in Pwsh7 the write-progress would drastically skip numbers on display, as a result of it processing so fast, but it seems the write-progress in WinPosh literally displayed every number in sequential order. Guess that's where the improvements came in Pwsh7. – Matthew McDonald Nov 04 '22 at 20:38
  • 1
    @MatthewMcDonald Yes, seems `Write-Progress` was the main cause of it. Instead of ISE (which is also not maintained anymore) you should switch to `vscode` with the PowerShell extension (there is also an ISE mode/theme available). You can find the setup guide here:: https://learn.microsoft.com/en-us/powershell/scripting/dev-cross-plat/vscode/using-vscode – swbbl Nov 07 '22 at 05:09
  • appreciate the assist. – Matthew McDonald Nov 07 '22 at 15:03