How to speed up processing of ~million lines of text in log file

Question

I am trying to parse a very large log file that consists of space delimited text across about 16 fields. Unfortunately the app logs a blank line in between each legitimate one (arbitrarily doubling the lines I must process). It also causes fields to shift because it uses space as both a delineator as well as for empty fields. I couldn't get around this in LogParser. Fortunately Powershell affords me the option to reference fields from the end as well making it easier to get later fields affected by shift.

After a bit of testing with smaller sample files, I've determined that processing line by line as the file is streaming with Get-Content natively is slower than just reading the file completely using Get-Content -ReadCount 0 and then processing from memory. This part is relatively fast (<1min).

The problem comes when processing each line, even though it's in memory. It is taking hours for a 75MB file with 561178 legitimate lines of data (minus all the blank lines).

I'm not doing much in the code itself. I'm doing the following:

Splitting line via space as delimiter
One of the fields is an IP address that I am reverse DNS resolving, which is obviously going to be slow. So I have wrapped this into more code to create an in-memory arraylist cache of previously resolved IPs and pulling from it when possible. The IPs are largely the same so after a few hundred lines, resolution shouldn't be an issue any longer.
Saving the needed array elements into my pscustomobject
Adding pscustomobject to arraylist to be used later.
During the loop I'm tracking how many lines I've processed and outputting that info in a progress bar (I know this adds extra time but not sure how much). I really want to know progress.

All in all, it's processing some 30-40 lines per second, but obviously this is not very fast.

Can someone offer alternative methods/objectTypes to accomplish my goals and speed this up tremendously?

Below are some samples of the log with the field shift (Note this is a Windows DNS Debug log) as well as the code below that.

10/31/2022 12:38:45 PM 2D00 PACKET  000000B25A583FE0 UDP Snd 127.0.0.1      6c94 R Q [8385 A DR NXDOMAIN] AAAA   (4)pool(3)ntp(3)org(0)

10/31/2022 12:38:45 PM 2D00 PACKET  000000B25A582050 UDP Snd 127.0.0.1    3d9d R Q [8081   DR  NOERROR] A      (4)pool(3)ntp(3)org(0)

NOTE:  the issue in this case being [8385 A DR NXDOMAIN] (4 fields) vs [8081   DR  NOERROR] (3 fields)

Other examples would be the "R Q" where sometimes it's "  Q".

$Logfile = "C:\Temp\log.txt"

[System.Collections.ArrayList]$LogEntries = @()
[System.Collections.ArrayList]$DNSCache = @()

# Initialize log iteration counter
$i = 1

# Get Log data.  Read entire log into memory and save only lines that begin with a date (ignoring blank lines)
$LogData = Get-Content $Logfile -ReadCount 0 | % {$_ | ? {$_ -match "^\d+\/"}}
$LogDataTotalLines = $LogData.Length

# Process each log entry
$LogData | ForEach-Object {
        
        $PercentComplete =  [math]::Round(($i/$LogDataTotalLines * 100))
        Write-Progress -Activity "Processing log file . . ." -Status "Processed $i of $LogDataTotalLines entries ($PercentComplete%)" -PercentComplete $PercentComplete
        
        # Split line using space, including sequential spaces, as delimiter.  
        # NOTE:  Due to how app logs events, some fields may be blank leading split yielding different number of columns.  Fortunately the fields we desire
        #          are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
        $temp = $_ -Split '\s+'
        
        # Resolve DNS name of IP address for later use and cache into arraylist to avoid DNS lookup for same IP as we loop through log
        If ($DNSCache.IP -notcontains $temp[8]) {
            $DNSEntry = [PSCustomObject]@{
                IP = $temp[8]
                DNSName = Resolve-DNSName $temp[8] -QuickTimeout -DNSOnly -ErrorAction SilentlyContinue | Select -ExpandProperty NameHost
            }
            
            # Add DNSEntry to DNSCache collection
            $DNSCache.Add($DNSEntry) | Out-Null
            
            # Set resolved DNS name to that which came back from Resolve-DNSName cmdlet. NOTE:  value could be blank.
            $ResolvedDNSName = $DNSEntry.DNSName
        } Else {
            # DNSCache contains resolved IP already.  Find and Use it.
            $ResolvedDNSName = ($DNSCache | ? {$_.IP -eq $temp[8]}).DNSName
        }

        $LogEntry = [PSCustomObject]@{
            Datetime = $temp[0] + " " + $temp[1] + " " + $temp[2]          # Combines first 3 fields Date, Time, AM/PM
            ClientIP = $temp[8]
            ClientDNSName = $ResolvedDNSName
            QueryType = $temp[-2]       # Second to last entry of array
            QueryName = ($temp[-1] -Replace "\(\d+\)",".") -Replace "^\.",""        # Last entry of array.  Replace any "(#)" characters with period and remove first period for friendly name
        }
        
        # Add LogEntry to LogEntries collection
        $LogEntries.Add($LogEntry) | Out-Null
       $i++
    }

Would it be possible for you to include part of such a file (sanitized) in your question so we have something to work with? — Theo, Nov 03 '22 at 21:40
One suggestion would be to preprocess the file to remove the blank lines using PowerShell and write out a "massaged" version of the file that would be more suitable for use with Log Parser. — Bill_Stewart, Nov 03 '22 at 22:03
I also would like to see some lines that have the problems you describe such as fields shifting. But I have a question, is this log file mostly the same with lines being added? Some time ago I figured out how to grab the end of a file and strip out the last 3 lines. Why couldn't you record the size of the file with each run of the script and then on the next run read only the new data? — Darin, Nov 03 '22 at 22:04
Perhaps the benchmarks in the bottom section of [this answer](https://stackoverflow.com/a/48915223/45375) provide pointers. — mklement0, Nov 04 '22 at 02:45
I usually use StreamReader. I usually skip lines with zero length, but in following code there were no blank lines. : https://stackoverflow.com/questions/74299947/how-to-assign-lines-from-a-text-file-to-different-kind-of-variables/74301075#74301075 — jdweng, Nov 04 '22 at 06:37
See edits in OP for log sample and explanation of shift. @Bill_Stewart, reading the file and excluding the blank lines is already fast in <1min. The loop is only processing the lines that matter. — Matthew McDonald, Nov 04 '22 at 16:16
@Darin yes, this is an ever appending log file, that eventually purges older lines as it's set with a max size — Matthew McDonald, Nov 04 '22 at 16:16
@Bill_Stewart I think I misunderstood your original post. You're suggesting changing the actual file so that log parser can parse it correctly. The problem here is not the blank/empty lines. It's the ever shifting lines due to delineation. Me re-writing the file to account for that (re-delineate) would take much longer, unless I'm still misunderstanding you. — Matthew McDonald, Nov 04 '22 at 19:26

score 2 · Accepted Answer · answered Nov 04 '22 at 06:24

Here is a more optimized version you can try.

What changed?:

Removed Write-Progress, especially because it's not known if Windows PowerShell is used. PowerShell versions below 6 have a big performance impact with Write-Progress
Changed $DNSCache to Generic Dictionary for fast lookups
Changed $LogEntries to Generic List
Switched from Get-Content to switch -Regex -File

$Logfile = 'C:\Temp\log.txt'

$LogEntries = [System.Collections.Generic.List[psobject]]::new()
$DNSCache   = [System.Collections.Generic.Dictionary[string, psobject]]::new([System.StringComparer]::OrdinalIgnoreCase)

# Process each log entry
switch -Regex -File ($Logfile) {
    '^\d+\/' {
        # Split line using space, including sequential spaces, as delimiter.  
        # NOTE:  Due to how app logs events, some fields may be blank leading split yielding different number of columns.  Fortunately the fields we desire
        #          are in static positions not affected by this, except for the last 2, which can be referenced backwards with -2 and -1.
        $temp = $_ -Split '\s+'
        $ip   = [string] $temp[8]

        $resolvedDNSRecord = $DNSCache[$ip]

        if ($null -eq $resolvedDNSRecord) {
            $resolvedDNSRecord = [PSCustomObject]@{
                IP      = $ip 
                DNSName = Resolve-DnsName $ip -QuickTimeout -DnsOnly -ErrorAction Ignore | select -ExpandProperty NameHost
            }
            $DNSCache[$ip] = $resolvedDNSRecord
        }

        $LogEntry = [PSCustomObject]@{
            Datetime      = $temp[0] + ' ' + $temp[1] + ' ' + $temp[2]          # Combines first 3 fields Date, Time, AM/PM
            ClientIP      = $ip
            ClientDNSName = $resolvedDNSRecord.DNSName
            QueryType     = $temp[-2]       # Second to last entry of array
            QueryName     = ($temp[-1] -Replace '\(\d+\)', '.') -Replace '^\.', ''        # Last entry of array.  Replace any "(#)" characters with period and remove first period for friendly name
        }

        # Add LogEntry to LogEntries collection
        $LogEntries.Add($LogEntry)
    }
}

If it's still slow, there is still the option to use Start-ThreadJob as a multithreading approach with chunked lines (like 10000 per job).

I'll give this a try. Thanks. I was thinking about a multithreaded approach as a possible solution if I could not speed the single threaded operation. FWIW, I am using WinPosh ISE for the last run. Are you saying Powershell 6 works much better with Write-Progress? — Matthew McDonald, Nov 04 '22 at 16:06
@MatthewMcDonald Yes. Please install the most recent version of PowerShell (currently 7.2.7) if possible and try it again. Windows PowerShell (5-) is deprecated and only receives security updates. — swbbl, Nov 04 '22 at 17:09
Ah. I do have 7.2.6 but wasn't using that as I was using the ISE so I could debug my code along the way. I don't think 7 has an ISE, no? — Matthew McDonald, Nov 04 '22 at 19:28
OK wow. I tried your code witn Pwsh7 and it takes only 1-2 minutes from start to finish. That's an astronomical improvement! With that said, as I did not understand how there could be such an improvement, I decided to test my original code, just without Write-Progress and within Pwsh7, and it is just as fast as yours. So I guess my entire issue was Write-Progress??? — Matthew McDonald, Nov 04 '22 at 20:38
FWIW, I added the write-progress back with yours, and it only added <30 seconds additional time. Crazy. I noticed in Pwsh7 the write-progress would drastically skip numbers on display, as a result of it processing so fast, but it seems the write-progress in WinPosh literally displayed every number in sequential order. Guess that's where the improvements came in Pwsh7. — Matthew McDonald, Nov 04 '22 at 20:38
@MatthewMcDonald Yes, seems `Write-Progress` was the main cause of it. Instead of ISE (which is also not maintained anymore) you should switch to `vscode` with the PowerShell extension (there is also an ISE mode/theme available). You can find the setup guide here:: https://learn.microsoft.com/en-us/powershell/scripting/dev-cross-plat/vscode/using-vscode — swbbl, Nov 07 '22 at 05:09

How to speed up processing of ~million lines of text in log file

1 Answers1