1

I'm attempting to work with large text files (500 MB - 2+ GB) that contain multi-line events and send them out via syslog. The script I have so far works well for quite a while, but eventually it causes ISE (64-bit) to stop responding and use up all system memory.

I'm also curious whether there's a way to improve the speed, as the current script only sends about 300 events per second to syslog.

Example Data

START--random stuff here 
more random stuff on this new line 
more stuff and things 
START--some random things 
additional random things 
blah blah 
START--data data more data 
START--things 
blah data

Code

Function SendSyslogEvent {

    $Server = '1.1.1.1'
    $Message = $global:Event
    #0=EMERG 1=Alert 2=CRIT 3=ERR 4=WARNING 5=NOTICE  6=INFO  7=DEBUG
    $Severity = '10'
    #(16-23)=LOCAL0-LOCAL7
    $Facility = '22'
    $Hostname= 'ServerSyslogEvents'
    # Create a UDP Client Object
    $UDPCLient = New-Object System.Net.Sockets.UdpClient
    $UDPCLient.Connect($Server, 514)
    # Calculate the priority
    $Priority = ([int]$Facility * 8) + [int]$Severity
    #Time format the SW syslog understands
    $Timestamp = Get-Date -Format "MMM dd HH:mm:ss"
    # Assemble the full syslog formatted message
    $FullSyslogMessage = "<{0}>{1} {2} {3}" -f $Priority, $Timestamp, $Hostname, $Message
    # create an ASCII Encoding object
    $Encoding = [System.Text.Encoding]::ASCII
    # Convert into byte array representation
    $ByteSyslogMessage = $Encoding.GetBytes($FullSyslogMessage)
    # Send the Message
    $UDPCLient.Send($ByteSyslogMessage, $ByteSyslogMessage.Length) | out-null
}

$LogFiles = Get-ChildItem -Path E:\Unzipped\

foreach ($File in $LogFiles){
    $EventCount = 0
    $global:Event = ''
    switch -Regex -File $File.fullname {
      '^START--' {  #Regex to find events
        if ($global:Event) {
            # send previous events' lines to syslog
            write-host "Send event to syslog........................."
            $EventCount ++
            SendSyslogEvent
        }
        # Current line is the start of a new event.
        $global:Event = $_
      }
      default { 
        # Event-interior line, append it.
        $global:Event += [Environment]::NewLine + $_
      }
    }
    # Process last block.
    if ($global:Event) { 
        # send last event's lines to syslog
        write-host "Send last event to syslog-------------------------"
        $EventCount ++
        SendSyslogEvent
    }
}
MrMr

2 Answers

4

There are a couple of really bad things in your script, but before we get to that, let's have a look at how you can parameterize your syslog function.

Parameterize your functions

Scriptblocks and functions in PowerShell support optionally typed parameter declarations in the aptly named param block.

For the purposes of this answer, let's focus exclusively on the only thing that ever changes when you invoke the current function, namely the message. If we turn that into a parameter, we'll end up with a function definition that looks more like this:

function Send-SyslogEvent {
    param(
        [string]$Message
    )

    $Server = '1.1.1.1'
    $Severity = '10'
    $Facility = '22'
    # ... rest of the function here
}

(I took the liberty of renaming it to PowerShell's characteristic Verb-Noun command naming convention).

There's a small performance benefit to using parameters rather than global variables, but the real benefit here is that you'll end up with clean and correct code, which will save you headaches later on.
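The same idea extends to the other hard-coded values. As a minimal sketch (Get-SyslogPriority is a hypothetical helper, not part of the original script), severity and facility could become validated parameters too. The [ValidateRange()] attribute rejects out-of-range input at call time, which would flag the severity of '10' used in the question, since the question's own comment lists severities 0 through 7:

```powershell
# Hypothetical helper (an assumption for illustration, not from the original script).
function Get-SyslogPriority {
    param(
        # 0=EMERG 1=ALERT 2=CRIT 3=ERR 4=WARNING 5=NOTICE 6=INFO 7=DEBUG
        [ValidateRange(0, 7)]
        [int]$Severity = 6,

        # 16-23 = LOCAL0-LOCAL7
        [ValidateRange(0, 23)]
        [int]$Facility = 22
    )

    # Same formula as in the question: facility * 8 + severity
    ($Facility * 8) + $Severity
}

Get-SyslogPriority -Facility 22 -Severity 6   # 182
```

Calling it with -Severity 10 fails parameter validation instead of silently producing a priority that a receiver would decode as a different facility and severity.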


IDisposables

.NET is a "managed" runtime, meaning that we don't really need to worry about resource management (allocating and freeing memory, for example), but there are a few cases where we have to manage resources that are external to the runtime, such as the network sockets used by a UdpClient object :)

Types that depend on these kinds of external resources usually implement the IDisposable interface, and the golden rule here is:

Whoever creates a new IDisposable object should also dispose of it as soon as possible, preferably no later than when exiting the scope in which it was created.

So, when you create a new instance of UDPClient inside Send-SyslogEvent, you should also ensure that you always call $UDPClient.Dispose() before returning from Send-SyslogEvent. We can do that with a set of try/finally blocks:


function Send-SyslogEvent {
    param(
        [string]$Message
    )

    $Server = '1.1.1.1'
    $Severity = '10'
    $Facility = '22'
    $Hostname= 'ServerSyslogEvents'
    try{
        $UDPCLient = New-Object System.Net.Sockets.UdpClient
        $UDPCLient.Connect($Server, 514)

        $Priority = ([int]$Facility * 8) + [int]$Severity

        $Timestamp = Get-Date -Format "MMM dd HH:mm:ss"

        $FullSyslogMessage = "<{0}>{1} {2} {3}" -f $Priority, $Timestamp, $Hostname, $Message

        $Encoding = [System.Text.Encoding]::ASCII

        $ByteSyslogMessage = $Encoding.GetBytes($FullSyslogMessage)
        $UDPCLient.Send($ByteSyslogMessage, $ByteSyslogMessage.Length) | out-null
    }
    finally {
        # this is the important part
        if($UDPCLient){
            $UDPCLient.Dispose()
        }
    }
}

Failing to dispose of IDisposable objects is one of the surest ways to leak memory and cause resource contention in the operating system you're running on, so this is definitely a must, especially for performance-sensitive or frequently invoked code.


Re-use instances!

Now, I showed above how you should handle disposal of the UDPClient, but another thing you can do is re-use the same client - you'll be connecting to the same syslog host every single time anyway!

function Send-SyslogEvent {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Message,

        [Parameter(Mandatory = $false)]
        [System.Net.Sockets.UdpClient]$Client
    )

    $Server = '1.1.1.1'
    $Severity = '10'
    $Facility = '22'
    $Hostname= 'ServerSyslogEvents'
    try{
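        # Reuse the caller's UdpClient if one was passed in; the caller is
        # expected to have called Connect() on it already.
        # (UdpClient.Available is the number of received bytes waiting to be
        # read, which is 0 for a send-only client, so it can't serve as a
        # "connected" check here.)
        $borrowedClient = $false
        if($PSBoundParameters.ContainsKey('Client') -and $null -ne $Client){
            $UDPClient = $Client
            $borrowedClient = $true
        }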
        else{
            $UDPClient = New-Object System.Net.Sockets.UdpClient
            $UDPClient.Connect($Server, 514)
        }

        $Priority = ([int]$Facility * 8) + [int]$Severity

        $Timestamp = Get-Date -Format "MMM dd HH:mm:ss"

        $FullSyslogMessage = "<{0}>{1} {2} {3}" -f $Priority, $Timestamp, $Hostname, $Message

        $Encoding = [System.Text.Encoding]::ASCII

        $ByteSyslogMessage = $Encoding.GetBytes($FullSyslogMessage)
        $UDPCLient.Send($ByteSyslogMessage, $ByteSyslogMessage.Length) | out-null
    }
    finally {
        # this is the important part
        # if we "borrowed" the client from the caller we won't dispose of it 
        if($UDPCLient -and -not $borrowedClient){
            $UDPCLient.Dispose()
        }
    }
}

This last modification will allow us to create the UDPClient once and re-use it over and over again:

# ...
$SyslogClient = New-Object System.Net.Sockets.UdpClient
$SyslogClient.Connect($SyslogServer, 514)

foreach($file in $LogFiles)
{
    # ... assign the relevant output from the logs to $message, or pass $_ directly: 
    Send-SyslogEvent -Message $message -Client $SyslogClient 
    # ...
}

Use a StreamReader instead of a switch!

Finally, if you want to minimize allocations while slurping the files, you can, for example, use File.OpenText() to create a StreamReader and read the file line by line:

# Assumption: the same syslog host that Send-SyslogEvent targets
$SyslogServer = '1.1.1.1'
$SyslogClient = New-Object System.Net.Sockets.UdpClient
$SyslogClient.Connect($SyslogServer, 514)

foreach($File in $LogFiles)
{
    try{
        $reader = [System.IO.File]::OpenText($File.FullName)

        $msg = ''

        while($null -ne ($line = $reader.ReadLine()))
        {
            if($line.StartsWith('START--'))
            {
                if($msg){
                    Send-SyslogEvent -Message $msg -Client $SyslogClient
                }
                $msg = $line
            }
            else
            {
                $msg = $msg,$line -join [System.Environment]::NewLine
            }
        }
        if($msg){
            # last block
            Send-SyslogEvent -Message $msg -Client $SyslogClient
        }
    }
    finally{
        # Same as with UDPClient, remember to dispose of the reader.
        if($reader){
            $reader.Dispose()
        }
    }
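}

# The code that created the shared client is also responsible for disposing of it.
$SyslogClient.Dispose()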

This is likely going to be faster than the switch, although I doubt you'll see much improvement in the memory footprint from this change alone, because switch -File also reads the file one line at a time rather than loading it whole.


Inspecting types for IDisposable

You can test if an object implements IDisposable with the -is operator:

PS C:\> $reader -is [System.IDisposable]
True

Or using Type.GetInterfaces(), as suggested by TheIncorrigible1:

PS C:\> [System.Net.Sockets.UdpClient].GetInterfaces()

IsPublic IsSerial Name
-------- -------- ----
True     False    IDisposable
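If you only have the type and not an instance, another option is Type.IsAssignableFrom(), which gives a plain boolean answer. This is a small sketch; the three types below are just examples chosen for illustration:

```powershell
# Check types directly, without creating an instance first.
[System.IDisposable].IsAssignableFrom([System.Net.Sockets.UdpClient])   # True
[System.IDisposable].IsAssignableFrom([System.IO.StreamReader])         # True
[System.IDisposable].IsAssignableFrom([string])                         # False
```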

I hope the above helps!

Mathias R. Jessen
  • 2
    Pro-tip about `IDisposable`, you can find all the interfaces of a type by calling the type's relevant method: `[Net.Sockets.UdpClient].GetInterfaces()` Also, validate nulls: `if ($null -ne $obj) { $obj.Dispose() }` – Maximilian Burszley Aug 21 '19 at 19:05
  • This appears to be working much faster, about 3x as fast. Thank you very much for taking the time to put this together, I learned so much! I'll have to wait and see if it finishes the file without memory issues, more to come. – MrMr Aug 21 '19 at 20:57
  • It's official, it worked! Finished a 2GB file in 2.25 hours, unbelievable. Thank you again. – MrMr Aug 22 '19 at 12:36
-1

Here's an example of a way to switch over a file one line at a time.

get-content file.log | foreach { 
  switch -regex ($_) { 
    '^START--' { "start line is $_"} 
    default    { "line is $_" } 
  } 
}

Actually, I don't think switch -file is a problem. It seems to be optimized not to use too much memory according to "ps powershell" in another window. I tried it with a one gig file.
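To check the event-grouping logic separately from file and network I/O, switch -Regex can also iterate over an in-memory array of lines. The following is a rough sketch using the example data from the question; Group-EventLines is a hypothetical helper name, and dropping lines that appear before the first START-- line is an assumption made here, not something the original script does:

```powershell
# Hypothetical helper: groups lines into events that begin with 'START--'.
function Group-EventLines {
    param(
        [string[]]$Line
    )

    $current = $null
    switch -Regex ($Line) {
        '^START--' {
            # A new event begins; emit the one collected so far, if any.
            if ($null -ne $current) { $current }
            $current = $_
        }
        default {
            # Continuation line; lines before the first START-- are dropped (assumption).
            if ($null -ne $current) {
                $current += [Environment]::NewLine + $_
            }
        }
    }
    # Emit the final event, if any.
    if ($null -ne $current) { $current }
}

$sample = @(
    'START--random stuff here'
    'more random stuff on this new line'
    'more stuff and things'
    'START--some random things'
    'additional random things'
    'blah blah'
    'START--data data more data'
    'START--things'
    'blah data'
)

$events = @(Group-EventLines -Line $sample)
$events.Count   # 4
```

Each element of $events is one complete multi-line message, so it could be passed to whatever function does the sending.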

js2010