I have a text file with a large number of log messages. I want to extract the messages between two string patterns. I want the extracted message to appear as it is in the text file.

I tried the following approach. It works, but it doesn't support Get-Content's -Wait and -Tail options. Also, the extracted result is displayed on a single line rather than laid out as in the text file. Inputs are welcome :-)

Sample Code

function GetTextBetweenTwoStrings($startPattern, $endPattern, $filePath){

    # Get content from the input file
    $fileContent = Get-Content $filePath

    # Regular expression (Regex) of the given start and end patterns
    $pattern = "$startPattern(.*?)$endPattern"

    # Perform the Regex operation
    $result = [regex]::Match($fileContent,$pattern).Value

    # Finally return the result to the caller
    return $result
}

# Clear the screen
Clear-Host

$input = "THE-LOG-FILE.log"
$startPattern = 'START-OF-PATTERN'
$endPattern = 'END-OF-PATTERN'

# Call the function
GetTextBetweenTwoStrings -startPattern $startPattern -endPattern $endPattern -filePath $input

Improved script based on Theo's answer. The following points need to be improved:

  1. The beginning and end of the output are somehow trimmed, even though I adjusted the buffer size in the script.
  2. How can I wrap each matched result in the START and END strings?
  3. I still could not figure out how to use the -Wait and -Tail options.

Updated Script

# Clear the screen
Clear-Host

# Adjust the buffer size of the window
$bw = 10000
$bh = 300000
if ($host.name -eq 'ConsoleHost') # or -notmatch 'ISE'
{
  [console]::bufferwidth = $bw
  [console]::bufferheight = $bh
}
else
{
    $pshost = get-host
    $pswindow = $pshost.ui.rawui
    $newsize = $pswindow.buffersize
    $newsize.height = $bh
    $newsize.width = $bw
    $pswindow.buffersize = $newsize
}


function Get-TextBetweenTwoStrings ([string]$startPattern, [string]$endPattern, [string]$filePath){
    # Get content from the input file
    $fileContent = Get-Content -Path $filePath -Raw
    # Regular expression (Regex) of the given start and end patterns
    $pattern = '(?is){0}(.*?){1}' -f [regex]::Escape($startPattern), [regex]::Escape($endPattern)
    # Perform the Regex operation and output
    [regex]::Match($fileContent,$pattern).Groups[1].Value
}

# Input file path
$inputFile = "THE-LOG-FILE.log"

# The patterns
$startPattern = 'START-OF-PATTERN'
$endPattern = 'END-OF-PATTERN'


Get-TextBetweenTwoStrings -startPattern $startPattern -endPattern $endPattern -filePath $inputFile
Sherzad

2 Answers

First of all, you should not use $input as a self-defined variable name, because it is an automatic variable.

Then, you are reading the file as a string array, where you would rather read it as a single, multiline string. To do that, append the -Raw switch to the Get-Content call.
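A minimal sketch of the difference, using a throwaway temp file whose content is made up for illustration. Without -Raw, Get-Content returns an array of lines; when such an array is coerced to a single string, as happens when it is passed to [regex]::Match(), the elements are joined with spaces, which would explain why the result appeared on one line.

```powershell
# Create a small throwaway file with three lines (illustrative content only).
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tmp -Value 'START', 'middle', 'END'

$asLines = Get-Content -Path $tmp          # array of strings, one per line
$asText  = Get-Content -Path $tmp -Raw     # one string, line breaks preserved

$asLines.Count                 # 3
[string] $asLines              # START middle END  (elements joined with spaces)
$asText -match "`n"            # True: the line breaks are still there

Remove-Item -Path $tmp
```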

The regex you are creating does not allow for regex special characters in the start and end patterns you give, so I would suggest using [regex]::Escape() on these patterns when creating the regex string.

While your regex does use a capturing group (the part inside the parentheses), you are not using that group when it comes to getting the value you seek.

Finally, I would recommend using the PowerShell naming convention (Verb-Noun) for the function name.

Try

function Get-TextBetweenTwoStrings ([string]$startPattern, [string]$endPattern, [string]$filePath){
    # Get content from the input file
    $fileContent = Get-Content -Path $filePath -Raw
    # Regular expression (Regex) of the given start and end patterns
    $pattern = '(?is){0}(.*?){1}' -f [regex]::Escape($startPattern), [regex]::Escape($endPattern)
    # Perform the Regex operation and output
    [regex]::Match($fileContent,$pattern).Groups[1].Value
}

$inputFile    = "D:\Test\THE-LOG-FILE.log"
$startPattern = 'START-OF-PATTERN'
$endPattern   = 'END-OF-PATTERN'

Get-TextBetweenTwoStrings -startPattern $startPattern -endPattern $endPattern -filePath $inputFile

Would result in something like:

blahblah
more lines here

The (?is) makes the regex case-insensitive and has the dot match line breaks as well.
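As a small illustration of what the two inline options do, here is a sketch on an in-memory string (the sample text and marker names are invented). Note that [regex]::Match(), unlike the -match operator, is case-sensitive unless told otherwise.

```powershell
$text = "start-marker`nline one`nline two`nEND-MARKER"

# No inline options: case-sensitive, and '.' does not match "`n".
$plain = [regex]::Match($text, 'START-MARKER(.*?)END-MARKER')
$plain.Success        # False

# (?i) alone fixes the case mismatch, but '.' still cannot cross line breaks.
$caseOnly = [regex]::Match($text, '(?i)START-MARKER(.*?)END-MARKER')
$caseOnly.Success     # False

# (?is): case-insensitive plus single-line mode, so '.' matches "`n" too.
$both = [regex]::Match($text, '(?is)START-MARKER(.*?)END-MARKER')
$both.Success         # True
$both.Groups[1].Value # the text between the markers, line breaks included
```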


Nice to see you're using my version of the Get-TextBetweenTwoStrings function. However, I believe you are mistaking how the console displays the output for how a dedicated text editor would display it. In the console, lines that are too long will be truncated, whereas in a text editor like Notepad you can choose to wrap long lines or have a horizontal scrollbar.

If you simply append

| Set-Content -Path 'X:\wherever\theoutput.txt'

to the Get-TextBetweenTwoStrings .. call, you will find the lines are NOT truncated when you open it in Word or notepad for instance.

In fact, you can have that line followed by

notepad 'X:\wherever\theoutput.txt'

to have notepad open that file straight away.
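For the follow-up points about getting every block rather than only the first, and about keeping the start and end strings in the output, one possible variation is sketched below. The function name Get-AllTextBetweenTwoStrings and its -text parameter are hypothetical and not part of the function above; it also assumes plain text input and that blocks do not overlap.

```powershell
# Hypothetical helper: returns every block, delimiters included.
function Get-AllTextBetweenTwoStrings ([string]$startPattern, [string]$endPattern, [string]$text) {
    # No capturing group here: the whole match, including both delimiters, is wanted.
    $pattern = '(?is){0}.*?{1}' -f [regex]::Escape($startPattern), [regex]::Escape($endPattern)
    # Matches() returns all non-overlapping matches, not just the first one.
    foreach ($m in [regex]::Matches($text, $pattern)) { $m.Value }
}

# Made-up sample input with two blocks.
$sample = "noise`nSTART a`nb END`nnoise`nSTART c END"
$blocks = @(Get-AllTextBetweenTwoStrings -startPattern 'START' -endPattern 'END' -text $sample)
$blocks.Count   # 2
$blocks[0]      # first block, spanning two lines
```

To apply this to a file, the -text argument could be the result of Get-Content -Raw on that file.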

Theo
  • thanks for your inputs and please see my edited question for the existing issues. – Sherzad Mar 14 '22 at 16:06
  • @Sherzad, Theo's answer has good advice in general, but if you want to use `-Wait`, you _must_ use _line-by-line_ processing; multi-line matching with `-Raw` is then not an option. – mklement0 Mar 14 '22 at 16:16
  • @Sherzad Please see my edit. – Theo Mar 15 '22 at 11:48
  • @Theo, how to include the start and end patterns in the output as well, any idea? Even though this is the `` value for the `$endpattern`, nothing in the output. Of course, `` is not at the beginning of a line. – Sherzad Mar 16 '22 at 15:44
  • @Sherzad This looks more a XML file than a plain-text log file. Is that the case here? XML should not be processed like this and needs a different approach. – Theo Mar 16 '22 at 16:00
  • @Sherzad If your question indeed is how to extract the text from an XML node, I'd suggest you post a new question for that, since this one has been answered already. Do not forget to include a representative example of the file in that new question. – Theo Mar 16 '22 at 16:14
  • You need to perform streaming processing of your Get-Content call, in a pipeline, such as with ForEach-Object, if you want to process lines as they're being read.

    • This is a must if you're using Get-Content -Wait, because such a call doesn't terminate by itself (it keeps waiting for new lines to be added to the file, indefinitely), but inside a pipeline its output can be processed as it is being received, even before the command terminates.
  • You're trying to match across multiple lines, which with Get-Content output would only work if you used the -Raw switch - by default, Get-Content reads its input file(s) line by line.

    • However, -Raw is incompatible with -Wait.
    • Therefore, you must stick with line-by-line processing, which requires that you match the start and end patterns separately, and keep track of when you're processing lines between those two patterns.

Here's a proof of concept, but note the following:

  • -Tail 100 is hard-coded - adjust as needed or make it another parameter.

  • The use of -Wait means that the function will run indefinitely - waiting for new lines to be added to $filePath - so you'll need to use Ctrl-C to stop it.

    • While you can use a Get-TextBetweenTwoStrings call itself in a pipeline for object-by-object processing, assigning its result to a variable ($result = ...) won't work when terminating with Ctrl-C, because this method of termination also aborts the assignment operation.

    • To work around this limitation, the function below is defined as an advanced function, which automatically enables support for the common -OutVariable parameter, which is populated even in the event of termination with Ctrl-C; your sample call would then look as follows (as Theo notes, don't use the automatic $input variable as a custom variable):

      # Look for blocks of interest in the input file, indefinitely,
      # and output them as they're being found.
      # After termination with Ctrl-C, $result will also contain the blocks
      # found, if any.
      Get-TextBetweenTwoStrings -OutVariable result -startPattern $startPattern -endPattern $endPattern -filePath $inputFile
      
  • Per your feedback you want the block of lines to encompass the full lines on which the start and end patterns match, so the regexes below are enclosed in .*

  • The word pattern in your $startPattern and $endPattern parameters is a bit ambiguous in that it suggests that they themselves are regexes that can therefore be used as-is or embedded as-is in a larger regex on the RHS of the -match operator.
    However, in the solution below I am assuming that they are to be treated as literal strings, which is why they are escaped with [regex]::Escape(); simply omit these calls if these parameters are indeed regexes themselves; i.e.:

    $startRegex = '.*' + $startPattern + '.*'
    $endRegex = '.*' + $endPattern + '.*'
    
  • The solution assumes there is no overlap between blocks and that, in a given block, the start and end patterns are on separate lines.

  • Each block found is output as a single, multi-line string, using LF ("`n") as the newline character; if you want CRLF newline sequences instead, use "`r`n"; for the platform-native newline format (CRLF on Windows, LF on Unix-like platforms), use [Environment]::NewLine.

# Note the use of "-" after "Get", to adhere to PowerShell's
# "<Verb>-<Noun>" naming convention.
function Get-TextBetweenTwoStrings {

  # Make the function an advanced one, so that it supports the 
  # -OutVariable common parameter.
  [CmdletBinding()]
  param(
    $startPattern, 
    $endPattern, 
    $filePath
  )

  # Note: If $startPattern and $endPattern are themselves
  #       regexes, omit the [regex]::Escape() calls.
  $startRegex = '.*' + [regex]::Escape($startPattern) + '.*'
  $endRegex = '.*' + [regex]::Escape($endPattern) + '.*'

  $inBlock = $false
  $block = [System.Collections.Generic.List[string]]::new()

  Get-Content -Tail 100 -Wait $filePath | ForEach-Object {
    if ($inBlock) {
      if ($_ -match $endRegex) {
        $block.Add($Matches[0])
        # Output the block of lines as a single, multi-line string
        $block -join "`n"
        $inBlock = $false; $block.Clear()       
      }
      else {
        $block.Add($_)
      }
    }
    elseif ($_ -match $startRegex) {
      $inBlock = $true
      $block.Add($Matches[0])
    }
  }

}
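To try out the block-tracking logic without a live log file, the same start/end state machine can be fed from the pipeline instead of Get-Content -Wait, so that it terminates on its own. The sketch below is a simplified variant under that assumption: the function name Select-LineBlock and its -line parameter are hypothetical, and it adds the whole matching line directly rather than wrapping the patterns in .* and using $Matches[0].

```powershell
# Hypothetical test harness for the line-by-line approach.
function Select-LineBlock {
  [CmdletBinding()]
  param(
    [Parameter(Mandatory)] [string] $startPattern,
    [Parameter(Mandatory)] [string] $endPattern,
    [Parameter(ValueFromPipeline)] [string] $line
  )
  begin {
    # Patterns are treated as literal strings, as in the function above.
    $startRegex = [regex]::Escape($startPattern)
    $endRegex   = [regex]::Escape($endPattern)
    $inBlock = $false
    $block = [System.Collections.Generic.List[string]]::new()
  }
  process {
    if ($inBlock) {
      $block.Add($line)
      if ($line -match $endRegex) {
        $block -join "`n"                  # emit the finished block as one string
        $inBlock = $false; $block.Clear()  # reset so later blocks are found too
      }
    }
    elseif ($line -match $startRegex) {
      $inBlock = $true
      $block.Add($line)
    }
  }
}

# Made-up input lines containing two blocks.
$lines = 'noise', 'x START-OF-PATTERN y', 'payload', 'z END-OF-PATTERN', 'noise',
         'START-OF-PATTERN', 'END-OF-PATTERN'
$found = @($lines | Select-LineBlock -startPattern 'START-OF-PATTERN' -endPattern 'END-OF-PATTERN')
$found.Count    # 2
```

Because it reads from the pipeline, the same function could also sit behind Get-Content -Tail 100 -Wait, in which case it would run until stopped with Ctrl-C.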
mklement0
  • thank you for the quick feedback. I appended `.*` and prepended `.*?` to the `$startPattern` and `$endPattern`, but the complete line for the `$endPattern` is not shown in the output. Another issue is that it captures and shows only the first match; after that it does not show any further matches in the output. – Sherzad Mar 15 '22 at 06:59
  • Also, when I use `ctrl + c` to terminate the script, I only see `stopping` in the status bar and no more output in the console. FYI, I am using `PowerShell ISE`. – Sherzad Mar 15 '22 at 07:10
  • @Sherzad Re only capturing the first match: that was an oversight on my part, sorry: `$inBlock` needs to be reset to `$false` after finding the end pattern - please see my update, which also uses updated regexes to capture the _full_ lines. As for Ctrl-C: the execution environment (console vs. ISE) shouldn't matter; I've just tried in the ISE, and it works as intended. – mklement0 Mar 15 '22 at 07:31
  • @Sherzad As an aside regarding the ISE: it is [no longer actively developed](https://docs.microsoft.com/en-us/powershell/scripting/components/ise/introducing-the-windows-powershell-ise#support) and [there are reasons not to use it](https://stackoverflow.com/a/57134096/45375) (bottom section), notably not being able to run PowerShell (Core) 6+. The actively developed, cross-platform editor that offers the best PowerShell development experience is [Visual Studio Code](https://code.visualstudio.com/) with its [PowerShell extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode.PowerShell). – mklement0 Mar 15 '22 at 07:33
  • @Sherzad, I've switched to using `[regex]::Escape()`, and added some more clarifications. From what I can tell, everything works as intended now, both in a regular console window and in the ISE. Again, note that you can NOT do `$result = Get-TextBetweenTwoStrings ...` and must use `Get-TextBetweenTwoStrings -OutVariable result ...` instead. – mklement0 Mar 15 '22 at 07:56