
I have large files (at least 20 MB each) in which I need to find every string that matches the pattern M(\d{10}).

Below is the script I am using:

Get-Content -Path Test.log | %{ [Regex]::Matches($_, "M(\d{10})") } | %{ $_.Value } | select -Unique

This takes a long time and uses a lot of CPU. Please suggest how to get the results faster or with lower CPU usage.

San
    You'd probably need to use .NET methods to be efficient; have a look here: https://stackoverflow.com/questions/9439210/how-can-i-make-this-powershell-script-parse-large-files-faster – Tomek May 08 '18 at 09:38

3 Answers


Simply measure it yourself (the first command is repeated to minimize differences caused by caching):

Measure-Command {Get-Content -Path Test.log | %{ [Regex]::Matches($_, "M(\d{10})") } | %{ $_.Value } | select -Unique}

Measure-Command {Get-Content -Path Test.log | %{ [Regex]::Matches($_, "M(\d{10})") } | %{ $_.Value } | select -Unique}

Measure-Command {sls -Path Test.log  "M(\d{10})"  | %{ $_.Matches.Groups[1].Value } | select -Unique}
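One caveat about the Select-String variant: without -AllMatches, Select-String reports only the first match on each line, and it matches case-insensitively by default, whereas the [Regex]::Matches version returns every match and is case-sensitive. A minimal sketch that closes both gaps follows; the function name Get-AllLineMatch is hypothetical, and it returns the full match (M plus the digits), as the question's script does.

```powershell
# Hypothetical helper: returns every unique full match in a file.
function Get-AllLineMatch {
    param(
        [string] $Path,      # e.g. Test.log from the question
        [string] $Pattern    # e.g. 'M(\d{10})'
    )
    # -AllMatches reports every match on a line, not just the first;
    # -CaseSensitive mirrors the behavior of [Regex]::Matches.
    Select-String -LiteralPath $Path -Pattern $Pattern -AllMatches -CaseSensitive |
        ForEach-Object { $_.Matches } |   # unroll the matches of each line
        ForEach-Object { $_.Value } |     # full match text
        Select-Object -Unique
}
```

It could be timed the same way as the commands above, for example with Measure-Command { Get-AllLineMatch -Path Test.log -Pattern 'M(\d{10})' }.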
  • This works well and is fast: Measure-Command {sls -Path Test.log "M(\d{10})" | %{ $_.Matches.Groups[0].Value } | select -Unique} Thank you, LotPings. – San May 08 '18 at 11:21

Using the pipeline is (potentially) memory-efficient, but slow.

To speed up processing:

  • avoid the pipeline, but that is only an option if your data fits into memory as a whole - which shouldn't be a problem with 20 MB files.

  • separately, use .NET framework types and their methods directly, which is usually faster than using cmdlets.

Applying these insights to your scenario (PSv3+ syntax):

[regex]::Matches(
   [IO.File]::ReadAllText($PWD.ProviderPath + '/Test.log'), 
   'M\d{10}'
).Value | Select-Object -Unique

Note that for convenience the pipeline is still used, with Select-Object -Unique, in order to get the unique occurrences, but the assumption is that the bulk of the processing - extracting the regex matches - is in the optimized part of the statement.
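If the remaining pipeline step ever becomes the bottleneck, the deduplication can also be moved into .NET. The following is only a sketch under that assumption, not a measured improvement: the function name Get-UniqueRegexMatch is hypothetical, and it swaps Select-Object -Unique for a HashSet[string], which is likewise case-sensitive and keeps first-seen order here.

```powershell
# Hypothetical helper: extract unique regex matches without the pipeline.
function Get-UniqueRegexMatch {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $Pattern
    )
    # Resolve the path against PowerShell's current location, because .NET
    # methods use the process working directory, which may differ.
    $fullPath = Convert-Path -LiteralPath $Path
    $seen = New-Object 'System.Collections.Generic.HashSet[string]'
    foreach ($m in [regex]::Matches([IO.File]::ReadAllText($fullPath), $Pattern)) {
        # Add() returns $true only for a value that was not seen before.
        if ($seen.Add($m.Value)) { $m.Value }
    }
}
```

With the file from the question, this could be called as Get-UniqueRegexMatch -Path Test.log -Pattern 'M\d{10}'.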

mklement0

I would not use ForEach-Object multiple times, but would use Select-String instead:

(Get-Content -Path Test.log | Select-String "(?<=M)\d{10}").Matches.Value | select -Unique
iRon
  • Wouldn't that output the whole line with the match, not just the capture group? –  May 08 '18 at 09:53
  • @LotPings, you are right, I have changed my answer accordingly. – iRon May 08 '18 at 16:33
  • Another thing that can speed this up is using -ReadCount 0 on Get-Content. This will read the whole file at once instead of one line at a time. – Anders May 08 '18 at 19:01
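
The -ReadCount 0 suggestion can be sketched as follows. With -ReadCount 0, Get-Content emits all lines as a single array instead of one object per line, so the per-line work is done here with a foreach statement rather than ForEach-Object. The function name Get-UniqueMatchReadCount is hypothetical, and whether this beats the other approaches on a given file is something to measure, not a given.

```powershell
# Hypothetical helper: read all lines in one go, then match in a foreach loop.
function Get-UniqueMatchReadCount {
    param(
        [string] $Path,      # e.g. Test.log from the question
        [string] $Pattern    # e.g. 'M\d{10}'
    )
    # -ReadCount 0 sends every line through the pipeline as one array.
    $lines = Get-Content -LiteralPath $Path -ReadCount 0
    $regex = [regex] $Pattern   # construct the regex object once
    $found = foreach ($line in $lines) {
        foreach ($m in $regex.Matches($line)) { $m.Value }
    }
    $found | Select-Object -Unique
}
```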