Get Unique records from large files quickly

Question

I have big files (at least 20 MB each) in which I need to find strings matching M(\d{10}).

Below is the script I am using:

Get-Content -Path Test.log | %{ [Regex]::Matches($_, "M(\d{10})") } | %{ $_.Value } | select -Unique

This takes a long time and uses a lot of CPU. Please suggest how to get the results faster or with lower CPU usage.

You'd probably need to use .NET methods to be efficient; have a look here: https://stackoverflow.com/questions/9439210/how-can-i-make-this-powershell-script-parse-large-files-faster – Tomek, May 08 '18 at 09:38

score 2 · Accepted Answer · 2018-05-08T09:48:42.553

Simply measure it yourself (the first command is repeated to minimize differences caused by caching):

Measure-Command {Get-Content -Path Test.log | %{ [Regex]::Matches($_, "M(\d{10})") } | %{ $_.Value } | select -Unique}

Measure-Command {Get-Content -Path Test.log | %{ [Regex]::Matches($_, "M(\d{10})") } | %{ $_.Value } | select -Unique}

Measure-Command {sls -Path Test.log  "M(\d{10})"  | %{ $_.Matches.Groups[1].Value } | select -Unique}
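
One caveat when swapping [Regex]::Matches for Select-String: by default Select-String reports only the first match on each line and ignores case, whereas [Regex]::Matches returns every match and is case-sensitive. If a line can hold several IDs, a sketch along the following lines should keep the two approaches comparable. The function name Get-UniqueId and its parameters are illustrative assumptions, not part of the answer above.

```powershell
# Hypothetical helper: the name and parameters are illustrative only.
function Get-UniqueId {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,
        [string] $Pattern = 'M\d{10}'
    )
    # -AllMatches    : report every match on a line, not just the first one.
    # -CaseSensitive : Select-String ignores case unless this switch is given.
    Select-String -Path $Path -Pattern $Pattern -AllMatches -CaseSensitive |
        ForEach-Object { $_.Matches } |
        ForEach-Object { $_.Value } |
        Select-Object -Unique
}
```

Called as Get-UniqueId -Path Test.log, this returns the full match (M plus the ten digits), like the script in the question.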

edited May 08 '18 at 09:48

answered May 08 '18 at 09:42

This works well and is fast: Measure-Command {sls -Path Test.log "M(\d{10})" | %{ $_.Matches.Groups[0].Value } | select -Unique} Thank you, LotPings. – San May 08 '18 at 11:21

score 1 · Answer 2 · answered Jul 04 '18 at 21:20

Using the pipeline is (potentially) memory-efficient, but slow.

To speed up processing:

avoid the pipeline, but that is only an option if your data fits into memory as a whole - which shouldn't be a problem with 20 MB files.
separately, use .NET framework types and their methods directly, which is usually faster than using cmdlets.

Applying these insights to your scenario (PSv3+ syntax):

[regex]::Matches(
   [IO.File]::ReadAllText($PWD.ProviderPath + '/Test.log'), 
   'M\d{10}'
).Value | Select-Object -Unique

Note that for convenience the pipeline is still used, with Select-Object -Unique, in order to get the unique occurrences, but the assumption is that the bulk of the processing - extracting the regex matches - is in the optimized part of the statement.
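
If the file contains a very large number of matches, the remaining Select-Object -Unique step can itself become noticeable. A minimal sketch of one way to drop it, assuming a .NET HashSet is acceptable for de-duplication, is shown below. The function name Get-UniqueMatch and its parameters are hypothetical and only serve the illustration.

```powershell
# Hypothetical helper: the name and parameters are illustrative only.
function Get-UniqueMatch {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,
        [string] $Pattern = 'M\d{10}'
    )
    # Resolve against PowerShell's current location, because .NET's
    # working directory can differ from it.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $text = [IO.File]::ReadAllText($fullPath)

    # The default string comparer is case-sensitive (ordinal).
    $seen = New-Object 'System.Collections.Generic.HashSet[string]'
    foreach ($m in [regex]::Matches($text, $Pattern)) {
        # Add() returns $true only the first time a value is seen,
        # so each distinct match is emitted once, in order of first occurrence.
        if ($seen.Add($m.Value)) { $m.Value }
    }
}
```

Get-UniqueMatch -Path Test.log would then replace the whole statement above. Whether it is measurably faster depends on how many matches the file contains, so it is worth timing with Measure-Command.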

score 0 · Answer 3 · iRon · 2018-05-08T16:32:15.513

I would not chain multiple ForEach-Object calls, but instead use Select-String:

(Get-Content -Path Test.log | Select-String "(?<=M)\d{10}").Matches.Value | select -Unique

edited May 08 '18 at 16:32

answered May 08 '18 at 09:46

Wouldn't that output the whole line with the match, not just the capture group? – May 08 '18 at 09:53
@LotPings, you are right, I have changed my answer accordingly. – iRon May 08 '18 at 16:33
Another thing that can speed this up is using -ReadCount 0 on Get-Content. This reads the whole file at once instead of one line at a time. – Anders May 08 '18 at 19:01
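
As a rough sketch of the -ReadCount 0 suggestion: Get-Content then emits all lines as a single array, which can be walked with a plain foreach statement instead of sending each line through the pipeline. The function name Get-UniqueIdFromLines and its parameters are assumptions made for this example.

```powershell
# Hypothetical helper: the name and parameters are illustrative only.
function Get-UniqueIdFromLines {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,
        [string] $Pattern = 'M\d{10}'
    )
    # -ReadCount 0 returns all lines as one array rather than
    # one pipeline object per line.
    $lines = Get-Content -Path $Path -ReadCount 0

    # The foreach statement avoids per-line pipeline overhead; only the
    # extracted values go through the pipeline for de-duplication.
    $(foreach ($line in $lines) {
        foreach ($m in [regex]::Matches($line, $Pattern)) { $m.Value }
    }) | Select-Object -Unique
}
```

Any gain should be confirmed with Measure-Command on the real files, for example Measure-Command { Get-UniqueIdFromLines -Path Test.log }.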
