
I am using PowerShell 2.0 on a Windows 7 desktop. I am attempting to search the enterprise CIFS shares for keywords/regexes. I already have a simple single-threaded script that does this, but a single keyword takes 19-22 hours. I have created a multithreaded script, my first attempt at multithreading, based on the article by Surly Admin.

Can Powershell Run Commands in Parallel?

Powershell Throttle Multi thread jobs via job completion

and the links related to those posts.

I decided to use runspaces rather than background jobs, as the prevailing wisdom says this is more efficient. The problem is that I am only getting partial output with the multithreaded script I have. I am not sure if it is an I/O thing, a memory thing, or something else. Hopefully someone here can help. Here is the code.

cls
Get-Date
Remove-Item C:\Users\user\Desktop\results.txt

$Throttle = 5 #threads

$ScriptBlock = {
    Param (
        $File
    )
    $KeywordInfo = Select-String -Pattern 'KEYWORD' -AllMatches -InputObject $File
    $KeywordOut = New-Object PSObject -Property @{
        Matches = $KeywordInfo.Matches
        Path = $KeywordInfo.Path
    }
    Return $KeywordOut
}

$RunspacePool = [RunspaceFactory]::CreateRunspacePool(1, $Throttle)
$RunspacePool.Open()
$Jobs = @()

$Files = Get-ChildItem -recurse -erroraction silentlycontinue
ForEach ($File in $Files) {
    $Job = [powershell]::Create().AddScript($ScriptBlock).AddArgument($File)
    $Job.RunspacePool = $RunspacePool
    $Jobs += New-Object PSObject -Property @{
        File = $File
        Pipe = $Job
        Result = $Job.BeginInvoke()
    }
}

Write-Host "Waiting.." -NoNewline
Do {
    Write-Host "." -NoNewline
    Start-Sleep -Seconds 1
} While ( $Jobs.Result.IsCompleted -contains $false)
Write-Host "All jobs completed!"

$Results = @()
ForEach ($Job in $Jobs) {
    $Results += $Job.Pipe.EndInvoke($Job.Result)
    $Job.Pipe.EndInvoke($Job.Result) | Where {$_.Path} | Format-List | Out-File -FilePath C:\Users\user\Desktop\results.txt -Append -Encoding UTF8 -Width 512
}

Invoke-Item C:\Users\user\Desktop\results.txt
Get-Date

This is the single-threaded version I am using that works, including the regex I am using for Social Security numbers.

cls
Get-Date

Remove-Item C:\Users\user\Desktop\results.txt

$files = Get-ChildItem -recurse -erroraction silentlycontinue

ForEach ($file in $files) {
    Select-String -pattern '[sS][sS][nN]:*\s*\d{3}-*\d{2}-*\d{4}' -AllMatches -InputObject $file | Select-Object matches, path |
        Format-List | Out-File -FilePath C:\Users\user\Desktop\results.txt -Append -Encoding UTF8 -Width 512
}

Get-Date
Invoke-Item C:\Users\user\Desktop\results.txt
patient.0x00
  • I see in your scriptblock `Return $SsnOut` but I don't see `$SsnOut` populated anywhere. Is it supposed to be `$KeywordOut` instead? – Matt Aug 25 '14 at 15:59
  • Correct, it should be $KeywordOut. I have edited the code to reflect the change. – patient.0x00 Aug 25 '14 at 16:44
  • So these files are enormous? That is why it takes so long to process? Is that actually the pattern you are looking for, `KEYWORD`? – Matt Aug 25 '14 at 18:08
  • No, looking for social security numbers, bad files (exe, ps1, bat, py, etc). Some of the files are large but it is also an enormous directory structure. I want it to be faster but not at the expense of missing files. – patient.0x00 Aug 25 '14 at 19:42
  • I use a program called FileSeek for searching for credit card numbers and such. Maybe you could have a look at that. Also, to stay in the PowerShell world, are you using regex for those queries? Maybe we can shave some time there. – Matt Aug 25 '14 at 19:53
  • Also heard of [False Sharing](http://stackoverflow.com/questions/8331255/false-sharing-and-pthreads) being a factor in multithreaded cases. – Matt Aug 25 '14 at 19:56
  • I am using regex for most of my queries. I will have to do some testing in a VM to see if the false sharing thing could be the issue. May have to ultimately get a commercial product but would really like to have this work (free!), but also just understand what the heck is going on that is causing files to be skipped when I use multithreading. – patient.0x00 Aug 25 '14 at 20:08
  • Is it possible your regex is incorrect and that is why it is skipping? Perhaps some whitespace there you don't know about. Also, FileSeek and others are free... some just have pro options. – Matt Aug 25 '14 at 20:10
  • I don't think so because I get good results with the same regex when I just use a simple foreach loop. May not be the most efficient regex though. – patient.0x00 Aug 25 '14 at 20:17
  • If that is what you are using for looking for socials, yes, there can be improvements to that. Also, `Get-ChildItem -recurse -erroraction silentlycontinue` does not specify a path to start at? Is that something you didn't include, or is that on purpose? – Matt Aug 25 '14 at 20:24
  • That was on purpose I was just testing from the directory I was in. Planned on adding parameters that could be used at the CLI later on. – patient.0x00 Aug 25 '14 at 20:26
  • I edited in the single thread version I am using that works along with the regex I am using for social security numbers in the post above. – patient.0x00 Aug 25 '14 at 20:31
  • I'm not entirely sure why you're missing results. Are you sure you're actually missing results, and not just getting them in a different order? One thing that may help performance in both versions is to compile the regex once and reuse it, e.g. `$regex = New-Object Regex 'abc', 'Compiled'` `$regex.Matches('abcabc')` – Alex Aug 25 '14 at 20:53
  • I have some doubts about `$Jobs.Result.IsCompleted -contains $false`. If this condition works incorrectly then the loop is exited too soon and you may get no results. – Roman Kuzmin Aug 26 '14 at 05:53
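Roman Kuzmin's doubt is worth testing: member enumeration over an array (`$Jobs.Result`) was only added in PowerShell 3.0, so on 2.0 the wait condition can evaluate against an empty result and exit the loop too soon. A minimal 2.0-safe sketch of the wait, using a hypothetical stand-in for the `$Jobs` array built in the question:

```powershell
# Hypothetical stand-in for the $Jobs array from the question; each entry
# exposes Result.IsCompleted the way the IAsyncResult handles do.
$Jobs = @(
    New-Object PSObject -Property @{ Result = New-Object PSObject -Property @{ IsCompleted = $true } }
    New-Object PSObject -Property @{ Result = New-Object PSObject -Property @{ IsCompleted = $true } }
)

# PowerShell 2.0-safe wait: enumerate the array explicitly instead of
# relying on member enumeration ($Jobs.Result), a 3.0 feature.
Do {
    $Pending = @($Jobs | Where-Object { -not $_.Result.IsCompleted })
    if ($Pending.Count -gt 0) { Start-Sleep -Seconds 1 }
} While ($Pending.Count -gt 0)
```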

2 Answers


I am hoping to build this answer over time, as I don't want to over-comment. I don't know yet why you are losing data from the multithreading, but I think we can increase performance with an updated regex. For starters, you have several greedy quantifiers that I think we can shrink down.

[sS][sS][nN]:*\s*\d{3}-*\d{2}-*\d{4}

Select-String is case-insensitive by default, so you don't need the `[sS][sS][nN]` portion at the beginning. Do you have to check for multiple colons? `:*` matches 0 or more colons. The same goes for the hyphens. Perhaps these would be better as `?`, which matches 0 or 1.

ssn:?\s*\d{3}-?\d{2}-?\d{4}

This assumes you are looking for mostly properly formatted SSNs. If people are hiding them in text, maybe you need to look for other delimiters as well.
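A quick sanity check of the tightened pattern against a few sample strings (the SSNs below are fabricated for illustration):

```powershell
# -match is case-insensitive by default, like Select-String.
$pattern = 'ssn:?\s*\d{3}-?\d{2}-?\d{4}'

'SSN: 123-45-6789'  -match $pattern   # True  - colon and hyphens present
'ssn 123456789'     -match $pattern   # True  - no colon, no hyphens
'ssn: 123--45-6789' -match $pattern   # False - double hyphen needed the old -* form
```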

I would also suggest writing the output to separate files and combining them after execution, if nothing else just to test.

Hoping this will be the start of a proper solution.

Matt
  • I was going to respond with this; however, I don't think the greediness matters here. `.*` is awful for performance because `.` matches everything, causing the number of possibilities to explode; but the provided regex only uses `-*`, not `.*`, so we only match a sequence of 0 or more *dashes*. – Alex Aug 25 '14 at 21:08
  • The other problem with this suggestion is that it fails to match some things the previous regex did (which may or may not be a problem, depending on how clean the previous data was). E.g. the original regex matches double dashes used in the number, whereas the suggested one doesn't (e.g. `'ssn: 123--12-1111'`). If the source data contains bad data with double dashes, or multiple spaces between the 'ssn' and the actual number, then stick with the original regex. Otherwise the above suggestion is better. – Alex Aug 25 '14 at 21:10
  • There are multiple things that could be done to account for variances. I was going for brevity based on my comment "This assumes you are looking for mostly properly formatted SSNs." If nothing else you could even make the first part optional: `(ssn:?)?`. Depends on what the OP is working with. Thanks for the bit about `.` vs `-` as far as greediness goes. – Matt Aug 25 '14 at 22:04
  • Thank you for the solution to the regex. I am fairly new to scripting and regexes, so this should help performance. I will test tomorrow. The results I've seen only use one colon and may or may not have a hyphen, and some do have multiple spaces, but it looks like @Matt's regex solution will work for me. Not sure if it will shed any light on the situation, but when I run the multithreaded version I get back only about 10%-15% of the results, and that drops even lower if I add more threads. – patient.0x00 Aug 26 '14 at 03:08
  • The above numbers are just estimations. I will get better data tomorrow. Thank you all. – patient.0x00 Aug 26 '14 at 03:35
  • What about using separate files? I wonder if there are too many write commands going to one file. Although it should queue them, I'm curious. Something like FileMon from Sysinternals might help to monitor requests to your file. It can seem like a complicated tool to dive into without experience. – Matt Aug 26 '14 at 03:36
  • I changed the code to just print the name of `$File` and tested it with different numbers of threads, and I am getting the same number of files each time, and it is the correct number. I will have to do more testing, but I suspect I may be having trouble with the Select-String cmdlet. – patient.0x00 Aug 26 '14 at 14:51

It turns out that, for some reason, the Select-String cmdlet was having problems with the multithreading. I don't have enough of a developer background to tell what is happening under the hood. However, I did discover that by using the -Quiet switch on Select-String, which turns its output into a Boolean, I was able to get the results I wanted.

The first pattern match in each document yields a True value. When I get a True, I return the path of that document to an array. When that is finished, I run the pattern match against the paths that were output from the scriptblock. This is not quite as effective performance-wise as I had hoped, but still a pretty dramatic improvement over a single thread.

The other issue I ran into was the reads/writes to disk caused by outputting results to a file at each stage. I have changed that to arrays. While still memory-intensive, it is much quicker.

Here is the resulting code. Any additional tips on performance improvement are appreciated:

cls
Remove-Item C:\Users\user\Desktop\output.txt

$Throttle = 5 #threads

$ScriptBlock = {
    Param (
        $File
    )
    $Match = Select-String -Pattern 'ssn:?\s*\d{3}-?\d{2}-?\d{4}' -Quiet -InputObject $File
    if ($Match -eq $true) {
        $MatchOut = New-Object PSObject -Property @{
            Path = $File.FullName
        }
    }
    Return $MatchOut
}

$RunspacePool = [RunspaceFactory]::CreateRunspacePool(1, $Throttle)
$RunspacePool.Open()
$Jobs = @()

$Files = Get-ChildItem -Path I:\ -recurse -erroraction silentlycontinue
ForEach ($File in $Files) {
    $Job = [powershell]::Create().AddScript($ScriptBlock).AddArgument($File)
    $Job.RunspacePool = $RunspacePool
    $Jobs += New-Object PSObject -Property @{
        File = $File
        Pipe = $Job
        Result = $Job.BeginInvoke()
    }
}

$Results = @()
ForEach ($Job in $Jobs) {
    $Results += $Job.Pipe.EndInvoke($Job.Result)
    $Job.Pipe.Dispose()
}
$RunspacePool.Close()

$PathValue = @()
ForEach ($Line in $Results) {
    $PathValue += $Line.psobject.properties | % {$_.Value}
}

$UniqValues = $PathValue  | sort | Get-Unique

$Output = ForEach ( $Path in $UniqValues ) {
    Select-String -Pattern '\d{3}-?\d{2}-?\d{4}' -AllMatches -Path $Path | Select-Object -Property Matches, Path
}

$Output | Out-File -FilePath C:\Users\user\Desktop\output.txt -Append -Encoding UTF8 -Width 512

Invoke-Item C:\Users\user\Desktop\output.txt
patient.0x00