4

The code below takes 400+ numbers from a list.txt file and checks whether each one exists in any file under the specified folder path.

The script is very slow: it had still not completed after 25 minutes of running. The folder being searched is 507 MB (532,369,408 bytes) and contains 1,119 files and 480 folders. Any help improving the speed and efficiency of the search is greatly appreciated.

$searchWords = (gc 'C:\temp\list.txt') -split ','
$results = @()
Foreach ($sw in $searchWords)
{
    $files = gci -path 'C:\Users\david.craven\Dropbox\Asset Tagging\_SJC Warehouse_\_Project Completed_\2018\A*' -filter "*$sw*" -recurse

    foreach ($file in $files)
    {
        $object = New-Object System.Object
        $object | Add-Member -Type NoteProperty -Name SearchWord -Value $sw
        $object | Add-Member -Type NoteProperty -Name FoundFile -Value $file.FullName
        $results += $object
    }

}

$results | Export-Csv C:\temp\output.csv -NoTypeInformation
dcraven
  • 2
    Are you trying to look for `$sw` in the file contents? The question sounds like you are, but the script looks only at file names. – vonPryz Nov 01 '18 at 23:16
  • 1
    You read all 1,100 files in their entirety looking for each of 400 words! Can this crazy language maybe search for any of, say 10 words at a time? Then you'd only need 40 passes over 1,100 files and it would be 10 times faster. Do you have to keep searching a document if you find a number, or can you exit on first match? Does this crazy language allow parallelisation? Can you use Linux instead of this thing? – Mark Setchell Nov 01 '18 at 23:26
  • 3
    Take a look at [Select-String](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/select-string?view=powershell-6), which can use regular expressions for more efficient matching. Also, it might be more efficient to get all the filenames first and then check them in memory, rather than making multiple calls to `Get-ChildItem`. Finally, try using the [PsCustomObject](https://devops-collective-inc.gitbook.io/the-big-book-of-powershell-gotchas/new-object-psobject-vs.-pscustomobject) method rather than `New-Object`/`Add-Member`, as the pipeline might be slowing things down. – boxdog Nov 01 '18 at 23:44
  • 2
    @MarkSetchell of course. `select-string` is the analog of `grep` in powershell, and it can search multiple patterns as well as regex – phuclv Nov 02 '18 at 02:29
  • 2
    _If you have a working piece of code from your project and are looking for open-ended feedback in the areas: Best practices and design pattern usage, Security issues, *Performance*, Correctness in unanticipated cases_ - Then [Code Review SE](https://codereview.stackexchange.com) is the right place to ask questions. Can someone please move this question? I am not able to. – Nikhil Vartak Nov 02 '18 at 02:43
  • += kills puppies. – js2010 Oct 09 '21 at 13:44

3 Answers

8

The following should speed up your task substantially:

If the intent is truly to look for the search words in the file names:

$searchWords = (Get-Content 'C:\temp\list.txt') -split ','
$path = 'C:\Users\david.craven\Dropbox\Facebook Asset Tagging\_SJC Warehouse_\_Project Completed_\2018\A*'

Get-ChildItem -File -Path $path -Recurse -PipelineVariable file |
  Select-Object -ExpandProperty Name |
    Select-String -SimpleMatch -Pattern $searchWords |
      Select-Object @{n='SearchWord'; e='Pattern'},
                    @{n='FoundFile'; e={$file.FullName}} |
        Export-Csv C:\temp\output.csv -NoTypeInformation

If the intent is to look for the search words in the files' contents:

$searchWords = (Get-Content 'C:\temp\list.txt') -split ','
$path = 'C:\Users\david.craven\Dropbox\Facebook Asset Tagging\_SJC Warehouse_\_Project Completed_\2018\A*'

Get-ChildItem -File -Path $path -Recurse |
  Select-String -List -SimpleMatch -Pattern $searchWords |
    Select-Object @{n='SearchWord'; e='Pattern'},
                  @{n='FoundFile'; e='Path'} |
      Export-Csv C:\temp\output.csv -NoTypeInformation

The keys to performance improvement:

  • Perform the search with a single command, by passing all search words to Select-String. Note: -List limits matching to at most 1 match per input file (by any of the given patterns).

  • Instead of constructing custom objects in a script block with New-Object and Add-Member, let Select-Object construct the objects for you directly in the pipeline, using calculated properties.

  • Instead of building an intermediate array iteratively with += - which behind the scenes recreates the array every time - use a single pipeline to pipe the result objects directly to Export-Csv.
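The first two points can be tried out on in-memory strings before running against the real folder. This is only a minimal sketch: the sample names and serial numbers below are made up for illustration, and `Line` is used as the `FoundFile` source only because the input here is plain strings rather than files.

```powershell
# Hypothetical sample data; these names are not from the real folder.
$patterns = 'FOC2223NHZB', 'FOC2214N235'
$names    = 'rack-FOC2223NHZB.xlsx', 'notes.txt', 'FOC2214N235_photo.jpg'

# A single Select-String call tests every input string against all patterns.
# Calculated properties rename the MatchInfo properties for the output objects.
$hits = $names |
  Select-String -SimpleMatch -Pattern $patterns |
    Select-Object @{n='SearchWord'; e='Pattern'},
                  @{n='FoundFile';  e='Line'}

$hits
```

Each output object pairs the pattern that matched with the string it matched in, so no explicit loop or `+=` is needed.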

mklement0
  • 3
    Nice! I always forget about -PipelineVariable! – Matt McNabb Nov 02 '18 at 02:39
  • 1
    Thanks, @MattMcNabb. It's a handy feature, but the need for it doesn't arise too often, so it's hard to remember. – mklement0 Nov 02 '18 at 02:40
  • Thanks @MattMcNabb for that great explanation. I am seeing the below error unfortunately. `Select-String : Cannot bind argument to parameter 'Pattern' because it is an empty string. At C:\Users\david.craven\Downloads\test.ps1:5 char:39 + Select-String -SimpleMatch -Pattern $searchWords | + ~~~~~~~~~~~~ + CategoryInfo : InvalidData: (:) [Select-String], ParameterBindingValidationException + FullyQualifiedErrorId : ParameterArgumentValidationErrorEmptyStringNotAllowed,Microsoft.PowerShell.Commands.SelectStringCommand ` – dcraven Nov 02 '18 at 03:33
  • 2
    @dcraven: That suggests that `$searchWords` is empty rather than containing your search words. – mklement0 Nov 02 '18 at 03:42
  • @mklement0 - I am not sure how as it has over 500 different values i.e FOC2223NHZB, FOC2223NHZ4, FOC2214N235, FOC2223NJ01, – dcraven Nov 02 '18 at 03:50
  • @dcraven: Maybe in a different variable name - typo? You can recreate the problem with `'input' | Select-String $NoSuchVariable` vs. `'input' | Select-String 'in', 'put'` – mklement0 Nov 02 '18 at 04:01
  • As the 2nd script would output multiple occurrences in a file without distinguishing between them, I'd add another calculated property to the Select-Object `@{n='Line';e={"{0,5}:{1}" -f $_.LineNumber,$_.Line}}` or otherwise add the `-Unique` switch parameter (+1) –  Nov 02 '18 at 11:19
  • 1
    Good point about multiple matches, @LotPings. For simplicity I decided to add `-List` to `Select-String`, which limits matching to at most 1 occurrence. – mklement0 Nov 02 '18 at 12:27
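One possible cause of the empty-string error reported in the comments, offered as an assumption rather than something confirmed in this thread: if the list ends with a trailing comma (as the quoted sample values do), `-split ','` yields an empty last element, and Select-String rejects empty patterns. A minimal sketch that trims and filters the entries, using a made-up `$raw` string in place of the real file content:

```powershell
# Hypothetical list content with a trailing comma, modeled on the sample values in the comments.
$raw = 'FOC2223NHZB, FOC2223NHZ4, FOC2214N235, FOC2223NJ01,'

# Split on commas, trim surrounding whitespace, and drop empty entries.
$searchWords = $raw -split ',' |
  ForEach-Object { $_.Trim() } |
    Where-Object { $_ }

$searchWords.Count
```

With the real file, the same filter could be applied to `(Get-Content 'C:\temp\list.txt') -split ','` before passing the result to `-Pattern`.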
1

So there are definitely some basic things in the PowerShell code you posted that can be improved, but it may still not be super fast. Based on the sample you gave us I'll assume you're looking to match the file names against a list of words. You're looping through the list of words (400 iterations) and in each loop you're looping through all 1,119 files. That's a total of 447,600 iterations!

Assuming you can't reduce the number of iterations in the loop, let's start by making each iteration faster. The Add-Member cmdlet is going to be really slow, so switch that approach up by casting a hashtable to the [PSCustomObject] type accelerator:

[PSCustomObject]@{
    SearchWord = $Word
    File       = $File.FullName
}

Also, there is no reason to pre-create an array object and then add each file to it. You can simply capture the output of the foreach loop in a variable:

$Results = Foreach ($Word in $Words)
{
...

So a faster loop might look like this:

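# $WordList and $Path are placeholders for the list file and the folder to search.
# The list in the question is comma-separated, so split it into individual words.
$Words = (Get-Content -Path $WordList) -split ','
$Files = Get-ChildItem -Path $Path -Recurse -File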

$Results = Foreach ($Word in $Words)
{    
    foreach ($File in $Files)
    {
        if ($File.BaseName -match $Word)
        {
            [PSCustomObject]@{
                SearchWord = $Word
                File       = $File.FullName
            }
        }
    }
}

A simpler approach might be to use Where-Object on the files array:

$Results = Foreach ($Word in $Words)
{
    $Files | Where-Object BaseName -match $Word
}

Try both and test out the performance.
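One way to compare approaches is `Measure-Command`. The sketch below times array growth with `+=` against capturing the loop output directly; the item count and the dummy property values are arbitrary choices for illustration, and the actual timings will vary by machine and PowerShell version.

```powershell
$n = 5000  # arbitrary number of objects, chosen only for illustration

# Approach 1: grow an array with +=, which creates a new array on each addition.
$slow = Measure-Command {
    $results = @()
    foreach ($i in 1..$n) {
        $results += [PSCustomObject]@{ SearchWord = "w$i"; File = "f$i" }
    }
}

# Approach 2: let PowerShell collect the loop's output into the variable.
$fast = Measure-Command {
    $results = foreach ($i in 1..$n) {
        [PSCustomObject]@{ SearchWord = "w$i"; File = "f$i" }
    }
}

'{0:N0} ms with +=, {1:N0} ms with loop capture' -f $slow.TotalMilliseconds, $fast.TotalMilliseconds
```

The same pattern can wrap either of the loops above to see how they behave on the real folder.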

Matt McNabb
0

So if speeding up the loop doesn't meet your needs, try removing the loop entirely. You could use regex and join all the words together:

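# $WordList and $Path are placeholders. Split the comma-separated list and drop blank entries,
# because an empty alternative in the regex would match every file.
$Words = (Get-Content -Path $WordList) -split ',' | Where-Object { $_.Trim() }
$Files = Get-ChildItem -Path $Path -Recurse -File
# Escape each word so it is matched literally, then join the words into one alternation.
$WordRegex = ($Words | ForEach-Object { [regex]::Escape($_.Trim()) }) -join '|'
$Files | Where-Object BaseName -match $WordRegex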
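
The joined regex returns the matching files but not which word matched. If the word is needed for the CSV, the automatic `$Matches` variable can supply it. A minimal sketch on made-up base names (the sample data is hypothetical):

```powershell
# Hypothetical sample data for illustration.
$Words = 'FOC2223NHZB', 'FOC2214N235'
$names = 'rack-FOC2223NHZB', 'notes', 'FOC2214N235_photo'

# Escape each word and join the words into a single alternation pattern.
$WordRegex = ($Words | ForEach-Object { [regex]::Escape($_) }) -join '|'

$Results = foreach ($name in $names) {
    # -match against a single string populates $Matches; $Matches[0] is the matched text
    # as it appears in the name (matching is case-insensitive by default).
    if ($name -match $WordRegex) {
        [PSCustomObject]@{
            SearchWord = $Matches[0]
            File       = $name
        }
    }
}

$Results
```

Note that this reports only the first word found in each name; a name containing several words from the list would still produce a single row.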
Matt McNabb