
I need to create a script that searches through just under a million files of text, code, etc., finds matches, and outputs all hits on a particular string pattern to a CSV file.

So far I have made this:

$location = 'C:\Work*'

$arr = "foo", "bar" # where "foo" and "bar" are string patterns I want to search for (separately)

for ($i = 0; $i -lt $arr.Length; $i++) {
    Get-ChildItem $location -Recurse |
        Select-String -Pattern $arr[$i] |
        Select-Object Path |
        Export-Csv "C:\Work\Results\$($arr[$i]).txt"
}

This returns a CSV file named "foo.txt" listing all files that contain the word "foo", and a file named "bar.txt" listing all files that contain the word "bar".

Can anyone think of a way to optimize this script to make it run faster? Or ideas for an entirely different, but equivalent, script that is simply faster?

All input appreciated!

cc0
  • How long does it take now (just out of curiosity)? Do you need only the file paths that contain matches in the output? – Roman Kuzmin Jan 11 '11 at 12:18
  • Now it takes ~2 hours per item in the array. I just learned the Measure-Command trick a bit ago (a quick timing sketch follows these comments); I'll see if performance increases as the process gets cached. -- I do only need the file paths that contain matches, yes – cc0 Jan 11 '11 at 12:24
  • I can also add that the length of each array item (string) seems to significantly affect processing time. CPU usage was around 15-20% during the first run-through. Now it seems to be around 4-5%. Interesting stuff. – cc0 Jan 11 '11 at 12:26
  • Are your files small enough, e.g. to read all the text into memory, or is this not an option? – Roman Kuzmin Jan 11 '11 at 12:27
  • The total sum of the files would be too big, but that is an interesting thought. If I could cache it all in RAM, I would be willing to split the operation and cache one subdirectory at a time before performing the search. Do you have any idea how to implement that? – cc0 Jan 11 '11 at 12:45
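
For reference, a minimal timing sketch with Measure-Command, using the placeholder path and pattern from the question (not measured results):

# Sketch only: time one pass of the original approach so variants can be compared.
Measure-Command {
    Get-ChildItem 'C:\Work*' -Recurse |
        Select-String -Pattern 'foo' |
        Select-Object Path
} | Select-Object TotalSeconds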

2 Answers


Let's suppose that 1) the files are not too big and you can load them into memory, and 2) you really just want the Path of each matching file (not the matching line, etc.).

I tried reading each file only once and then iterating through the regexes. There is some gain (it's faster than the original solution), but the final result will depend on other factors like file sizes, the number of files, etc.

Also, removing 'IgnoreCase' makes it a little faster.

$res = @{}
$arr | % { $res[$_] = @() }

Get-ChildItem $location -Recurse |
  ? { !$_.PsIsContainer } |
  % { $file = $_
      # read the whole file once, then test every pattern against it
      $text = [IO.File]::ReadAllText($file.FullName)
      $arr |
        % { $regex = $_
            if ([Regex]::IsMatch($text, $regex, 'IgnoreCase')) {
              # append (+=), so one pattern can collect many paths
              $res[$regex] += $file.FullName
            }
        }
  }
$res.GetEnumerator() | % {
  # wrap each path in an object so Export-Csv emits a Path column
  $_.Value | Select-Object @{ n = 'Path'; e = { $_ } } |
    Export-Csv "d:\temp\so-res$($_.Key).txt"
}
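
A possible further tweak (a sketch, not part of the tested code above): when the same patterns are matched against close to a million files, precompiling them once with RegexOptions.Compiled may pay off. The $arr, $res, $regex, $text, and $file names are assumed from the code above.

# Sketch: build each Regex once with the Compiled option; repeated
# IsMatch calls over many files can then run faster.
$opts = [System.Text.RegularExpressions.RegexOptions]'Compiled, IgnoreCase'
$regexes = @{}
$arr | % { $regexes[$_] = New-Object System.Text.RegularExpressions.Regex $_, $opts }
# inside the file loop, reuse the precompiled object:
#   if ($regexes[$regex].IsMatch($text)) { $res[$regex] += $file.FullName }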
stej
  • Thank you :) I will give this a shot also and see which is faster for my situation. Should be interesting! – cc0 Jan 11 '11 at 12:57
  • I'll do that as soon as I have them :] Might take a couple of days, I'll do some proper testing here with many items. – cc0 Jan 11 '11 at 13:05

If your files are not huge and can be read into memory, then this version should work quite a bit faster (and my quick and dirty local test seems to confirm that):

$location = 'C:\ROM'
$arr = "Roman", "Kuzmin"

# remove output files
foreach($test in $arr) {
    Remove-Item ".\$test.txt" -ErrorAction 0 -Confirm
}

Get-ChildItem $location -Recurse | .{process{ if (!$_.PSIsContainer) {
    # read all text once
    $content = [System.IO.File]::ReadAllText($_.FullName)
    # test patterns and output paths once
    foreach($test in $arr) {
        if ($content -match $test) {
            $_.FullName >> ".\$test.txt"
        }
    }
}}}

Notes: 1) mind the changed paths and patterns in the example; 2) the output files are not CSV but plain text; there is not much point in CSV if you are interested only in paths; plain text files with one path per line will do.
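
If CSV output is required after all, the one-path-per-line files convert easily; a minimal sketch, with the file name assumed from the example patterns above:

# Sketch: wrap each line in an object so Export-Csv produces a Path column.
Get-Content ".\Roman.txt" |
    Select-Object @{ n = 'Path'; e = { $_ } } |
    Export-Csv ".\Roman.csv" -NoTypeInformation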

Roman Kuzmin