
I taught myself PowerShell, so I do not know everything about it.

I need to search a predefined database for entries of the exact length I have put in; it contains > 11800 entries.

Can you please help me find what is making this slow?

Code:

$Dict = Get-Content "C:\Users\----\Desktop\Powershell Program\US.txt"

if($Right -ne "") {
    $Comb = $Letter + $Right
    $total = [int]0    
    $F = ""

    do {
        $F = $Dict | Select-Object -Index $total
        if($F.Length -eq $Num) {
            if($F.Chars("0") + $F.Chars("1") -eq $Comb) {
                Add-Content "C:\Users\----\Desktop\Powershell Program\Results.txt" "$F"
            }
        }
        $total++
        Write-Host $total
    } until([int]$total -gt [int]118619)

    $total = [int]0
    $F = ""
}

How do I speed up this line-by-line searching/matching process? Should I use multi-threading? If so, how?

Katie Rivera
  • Welcome to Stack Overflow! I've posted an answer, as I feel that this is a valid question that is on-topic for the site, but please do read the guidelines about asking questions, and have a look at other questions that are well received. There is a lot of extra stuff in your question that isn't related to the problem (which is probably why you were downvoted). Feel free to edit your question, and keep that in mind for future questions. – briantist Dec 06 '15 at 05:54
  • Tip: do not use one-letter identifiers; they make the code unreadable to others – sodawillow Dec 06 '15 at 09:34
  • @KatieRivera You may have another, non-performance, bug. Your description says > 11800 entries, but your end condition is `-gt 118619`. If your description is indicative of the number of entries, then $total will be off by more than a factor of 10 (and do a bunch of extra loop iterations in the process). `$dict.count` will have the actual number of entries. One other thing to note: `$dict` won't actually be a dictionary. A lot of PS and other .NET programmers would expect $dict to refer to a dictionary. – Χpẘ Dec 06 '15 at 22:25
  • As mentioned by @briantist, do not underestimate the PowerShell pipeline: if correctly implemented (like `Get-Content .\In.txt | ForEach-Object { } | Set-Content .\Out.txt`), it might well be faster if you take the complete solution into account. See also: [Fastest Way to get a uniquely index item from the property of an array](https://stackoverflow.com/a/59437162/1701026) – iRon Mar 16 '20 at 11:01

2 Answers


It seems like you've known at least one other language before PowerShell, and are starting out by basically replicating what you might have done in another language in this one. That's a great way to learn a new language, but of course in the beginning you might end up with methods that are a bit strange or not performant.

So first I want to break down what your code is actually doing, as a rough overview:

  1. Read every line of the file at once and store it in the $Dict variable.
  2. Loop the same number of times as there are lines.
  3. In each iteration of the loop:
    1. Get the single line that matches the loop iteration (essentially through another iteration, rather than indexing, more on that later).
    2. Get the first character of the line, then the second, then combine them.
    3. If that's equal to a pre-determined string, append this line to a text file.

Step 3-1 is what's really slowing this down

To understand why, you need to know a little bit about pipelines in PowerShell. Cmdlets that accept and work on pipelines take one or more objects, but they process a single object at a time. They don't even have access to the rest of the pipeline.
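
As a minimal illustration of that one-object-at-a-time behavior (this snippet is just for demonstration, not from the question), the script block below runs once per item and never sees the whole array:

1..3 | ForEach-Object { Write-Host "Processing item $_" }

# Output:
# Processing item 1
# Processing item 2
# Processing item 3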

This is also true for the Select-Object cmdlet. So when you take an array with 18,500 objects in it, and pipe it into Select-Object -Index 18000, you need to send in 18,000 objects for inspection/processing before it can give you the one you want. You can see how the time taken would get longer and longer the larger the index is.

Since you already have an array, you can directly access any array member by index with square brackets [] like so:

$Dict[18000]

For a given array, that takes the same amount of time no matter what the index is.

Now for a single call to Select-Object -Index you probably aren't going to notice how long it takes, even with a very large index; the problem is that you're looping through the entire array already, so this is compounding greatly.

You're essentially having to do the sum of 1..18000, which is approximately 162,000,000 iterations! (thanks to user2460798 for correcting my math)
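
That figure is just the triangular-number formula n(n+1)/2 = 18000 * 18001 / 2; you can check the exact value in PowerShell:

(1..18000 | Measure-Object -Sum).Sum    # 162009000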

Proof

I tested this. First, I created an array with 19,000 objects:

$a = 1..19000 | %{"zzzz~$_"}

Then I measured both methods of accessing it. First, with select -index:

measure-command { 1..19000 | % { $a | select -Index ($_-1 ) } | out-null }

Result:

TotalMinutes      : 20.4383861316667
TotalMilliseconds : 1226303.1679

Then with the indexing operator ([]):

measure-command { 1..19000 | % { $a[$_-1] } | out-null }

Result:

TotalMinutes      : 0.00788774666666667
TotalMilliseconds : 473.2648

The results are pretty striking: roughly 1,226,303 ms versus 473 ms, so it takes nearly 2,600 times longer to use Select-Object.

A counting loop

The above is the single thing causing your major slowdown, but I wanted to point out something else.

Typically in most languages, you would use a for loop to count. In PowerShell this would look like this:

for ($i = 0; $i -lt $total ; $i++) {
    # $i has the value of the iteration
}

In short, there are three statements in the for loop. The first is an expression that gets run before the loop starts. $i = 0 initializes the iterator to 0, which is the typical usage of this first statement.

Next is a conditional; this will be tested on each iteration, and the loop will continue if it returns true. Here $i -lt $total checks that $i is less than the value of $total, some other variable defined elsewhere, presumably the maximum value.

The last statement gets executed on each iteration of the loop. $i++ is the same as $i = $i + 1 so in this case we're incrementing $i on each iteration.

It's a bit more concise than using a do/until loop, and it's easier to follow because the meaning of a for loop is well known.
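
Putting the two changes together, here's a rough sketch of how your loop might look with direct indexing and a for loop (it assumes $Num and $Comb are set up as in your question):

$Dict = Get-Content "C:\Users\----\Desktop\Powershell Program\US.txt"

for ($i = 0; $i -lt $Dict.Count; $i++) {
    $line = $Dict[$i]    # direct index access instead of Select-Object -Index
    # Substring(0, 2) compares the first two characters, same as Chars(0) + Chars(1)
    if ($line.Length -eq $Num -and $line.Substring(0, 2) -eq $Comb) {
        Add-Content "C:\Users\----\Desktop\Powershell Program\Results.txt" $line
    }
}

Using $Dict.Count as the bound also avoids hard-coding 118619, which, as Χpẘ points out in the comments, may not match the actual number of entries.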

Other Notes

If you're interested in more feedback about working code you've written, have a look at Code Review. Please read the rules there carefully before posting.

briantist
  • It's not an answer, it's a book! ^^ Well done, again – sodawillow Dec 06 '15 at 09:28
  • Great answer, props for exemplifying the impact with `Measure-Command`! Be aware though that the pipeline performance penalty applies to `| Out-Null` as well, slightly polluting the measurement. See my [`Test-NullPerformance`](https://gist.github.com/IISResetMe/610deb873a96f331039d) script for reference – Mathias R. Jessen Dec 06 '15 at 14:58
  • @MathiasR.Jessen That's a good point (and a great script for testing that btw). I figured the penalty would be equal among both methods though. `Select-Object` is only going to pass 1 object down the (already started) pipeline, the matching index, in this case. The array indexing will also pass a single object. Or did you mean that it just adds to the time all around, not that it magnifies the differences? – briantist Dec 06 '15 at 17:40
  • @briantist Good analysis of the problem. A small nit: doing 18000! iterations of anything, even if you did them one per picosecond, would take longer than the age of the universe (10^12 * 60 * 60 * 24 * 365.25 * 15 * 10^9, approximately 5 * 10^29 picoseconds). The representative calculation would be the _sum_ of 1..18000, which is about 18000^2/2 = 162,000,000. – Χpẘ Dec 06 '15 at 22:17
  • Oh really @user2460798 ?! Well.. you're right and I'm not the greatest at math (edited, thanks). :) – briantist Dec 07 '15 at 00:55
  • @briantist Exactly, it doesn't magnify, it adds the same overhead for both methods, meaning that the ratio is actually even bigger than you show, had `| Out-Null` not been used :) – Mathias R. Jessen Dec 07 '15 at 13:58
  • @MathiasR.Jessen ah, yeah that's true. I actually didn't expect `[]` to work _that_ fast. – briantist Dec 07 '15 at 17:43
  • I wonder what would be the results if comparing `[ ]` vs using the array's iterator? – Χpẘ Dec 07 '15 at 20:37
  • @user2460798 there's room for another answer comparing various methods (and taking into account the various nullification methods that Mathias mentioned) :-D – briantist Dec 07 '15 at 21:25
  • @briantist guess I should've seen "room for another answer" coming. :-) – Χpẘ Dec 08 '15 at 02:13

To my surprise, using the array's GetEnumerator is faster than indexing; it takes about 5/8 of the time of indexing. However, this test is pretty unrealistic, in that the body of each loop is about as small as it can be.

$size = 64kb

$array = New-Object int[] $size
# Initializing the array takes quite a bit of time compared to the loops below
0..($size-1) | % { $array[$_] = get-random}

write-host `n`nMeasure using indexing
[uint64]$sum = 0
Measure-Command {
  for ($ndx = 0; $ndx -lt $size; $ndx++) {
    $sum += $array[$ndx]
  }
}
write-host Average = ($sum / $size)

write-host `n`nMeasure using array enumerator
[uint64]$sum = 0
Measure-Command {
  foreach ($element in $array.GetEnumerator()) {
    $sum += $element
  }
}
write-host Average = ($sum / $size)



Measure using indexing


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 898
Ticks             : 8987213
TotalDays         : 1.04018668981481E-05
TotalHours        : 0.000249644805555556
TotalMinutes      : 0.0149786883333333
TotalSeconds      : 0.8987213
TotalMilliseconds : 898.7213

Average = 1070386366.9346


Measure using array enumerator
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 559
Ticks             : 5597112
TotalDays         : 6.47813888888889E-06
TotalHours        : 0.000155475333333333
TotalMinutes      : 0.00932852
TotalSeconds      : 0.5597112
TotalMilliseconds : 559.7112

Average = 1070386366.9346

Code for these two loops in assembler might look like this:

;       Using Indexing
mov     esi, <addr of array>
xor     ebx, ebx
lea     edi, <addr of $sum>
loop:
mov     eax, dword ptr [esi][ebx*4]
add     dword ptr [edi], eax
inc     ebx
cmp     ebx, 65536
jl      loop

;       Using enumerator
mov     esi, <addr of array>
lea     edx, [esi + 65536*4]
lea     edi, <addr of $sum>
loop:
mov     eax, dword ptr [esi]
add     dword ptr [edi], eax
add     esi, 4
cmp     esi, edx
jl      loop

The only difference is in the first mov instruction in the loop, with one using an index register and the other not. I kind of doubt that would explain the observed difference in speed. I guess the JITter must add additional overhead.
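
For what it's worth, PowerShell's foreach statement enumerates an array directly, so a minimal variation on the loop above (not part of the measured test) gives the same enumeration without the explicit GetEnumerator() call:

[uint64]$sum = 0
foreach ($element in $array) {
    # foreach enumerates $array itself; no explicit GetEnumerator() needed
    $sum += $element
}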

Χpẘ