
I'm writing a PowerShell script for an assignment, and I'm supposed to replace words in a string using 1, 2, 4 and 8 threads. I'm using Start-Job and Wait-Job for threading. In this code I used just a short string, but I will be doing this with strings of 5,000, 10,000 and 20,000 words. The problem is that with 1 thread it runs in ~700 ms, and the more threads I use, the longer it takes; for example, with 8 threads I get ~1800 ms. I guess there's something wrong with my threading, but I'm a complete amateur, so I don't know what.

$inputString = "crush, deal, story, clap, early, pagan, fan, avian"
$substringToReplace = "crush"
$replacementSubstring = "red"

# number of jobs
$numJobs = 1

# splitting the string into substrings for the jobs
$words = $inputString -split " "
$numWordsPerSubstring = [Math]::round($words.Length / $numJobs)

$substrings = @()

for ($i = 0; $i -lt $numJobs; $i++) {
    $startIndex = $i * $numWordsPerSubstring
    $endIndex = [Math]::Min(($startIndex + $numWordsPerSubstring - 1), ($words.Length - 1))
    $substrings += ($words[$startIndex..$endIndex] -join " ") + " "
}

# scriptblock for jobs
$scriptBlock = {
    param($substring, $substringToReplace, $replacementSubstring)
    $substring -replace $substringToReplace, $replacementSubstring
}

$startTime = [Math]::Round((Get-Date).ToFileTime()/10000)
Write-Host "Start time is $startTime"

# starting each job
$jobs = foreach ($substring in $substrings) {
    #Write-Host "Job started with substring $substring"
    Start-Job -ScriptBlock $scriptBlock -ArgumentList $substring, $substringToReplace, $replacementSubstring
}

# waiting for the jobs to finish
$outputString = ""
foreach ($job in $jobs) {
    #Write-Host "Job $job ended"
    $outputString += Wait-Job $job | Receive-Job
}

$endTime = [Math]::Round((Get-Date).ToFileTime()/10000)
Write-Host "End time is $endTime"

Write-Host "It took $($endTime - $startTime) milliseconds"

Maybe it just takes more time to synchronize more threads? I'm not sure; like I said, I'm a complete amateur in PowerShell.

  • `Start-Job` uses parallelism based on _child processes_, which is both slow and resource-intensive. In recent PowerShell versions, much faster _thread_-based parallelism via `Start-ThreadJob`, from the `ThreadJob` module, is available, especially in _PowerShell (Core) 7+_, which ships with that module - see [this answer](https://stackoverflow.com/a/56612574/45375). – mklement0 Mar 19 '23 at 14:13
  • I assume the input string is in reality thousands of times larger? Otherwise there is no point in multithreading. – Santiago Squarzon Mar 19 '23 at 14:41
  • For reference, multithreading starts to become relevant at `70000000` words for me. – Santiago Squarzon Mar 19 '23 at 15:15
  • The answer depends on the number of cores in your microprocessor. Your code will run faster as you add threads until you exceed the number of cores; then the execution time gains will stop. – jdweng Mar 19 '23 at 19:42
  • @mklement0 Thank you, once I changed to Start-ThreadJob it was much quicker. – Šimon Krížo Mar 20 '23 at 06:29
  • @SantiagoSquarzon I wanted to test that on 5000, 10000, and 20000 words, but I guess with those sizes I wouldn't see much of a difference in the times. – Šimon Krížo Mar 20 '23 at 06:31
  • For so few words it's not worth using multithreading. As demonstrated in my answer, you need a really big dataset for it to become relevant, at least in PowerShell. In C# it might become relevant earlier. – Santiago Squarzon Mar 20 '23 at 13:58

1 Answer


To make these tests relevant you will need a much bigger string than 20k words. To put it into perspective, multithreading becomes relevant when repeating your input string around 2 million times, and even then I wouldn't trust these timings. Also, Start-Job is not the right cmdlet for multithreading, since it runs each job in a separate child process. You want cmdlets that use runspaces instead. Either Start-ThreadJob or ForEach-Object -Parallel are viable options, or you can code it yourself.
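As a rough illustration of the Start-ThreadJob route, here is a minimal sketch. It assumes PowerShell 7+ (or Windows PowerShell with the ThreadJob module installed); the `$threadCount` value and the chunking loop are only illustrative, not a tuned implementation.

```powershell
# Sketch only: assumes the ThreadJob module is available (it ships with PowerShell 7+).
$inputString = 'crush, deal, story, clap, early, pagan, fan, avian'
$words       = $inputString.Split(' ')
$threadCount = 4   # illustrative value

# Ceiling makes sure the chunks together cover every word.
$chunkSize = [int][Math]::Ceiling($words.Length / [double]$threadCount)

# One thread job per chunk of words.
$jobs = for ($start = 0; $start -lt $words.Length; $start += $chunkSize) {
    $end   = [Math]::Min($start + $chunkSize - 1, $words.Length - 1)
    $chunk = $words[$start..$end] -join ' '
    Start-ThreadJob -ScriptBlock {
        param($text, $old, $new)
        $text.Replace($old, $new)
    } -ArgumentList $chunk, 'crush', 'red'
}

# Wait for all jobs, then collect the results in the original chunk order.
$null  = $jobs | Wait-Job
$parts = foreach ($job in $jobs) { Receive-Job -Job $job }
$jobs | Remove-Job

$outputString = $parts -join ' '
$outputString
```

Collecting the results by iterating over `$jobs` keeps the chunks in their original order regardless of which job finishes first.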

# repeat this string 2 million times
$inputString = "crush, deal, story, clap, early, pagan, fan, avian" * 2000000
$words = $inputString.Split(' ')

$substringToReplace = "crush"
$replacementSubstring = "red"

$sb = {
    param($substring, $substringToReplace, $replacementSubstring)
    # No need to use regex based replacement operator here.
    $substring.Replace($substringToReplace, $replacementSubstring)
}

# 1, 2, 4 and 8 threads for this comparison
foreach($thread in 1, 2, 4, 8) {
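    # Ceiling (rather than Round) so the chunks cover every word; Round can drop the last ones.
    $numWordsPerSubstring = [Math]::Ceiling($words.Length / $thread)
    $substrings = for ($i = 0; $i -lt $thread; $i++) {
        $startIndex = $i * $numWordsPerSubstring
        $endIndex = [Math]::Min($startIndex + $numWordsPerSubstring - 1, $words.Length - 1)
        # Parenthesize the join: binary -join binds more loosely than +,
        # so without the parentheses the words would be joined with two spaces.
        Write-Output -NoEnumerate (($words[$startIndex..$endIndex] -join " ") + " ")
    }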

    try {
        $time = Measure-Command {
            $iss = [initialsessionstate]::CreateDefault2()
            $pool = [runspacefactory]::CreateRunspacePool(1, $thread, $iss, $Host)
            $pool.Open()
            $jobs = foreach($substring in $substrings) {
                $ps = [powershell]::Create().AddScript($sb).AddParameters(@{
                    subString = $substring
                    substringToReplace = $substringToReplace
                    replacementSubstring = $replacementSubstring
                })
                $ps.RunspacePool = $pool

                @{ Instance = $ps; Async = $ps.BeginInvoke() }
            }

            $result = [System.Text.StringBuilder]::new()
            $jobs | ForEach-Object { $result.Append($_.Instance.EndInvoke($_.Async)[0]) }
            $result.ToString()
        }

        [pscustomobject]@{
            ThreadCount  = $thread
            Milliseconds = $time.TotalMilliseconds
        }
    }
    finally {
        $jobs.Instance | ForEach-Object Dispose
        $pool.Dispose()
    }
}

Looking at the results of the above tests (listed below), we can see that 8 threads might get the job done quicker, but again, you should not draw conclusions from these tests; they're not accurate. Simply run them multiple times and you will see what I mean. Micro-benchmarking in PowerShell cannot be trusted.

ThreadCount Milliseconds
----------- ------------
          1      8282.92
          2      5140.20
          4      2406.71
          8      1218.62
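To see how much the timings vary between runs, something like the following sketch can help. `Measure-Repeated` is a hypothetical helper written for this illustration, not a built-in cmdlet; it just wraps `Measure-Command` in a loop and summarizes the spread.

```powershell
# Hypothetical helper: runs a script block several times and summarizes the timings.
function Measure-Repeated {
    param(
        [Parameter(Mandatory)] [scriptblock] $ScriptBlock,
        [int] $Count = 10
    )

    # Collect the elapsed milliseconds of each run.
    $timings = foreach ($i in 1..$Count) {
        (Measure-Command -Expression $ScriptBlock).TotalMilliseconds
    }

    $stats = $timings | Measure-Object -Minimum -Maximum -Average

    [pscustomobject]@{
        Runs      = $Count
        MinMs     = $stats.Minimum
        AverageMs = $stats.Average
        MaxMs     = $stats.Maximum
    }
}

# Example: time a plain single-threaded replacement five times.
$text = 'crush, deal, story, clap, early, pagan, fan, avian ' * 100000
Measure-Repeated -Count 5 -ScriptBlock { $null = $text.Replace('crush', 'red') }
```

A large gap between `MinMs` and `MaxMs` is a sign that a single measurement should not be taken at face value.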
Santiago Squarzon