0

I'm a greenhorn in powershell but I have successfully created a script for processing a lot of .csv files. It open csv, look for value on certain line and export that at the finish into new .csv file.

It works but it is very slow (5000 csv takes like 5 hours) so I came up with optimized code - my idea was to read all (5000 csv) files into memory first and then do processing (looking for value and exporting it).

With 10 test .csv files it worked great (11x times faster than original code). But with 5'000 .csv files, it is 6x slower than original code.

I have already increased memory to 32GB for powershell (have 256GB in total) and also increased variable limit from 4096 to 32768 due to the dynamically generated variables.

Do you have any idea why my code works really good with only 10 csv files but is way slower with 5000 csv files?

Original (slow) code was opening csv every time it goes through the loop, the new (fast) code is loading them at once and them in loop, it is just loading csv from memory.

Maybe somebody will give me hint what is generally wrong in my optimization.

my code:

cd e:\temp6
$MaximumVariableCount = 32768
$path = ".\export2"
If(!(test-path $path))
{
      New-Item -ItemType Directory -Force -Path $path
}
(get-date).ToString('T')
$data = Import-Csv -Path .\000000.csv -Delimiter ';' 
$nodes_number = $data.Length
echo ("number of nodes:"+$nodes_number)
$data = $null
$freq_step = 1 
$final_freq = 5000 

$StartTime = $(get-date)
 #loading csv files into RAM to save time
 for ($freq = 0; $freq -le $final_freq; $freq=$freq+$freq_step)
        {
            $filename = $freq.ToString()+'.csv'
            $filename =  $filename.PadLeft(10, '0')
            $data = Import-Csv -Path $filename -Delimiter ';'
            Set-Variable -Name "var$freq" -Value $data -Scope global
            $data = $null
        }

$elapsedTime = $(get-date) - $StartTime
$totalTime = "{0:HH:mm:ss}" -f ([datetime]$elapsedTime.Ticks)
echo ("load time:"+ $totalTime)


#processing csv from RAM (already stored in RAM)
for ($i = 0; $i -lt $nodes_number; $i++) 
{ 

    echo ("finished:"+ [math]::Round(($i/$nodes_number*100),2) + "%")
    $OutArray = @()
    for ($freq = 0; $freq -le $final_freq; $freq=$freq+$freq_step)
        {
            #at this step I was reading from csv from harddrive but using my optimization
            # I'm now reading directly from RAM memory
            $data = Get-Variable -Name "var$freq" -ValueOnly
            $myobj = "" | Select "frequency", "amplitude"
            $myobj.frequency = $freq
            $myobj.amplitude = $data[$i].'Pressure [MPa]'
            $outarray += $myobj
            $myobj = $null

        }


    $exportname = $i.ToString()+'r.csv'
    $outarray | export-csv .\export2\$exportname -NoTypeInformation -Delimiter ';'
    $OutArray = $null


}



$elapsedTime = $(get-date) - $StartTime
$totalTime = "{0:HH:mm:ss}" -f ([datetime]$elapsedTime.Ticks)
echo ("total time ccc:"+ $totalTime)
echo ("total time ccc:"+ $totalTime) | Out-File .\info.txt -Append
Ondrej
  • 35
  • 1
  • 10
  • 1
    You could remove the `echo` from your `for` loop. I'm sure writing to the screen 5000 times isn't helping (It's definitely not why it's going so slow, but it's not helping, that's for sure) – retryW Jun 02 '20 at 06:35
  • 1
    Try to do everything in a pipeline and avoid assigning values to variables as that's expensive in processing terms. @Also don;t use `+=` as that's also expensive - every time oytu do that you're essentially creating a new array with the new value included and deleting the old one. Overall you're treating Powershell like batch scripting and that's never going to be the fastest way to approach it. – Scepticalist Jun 02 '20 at 07:13
  • hello that echo is not a issue - it write out every 5 minutes just one sentence. longest time is consuming that loop (frequency) for 5000 csv and there is no echo. what do you mean with "don't use +="? how should I rewrite that part of the code? with just =? and what do you mean to do everyting in a pipeline like take `$data = Get-Variable -Name "var$freq" -ValueOnly $myobj = "" | Select "frequency", "amplitude" $myobj.frequency = $freq $myobj.amplitude = $data[$i].'Pressure [MPa]' ... ` and connect it with | into one long line? – Ondrej Jun 02 '20 at 09:07
  • 1
    Try to [avoid using the increase assignment operator (`+=`) to create a collection](https://stackoverflow.com/q/60708578/1701026) – iRon Jun 02 '20 at 14:55

0 Answers0