
I am using the PowerShell script below to read and process a 17 MB text file. The input file contains around 200,000 rows and 12 columns. Currently the script takes almost an hour to process the file. How can I optimize the processing time?

Script:

$fields = Get-Content Temp.txt
$results = @()
foreach($i in $fields)
{
     $field   = $i -split '\t' -replace '^\s*|\s*$'
     $field1  = $field[0]
     $field2  = $field[1]
     $field3  = $field[2]
     $field4  = $field[3]
     $field5  = $field[4]
     $field6  = $field[5]
     $field7  = $field[6]
     $field8  = $field[7]
     $field9  = $field[8]
     $field10 = $field[9]
     $field11 = $field[10]
     $field12 = $field[11]

     if ($field1 -eq "4803" -and $field[2].substring(0,2) -eq "60")
     {
         $field2 = "5000000"
     }
     else
     {
         $field2 = $field[1]
     }
     $details = @{       
                Column1  = $field1
                Column2  = $field2
                Column3  = $field3
                Column4  = $field4
                Column5  = $field5
                Column6  = $field6
                Column7  = $field7
                Column8  = $field8
                Column9  = $field9
                Column10 = $field10
                Column11 = $field11
                Column12 = $field12
            }
    $results += New-Object PSObject -Property $details 
    }
$results | ForEach-Object { '{0}    {1} ... {11}' -f $_.Column1,$_.Column2, ... ,$_.Column12 } | Set-Content -Path Temp.txt
[Environment]::Exit(0)
AravindhK
  • Well, why don't you start with importing this as CSV? Also, the `else` condition is redundant. Can you supply a proper file for testing? You can replace all the information with something else. – 4c74356b41 Dec 02 '16 at 17:09
  • I have tested with Import-Csv also and I am facing the same issue: it still runs for 1 hour, yet it takes only 20 seconds to process 10,000 records. – AravindhK Dec 02 '16 at 17:13
  • Duplicate of [https://social.technet.microsoft.com/Forums/scriptcenter/en-US/c5bdc740-837a-43f6-97d5-b0f0d5bf22bc/](https://social.technet.microsoft.com/Forums/scriptcenter/en-US/c5bdc740-837a-43f6-97d5-b0f0d5bf22bc/). StackOverflow isn't a free code redesign service either. – Bill_Stewart Dec 02 '16 at 17:18
  • In addition to the suggestion of using `Import-Csv`, you should also look at what it means to use `Get-Content`. The most obvious way is not always the best. https://powershell.org/2013/10/21/why-get-content-aint-yer-friend/ – lit Dec 02 '16 at 18:07

2 Answers


Unless I'm missing something here, the goal is to take in tab-delimited data, modify one field based on another, and then output the result as CSV, correct? If so, this one-liner should execute MUCH faster.

Import-Csv test.txt -Header @(1..12) -Delimiter "`t" | % {if(($($_.1) -eq "4803") -and($($_.3).substring(0,2) -eq "60")){$_.2 = "5000000"};$_} | Export-Csv test2.csv -NoTypeInformation

It avoids all the weird string parsing and gets around the biggest problem, which is

$results += New-Object PSObject -Property $details

That line copies your entire results array into a new array for every line of the input file, which is terrible for performance. The rest of the changes just make things slightly faster.
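
If you wanted to keep an explicit loop instead, the per-row copy can be avoided either by letting the loop itself emit the objects or by appending to a resizable collection such as an ArrayList. A simplified sketch (only two of the twelve columns shown, and assuming `$fields` holds the lines of the file as in the question):

# Option 1: capture the loop's output directly; PowerShell collects it into an array once
$results = foreach ($line in $fields) {
    $cols = $line -split "`t" -replace '^\s*|\s*$'
    New-Object PSObject -Property @{ Column1 = $cols[0]; Column2 = $cols[1] }
}

# Option 2: append to an ArrayList, which grows in place instead of copying the whole array
$results = New-Object System.Collections.ArrayList
foreach ($line in $fields) {
    $cols = $line -split "`t" -replace '^\s*|\s*$'
    [void]$results.Add((New-Object PSObject -Property @{ Column1 = $cols[0]; Column2 = $cols[1] }))
}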

Mike Garuccio
  • I wonder if simply changing `$results` to `Arraylist` would improve performance a lot – 4c74356b41 Dec 02 '16 at 17:32
  • It might, but AFAIK it would still require the copy for each row of the file (I have not tested this and could be wrong). You could improve the performance by setting $results = *the foreach loop* and then just getting rid of the $results += inside the loop, which would assign the whole thing without doing any copies, but it's still a whole lot of code to accomplish what a few lines manage without trouble. – Mike Garuccio Dec 02 '16 at 17:36
  • One of the big benefits of this suggestion is that it pipelines the data instead of requiring the entire large file to be read in at one time. – lit Dec 02 '16 at 18:11
  • Where is the `-replace '^\s*|\s*$'`? – Esperento57 Dec 02 '16 at 18:40
  • All that regex does is remove whitespace, which there's no need for when you use Import-Csv with the tab delimiter. If the input file has additional whitespace for some reason, you could add `%{$_ -replace '^\s*|\s*$'}` as the second element of the pipeline. – Mike Garuccio Dec 02 '16 at 18:48

If this were me, I would start to think about not using Get-Content if your files are going to get much bigger. Memory consumption will become an issue, and Get-Content won't scale well with really big files because it pulls everything into memory. And remember, it will take more memory than the size of the file, because it has to represent each line as an object (which is still smaller than an XML DOM, but regardless, it takes memory).

So first of all, you could loop through the input file using a stream reader; I have an example here: https://stackoverflow.com/a/32337282/380016

You can also write your output file with a stream writer, instead of accumulating a big array of objects as you are, only to loop through it and write it to a file at the end.

In the while loop of my example you can still split the string as you are now, do your manipulations, and then write the row out. There is no need to accumulate everything and wait to write it all at the end.

This approach should be faster and should use hardly any memory.
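
To make that concrete, here is a minimal sketch of the reader/writer loop, assuming the same tab-delimited layout and the replacement rule from the question (the file paths are placeholders, and this is not the exact code from the linked answer):

$reader = New-Object System.IO.StreamReader("C:\Temp\Temp.txt")    # full paths: .NET doesn't track PowerShell's current location
$writer = New-Object System.IO.StreamWriter("C:\Temp\Temp_out.txt")
try {
    while (($line = $reader.ReadLine()) -ne $null) {
        $field = $line -split "`t" -replace '^\s*|\s*$'
        if ($field[0] -eq "4803" -and $field[2].Substring(0,2) -eq "60") {
            $field[1] = "5000000"                                   # same rule as in the question
        }
        $writer.WriteLine($field -join "`t")                        # write each row out immediately
    }
}
finally {
    $reader.Close()
    $writer.Close()
}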

campbell.rw