
I have the following PowerShell script that parses some very large files for ETL purposes. For starters, my test file is ~30 MB. Larger files around 200 MB are expected. So I have a few questions.

The script below works, but it takes a very long time to process even a 30 MB file.

PowerShell Script:

$path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
$infile = "14SEP11_ProdOrderOperations.txt"
$outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"
$array = @()

$content = gc $path\$infile |
    select -skip 4 |
    where {$_ -match "[|].*[|].*"} |
    foreach {$_ -replace "^[|]","" -replace "[|]$",""}

$header = $content[0]

$array = @($content[0])
for ($i = 1; $i -lt $content.length; $i++) {
    if ($content[$i] -ne $content[0]) {$array += $content[$i]}
}

$array | out-file $path\$outfile -encoding ASCII

DataFile Excerpt:

---------------------------
|Data statistics|Number of|
|-------------------------|
|Records passed |   93,118|
---------------------------
02/14/2012                                                                                                                                                           Production Operations and Confirmations                                                                                                                                                              2
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Production Operations and Confirmations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|ProductionOrderNumber|MaterialNumber                       |ModifiedDate|Plant|OperationRoutingNumber|WorkCenter|OperationStatus|IsActive|     WbsElement|SequenceNumber|OperationNumber|OperationDescription                    |OperationQty|ConfirmedYieldQty|StandardValueLabor|ActualDirectLaborHrs|ActualContractorLaborHrs|ActualOvertimeLaborHrs|ConfirmationNumber|
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|180849518            |011255486L1                          |02/08/2012  |2101 |            9901123118|56B30     |I9902          |        |SOC10MA2302SOCJ31|              |0140           |Operation 1                             |          1 |               0 |              0.0 |                    |                499.990 |                      |        9908651250|
|180849518            |011255486L1                          |02/08/2012  |2101 |            9901123118|56B30     |I9902          |        |SOC10MA2302SOCJ31|14            |9916           |Operation 2                             |          1 |               0 |            499.0 |                    |                        |                      |        9908532289|
|181993564            |011255486L1                          |02/09/2012  |2101 |            9901288820|56B30     |I9902          |        |SOC10MD2302SOCJ31|14            |9916           |Operation 1                             |          1 |               0 |            499.0 |                    |                399.599 |                      |        9908498544|
|180885825            |011255486L1                          |02/08/2012  |2101 |            9901162239|56B30     |I9902          |        |SOC10MG2302SOCJ31|              |0150           |Operation 3                             |          1 |               0 |              0.0 |                    |                882.499 |                      |        9908099659|
|180885825            |011255486L1                          |02/08/2012  |2101 |            9901162239|56B30     |I9902          |        |SOC10MG2302SOCJ31|14            |9916           |Operation 4                             |          1 |               0 |            544.0 |                    |                        |                      |        9908858514|
|181638583            |990104460I0                          |02/10/2012  |2101 |            9902123289|56G99     |I9902          |        |SOC11MAR105SOCJ31|              |0160           |Operation 5                             |          1 |               0 |          1,160.0 |                    |                        |                      |        9914295010|
|181681218            |990104460B0                          |02/08/2012  |2101 |            9902180981|56G99     |I9902          |        |SOC11MAR328SOCJ31|0             |9910           |Operation 6                             |          1 |               0 |            916.0 |                    |                        |                      |        9914621885|
|181681036            |990104460I0                          |02/09/2012  |2101 |            9902180289|56G99     |I9902          |        |SOC11MAR108SOCJ31|              |0180           |Operation 8                             |          1 |               0 |              1.0 |                    |                        |                      |        9914619196|
|189938054            |011255486A2                          |02/10/2012  |2101 |            9999206805|5AD99     |I9902          |        |RS08MJ2305SOCJ31 |              |0599           |Operation 8                             |          1 |               0 |              0.0 |                    |                        |                      |        9901316289|
|181919894            |012984532A3                          |02/10/2012  |2101 |            9902511433|A199399Z  |I9902          |        |SOC12MCB101SOCJ31|0             |9935           |Operation 9                             |          1 |               0 |              0.5 |                    |                        |                      |        9916914233|
|181919894            |012984532A3                          |02/10/2012  |2101 |            9902511433|A199399Z  |I9902          |        |SOC12MCB101SOCJ31|22            |9951           |Operation 10                            |          1 |               0 |           68.080 |                    |                        |                      |        9916914224|
asked by shawno (edited by Peter Mortensen)
  • A search for "Get-Content large files" was very helpful. See http://rkeithhill.wordpress.com/2007/06/17/optimizing-performance-of-get-content-for-large-files/. – Jay Bazuzi Feb 25 '12 at 00:06

3 Answers


Your script reads one line at a time (slow!) and stores almost the entire file in memory (big!).

Try this (not tested extensively):

$path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
$infile = "14SEP11_ProdOrderOperations.txt"
$outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"

$batch = 1000

[regex]$match_regex = '^\|.+\|.+\|.+'
[regex]$replace_regex = '^\|(.+)\|$'

$header_line = (Select-String -Path $path\$infile -Pattern $match_regex -list).line

[regex]$header_regex = [regex]::escape($header_line)

$header_line.trim('|') | Set-Content $path\$outfile

Get-Content $path\$infile -ReadCount $batch |
    ForEach {
        $_ -match $match_regex -notmatch $header_regex -replace $replace_regex, '$1' |
            Out-File $path\$outfile -Encoding ASCII -Append
    }

That's a compromise between memory usage and speed. The `-match` and `-replace` operators work on an array, so you can filter and replace an entire array at once without having to foreach through every record. `-ReadCount` causes the file to be read in chunks of $batch records, so you're basically reading in 1000 records at a time, doing the match and replace on that batch, then appending the result to your output file before going back for the next 1000 records. Increasing $batch should speed it up, but it will use more memory. Adjust that to suit your resources.
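As a quick, hypothetical sketch of the array semantics this relies on (the sample strings are made up, not from the real file):

```powershell
# -match on an array acts as a filter; -replace transforms every element.
$lines = '|a|b|', 'garbage', '|c|d|'
$kept = $lines -match '^\|.+\|.+'               # filters to '|a|b|', '|c|d|'
$trimmed = $kept -replace '^\|(.+)\|$', '$1'    # strips the outer pipes
$trimmed
```

No explicit loop is needed; both operators iterate the array internally, which is what makes the batched approach cheap.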

answered by mjolinor (edited by Peter Mortensen)
  • Thanks. I looked at it again, and did a little tuning. Not sure how much that will help the performance, but if I'm reading the docs right, compiled regexes are faster, and casting the variable as [regex] causes them to be compiled. – mjolinor Feb 25 '12 at 03:25
  • This is good, but I also need to compare every element of the array with the first element (header) and remove it if there is a match - because it is repeated throughout the file. This seems to be where things really bog down in the script. With your approach how do I do this without assigning gC to a variable? – shawno Feb 25 '12 at 14:30
  • Do you need to include one copy of the header at the beginning of the output file? – mjolinor Feb 25 '12 at 14:35
  • I updated the script. It should grab the header line from the file, trim off the leading and trailing |, and start a new $outfile using that for the header before the loop starts. Using [regex]::escape, it creates a new regex that does a literal match to the header line, and then a -notmatch on that added to the -match -replace chain should drop those out inside the loop. – mjolinor Feb 25 '12 at 15:06
  • @mjolinor, this works in pieces -- the header piece by itself and the body piece by itself (without the -Append), but when you run it all together then nul characters or spaces are inserted between every character in the file (within the body piece). Something to do with how the "-append" works? – shawno Feb 25 '12 at 19:39
  • @mjolinor -- your version of the query made a HUGE difference. From 10 mins to 18 seconds. Amazing. The solution to the nul character problem can be found here: http://stackoverflow.com/questions/3806305/powershell-2-0-generates-nulls-between-characters, but basically the issue is with the default encoding for PowerShell. The solution was to also include -encoding ASCII next to the -Append. – shawno Feb 25 '12 at 20:02
  • @mjolinor: casting to [regex] doesn't compile the regex, but it does construct the Regex object once. It's difficult to speculate about performance, but in this scenario the disk IO is likely to dominate. Therefore, I suspect any casting vs. not casting vs. compiling the regex to have no noticeable effect. – Jay Bazuzi Feb 26 '12 at 01:23
  • Would you have to explicitly set the regex option to make it a compiled regex? – mjolinor Feb 26 '12 at 06:16

The Get-Content cmdlet does not perform as well as a StreamReader when dealing with very large files. You can read a file line by line using a StreamReader like this:

$path = 'C:\A-Very-Large-File.txt'
$r = [IO.File]::OpenText($path)
while ($r.Peek() -ge 0) {
    $line = $r.ReadLine()
    # Process $line here...
}
$r.Dispose()
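Applied to the question's script, the same line-by-line pattern might look like this (a sketch using the original regexes, skip count, and paths, not tested against the real file):

```powershell
# Sketch: stream the filter line by line so no more than one line is held in
# memory at a time. The filter and trim regexes follow the original script.
$path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
$infile = "14SEP11_ProdOrderOperations.txt"
$outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"

$reader = [IO.File]::OpenText("$path\$infile")
$writer = [IO.File]::CreateText("$path\$outfile")
$lineNum = 0
$header  = $null
while ($reader.Peek() -ge 0) {
    $line = $reader.ReadLine()
    $lineNum++
    if ($lineNum -le 4) { continue }                 # skip the preamble
    if ($line -notmatch "[|].*[|].*") { continue }   # keep only piped rows
    if ($null -eq $header) { $header = $line }       # first kept row is the header
    elseif ($line -eq $header) { continue }          # drop repeated header rows
    $writer.WriteLine(($line -replace "^[|]","" -replace "[|]$",""))
}
$reader.Dispose()
$writer.Dispose()
```

The writer is also a stream, so the output never has to accumulate in a variable either.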

Some performance comparisons:

Measure-Command {Get-Content .\512MB.txt > $null}

Total Seconds: 49.4742533

Measure-Command {
    $r = [IO.File]::OpenText('512MB.txt')
    while ($r.Peek() -ge 0) {
        $r.ReadLine() > $null
    }
    $r.Dispose()
}

Total Seconds: 27.666803

answered by Andy Arismendi
  • That is a great suggestion. I should also point out that the get-content part of the script works pretty quickly; it's the part where I loop through everything in the array, comparing every element to the first element, that seems to make it run extremely slow. – shawno Feb 24 '12 at 23:45
  • `Get-Content` does **not** load the whole file into memory, but assigning the result to a variable does. This is an important design principle in PowerShell - streaming through a pipeline can be memory efficient even for very large data sets. – Jay Bazuzi Feb 25 '12 at 00:07
  • If the comparison is choking, I wonder if the PC is having to page data in and out as it checks element x against element 0 because it can't fit the whole array of strings in memory -- if so, $content[0] may not be a constant-time operation even though on a small dataset it could be assumed to be. Maybe try copying the contents of $content[0] into a new variable and compare against that instead of the array element. – Daniel Richnak Feb 25 '12 at 00:09
  • @JayBazuzi Fair point. Typically a variable is assigned to the output of Get-Content. I updated my answer based on your input. Thanks. – Andy Arismendi Feb 25 '12 at 00:59
  • @daniel richnak, thanks for that comment - I hadn't considered that regarding the element 0. – shawno Feb 25 '12 at 14:35

This is almost a non-answer... I love PowerShell... but I will not use it to parse log files, especially large log files. Use Microsoft's Log Parser.

C:\>type input.txt | logparser "select substr(field1,1) from STDIN" -i:TSV -nskiplines:14 -headerrow:off -iseparator:spaces -o:tsv -headers:off -stats:off
answered by Dexter Legaspi
  • This is a good non-answer. PowerShell can be painfully slow with huge files. Coming from a Linux background you expect to pipe stuff through grep fairly efficiently, but Get-Content and Select-String just don't compete. I love PowerShell too, but not when parsing and filtering massive log files. – ben Jan 31 '14 at 11:54
  • As of 2023, this is not true anymore. You can use something like Get-Content input.txt -ReadCount 1000 | ConvertFrom-Csv. The real difference is buffering the pipeline; remember PowerShell works on objects (not bytes). Imagine calling a function for every line of a text file, then compare that with passing an array of 1000 lines. – Luiz Felipe May 05 '23 at 16:16