
I'm working with some multi-gigabyte text files and want to do some stream processing on them using PowerShell. It's simple stuff, just parsing each line and pulling out some data, then storing it in a database.

Unfortunately, `get-content | %{ whatever($_) }` appears to keep the entire set of lines at this stage of the pipe in memory. It's also surprisingly slow, taking a very long time to actually read it all in.

So my question is two parts:

  1. How can I make it process the stream line by line and not keep the entire thing buffered in memory? I would like to avoid using up several gigs of RAM for this purpose.
  2. How can I make it run faster? PowerShell iterating over a get-content appears to be 100x slower than a C# script.

I'm hoping there's something dumb I'm doing here, like missing a -LineBufferSize parameter or something...
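
For concreteness, this is roughly the shape of what I'm doing; the file name, the tab-split parsing, and the database call below are just stand-ins for my real logic:

get-content .\huge-file.txt | %{
    # stand-in parsing: split the line and keep the fields I care about
    $fields = $_ -split "`t"
    # stand-in for the real database insert
    Add-RecordToDatabase $fields
}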

scobi
  • To speed `get-content` up, set -ReadCount to 512. Note that at this point, $_ in the Foreach will be an array of strings (see the sketch after these comments). – Keith Hill Nov 16 '10 at 14:42
  • Still, I'd go with Roman's suggestion of using the .NET reader - much faster. – Keith Hill Nov 16 '10 at 16:53
  • Out of curiosity, what happens if I don't care about speed, but just memory? Most likely I will go with the .NET reader suggestion, but I'm also interested to know how to keep it from buffering the entire pipe in memory. – scobi Nov 16 '10 at 21:29
  • To minimize buffering avoid assigning the result of `Get-Content` to a variable as that will load the entire file into memory. By default, in a pipeline, `Get-Content` processes the file one line at a time. As long as you aren't accumulating the results or using a cmdlet which internally accumulates (like Sort-Object and Group-Object) then the memory hit shouldn't be too bad. Foreach-Object (%) is a safe way to process each line, one at a time. – Keith Hill Nov 16 '10 at 23:52
  • Forget the buffering, it's more to do with the Foreach-Object/% block defaulting to using -End if no property is given. Try `get-content | % -Process { whatever($_) }` if you want it to execute on each line as they come in. – dwarfsoft Mar 12 '15 at 23:06
  • @dwarfsoft that doesn't make any sense. The -End block only runs once after all the processing is done. You can see that if you try to use `get-content | % -End { }` then it complains because you haven't provided a process block. So it can't be using -End by default, it must be using -Process by default. And try `1..5 | % -process { } -end { 'q' }` and see that the end block only happens once, the usual `gc | % { $_ }` wouldn't work if the scriptblock defaulted to being -End... – TessellatingHeckler Apr 21 '17 at 17:22
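
A minimal sketch of the -ReadCount batching Keith Hill suggests above (the file name and the per-line work are placeholders; note the inner loop, because $_ is now an array of lines):

get-content .\huge-file.txt -ReadCount 512 | %{
    # with -ReadCount 512, $_ is an array of up to 512 lines
    foreach ($line in $_) {
        whatever($line)   # placeholder for the real per-line processing
    }
}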

4 Answers


If you are really about to work on multi-gigabyte text files, then do not use PowerShell. Even if you find a way to read them faster, processing a huge number of lines will be slow in PowerShell anyway, and you cannot avoid this. Even simple loops are expensive; say, for 10 million iterations (quite realistic in your case) we have:

# "empty" loop: takes 10 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) {} }

# "simple" job, just output: takes 20 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) { $i } }

# "more real job": 107 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) { $i.ToString() -match '1' } }

UPDATE: If you are still not scared, then try using the .NET reader:

$reader = [System.IO.File]::OpenText("my.log")
try {
    # for() with no condition is an infinite loop; we break below when ReadLine returns $null
    for() {
        $line = $reader.ReadLine()
        if ($line -eq $null) { break }
        # process the line
        $line
    }
}
finally {
    $reader.Close()
}

UPDATE 2

There are comments about possibly better / shorter code. There is nothing wrong with the original code using `for`, and it is not pseudo-code. But the shorter (shortest?) variant of the reading loop is:

$reader = [System.IO.File]::OpenText("my.log")
while($null -ne ($line = $reader.ReadLine())) {
    $line
}
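
One caveat with the shorter variant: unlike the first snippet, it never closes the reader. If the file handle matters, you can keep the try/finally around it (a combined sketch, same file name as above):

$reader = [System.IO.File]::OpenText("my.log")
try {
    while($null -ne ($line = $reader.ReadLine())) {
        $line
    }
}
finally {
    $reader.Close()
}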
Roman Kuzmin
  • FYI, script compilation in PowerShell V3 improves the situation a bit. The "real job" loop went from 117 seconds on V2 to 62 seconds on V3 typed at the console. When I put the loop into a script and measured script execution on V3, it drops to 34 seconds. – Keith Hill May 28 '12 at 16:05
  • I put all three tests in a script and got these results: V3 Beta: 20/27/83 seconds; V2: 14/21/101. It looks like in my experiment V3 is faster in the test 3 but it is quite slower in the first two. Well, it’s Beta, hopefully performance will be improved in RTM. – Roman Kuzmin May 28 '12 at 16:34
  • why do people insist on using a break in a loop like that. Why not use a loop that does not require it, and reads better such as replacing the for loop with `do { $line = $reader.ReadLine(); $line } while ($line -neq $null)` – BeowulfNode42 Apr 28 '14 at 02:34
  • oops that's supposed to be -ne for not equal. That particular do..while loop has the problem that the null at the end of the file will be processed (in this case output). To work around that too you could have `for ( $line = $reader.ReadLine(); $line -ne $null; $line = $reader.ReadLine() ) { $line }` – BeowulfNode42 Apr 28 '14 at 02:51
  • @BeowulfNode42, we can do this even shorter: `while($null -ne ($line = $read.ReadLine())) {$line}`. But the topic is not really about such things. – Roman Kuzmin Mar 20 '15 at 13:11
  • @RomanKuzmin +1 that while-loop snippet you commented, it's easy to understand and would make a nice answer. However your actual answer with the `for(;;)` leaves me puzzled, is it pseudo-code or actually legit powershell syntax? Thanks a bunch if you'd like to elaborate a bit. – T_D Apr 12 '16 at 12:07
  • @T_D, see UPDATE 2 – Roman Kuzmin Apr 12 '16 at 12:53
  • @RomanKuzmin ah now I see, the `for(;;)` or `for()` is just some infinite loop where you break out, just like with `while(1 -eq 1)` etc.. Yeah, I normally never use such unsemantic code but I don't hate on those that do ^^ – T_D Apr 12 '16 at 13:04
  • wtf does `for()` do? – Kellen Stuart Nov 06 '17 at 22:09
  • `for()` means an infinite loop – Roman Kuzmin Nov 07 '17 at 11:40
  • `while($null -ne ($line = $read.ReadLine())) {$line}` Doesn't this cause a premature exit from the while if it encounters an empty line in the file? – lightwing Mar 08 '18 at 19:26
  • I tested and it doesn't. I don't know enough about powershell (or .net rather, I guess) to understand why. – lightwing Mar 08 '18 at 19:29
  • An empty line is not equal to null. `$null -eq ''` gets false. – Roman Kuzmin Mar 09 '18 at 04:21
  • For being "Power"shell, sure it's hard to get something that in bash you can do with a single line... – Stefano Borini Oct 01 '18 at 11:28

`System.IO.File.ReadLines()` is perfect for this scenario. It returns all the lines of a file, but lets you begin iterating over the lines immediately, which means it does not have to store the entire contents in memory.

Requires .NET 4.0 or higher.

foreach ($line in [System.IO.File]::ReadLines($filename)) {
    # do something with $line
}

http://msdn.microsoft.com/en-us/library/dd383503.aspx
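
Because ReadLines() returns a lazy enumerable, it also composes with the pipeline if you prefer that style (a sketch using the same $filename):

[System.IO.File]::ReadLines($filename) | ForEach-Object {
    # each $_ is one line, streamed rather than buffered
    $_
}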

Despertar

If you want to use straight PowerShell, check out the code below.

$content = Get-Content C:\Users\You\Documents\test.txt
foreach ($line in $content)
{
    Write-Host $line
}
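
Note that assigning the output of Get-Content to a variable first pulls the whole file into memory (see Keith Hill's comment on the question); for large files, a hedged streaming variant of the same idea is:

# pipe Get-Content straight into ForEach-Object so only one line is in flight at a time
Get-Content C:\Users\You\Documents\test.txt | ForEach-Object {
    Write-Host $_
}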

For those interested...

A bit of perspective on this, since I had to work with very large files.

Below are the results on a 39 GB XML file containing 56 million lines/records. The lookup text is a 10-digit number.

1) GC -rc 1000 | % -match -> 183 seconds
2) GC -rc 100 | % -match  -> 182 seconds
3) GC -rc 1000 | % -like  -> 840 seconds
4) GC -rc 100 | % -like   -> 840 seconds
5) sls -simple            -> 730 seconds
6) sls                    -> 180 seconds (sls default uses regex, but pattern in my case is passed as literal text)
7) Switch -file -regex    -> 258 seconds
8) IO.File.Readline       -> 250 seconds

Options 1 and 6 are the clear winners, but I have gone with option 1.
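
For reference, roughly what options 1 and 6 look like; the file name and the 10-digit lookup value below are placeholders:

$pattern = '1234567890'   # placeholder for the 10-digit lookup text

# Option 1: Get-Content with -ReadCount batching; against an array of lines,
# -match acts as a filter and emits only the lines that match
Get-Content .\big.xml -ReadCount 1000 | ForEach-Object { $_ -match $pattern }

# Option 6: Select-String (regex by default; a plain digit string matches literally)
Select-String -Path .\big.xml -Pattern $pattern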

P.S. The tests were conducted on a Windows Server 2012 R2 server with PS 5.1. The server has 16 vCPUs and 64 GB of memory, but for this test only one CPU was utilised, and the PowerShell process's memory footprint was minimal, since the tests above use very little memory.

Steve