2

I need to delete the first couple of lines of a .txt file in PowerShell. There are plenty of questions and answers on SO already about how to do it. Most of them copy the whole file content into memory, cut out the first x lines and then save the content back to the text file. However, in my case the text files are huge (500 MB+), so loading them completely into memory just to delete the first couple of lines takes a very long time and feels like a huge waste of resources.

Is there a more elegant approach? If you only want to read the first x lines, you can use

Get-Content in.csv -Head 10

which only reads the first 10 lines. Is there something similar for deletion?

JoeDoe8877
  • 1
    Read the file in segments of, say, 10000 rows at a time into a buffer and flush to disk. – vonPryz Apr 12 '22 at 11:44
  • Does `Get-Content in.csv |Select -Skip 10 |Set-Content out.csv` work for you (you might want to specify the output `-Encoding`)? – Mathias R. Jessen Apr 12 '22 at 11:45
  • @MathiasR.Jessen it is slow as well, because the Get-Content operation takes a lot of time (it reads the whole file) – JoeDoe8877 Apr 12 '22 at 11:57
  • 2
    @JoeDoe8877 To my knowledge there's no way for NTFS to facilitate "truncating from the beginning", so your only bet is to copy the contents after the block you want to skip to a new file, and then delete the original file. Do you know the encoding of these files? If they're all ASCII/UTF7-encoded, then there's a fairly trivial way to facilitate the copy much faster than `Get-Content`... – Mathias R. Jessen Apr 12 '22 at 12:08
  • 2
    I think you can use `StreamReader` and `StreamWriter` for this though its important to know the encoding of your file – Santiago Squarzon Apr 12 '22 at 12:53

4 Answers

4

Here is another way to do it, using StreamReader and StreamWriter. As noted in the comments, it's important to know the encoding of your file for this use case.

See the Remarks section of the official documentation:

The StreamReader object attempts to detect the encoding by looking at the first four bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, big-endian Unicode, little-endian UTF-32, and big-endian UTF-32 text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.

If you need to specify an encoding, you can use the StreamReader(String, Encoding) constructor. For example:

$reader = [System.IO.StreamReader]::new('path\to\input.csv', [System.Text.Encoding]::UTF8)

As noted in the Remarks above, this might not be needed for common encodings.

An alternative to the code below, as Brice points out in his comment, would be to call $reader.ReadToEnd() after skipping the first 10 lines; that reads the entire remaining contents of the file into memory before writing it to the new file. I haven't used that method in this answer because mklement0's helpful answer provides a very similar solution, and this answer is intended to be a memory-friendly one.
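
For reference, a minimal sketch of that ReadToEnd() variant might look like this (same placeholder paths as the main code below; skip the lines first, then read the remainder in one go):

try {
    $reader = [System.IO.StreamReader]::new('absolute\path\to\input.csv')
    $writer = [System.IO.StreamWriter]::new('absolute\path\to\output.csv')

    # skip the first 10 lines, then read everything that remains into memory at once
    foreach($i in 1..10) {
        $null = $reader.ReadLine()
    }
    $writer.Write($reader.ReadToEnd())
}
finally {
    ($reader, $writer).foreach('Dispose')
}

The memory-friendly, line-by-line version used in this answer follows: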

try {
    $reader = [System.IO.StreamReader]::new('absolute\path\to\input.csv')
    $writer = [System.IO.StreamWriter]::new('absolute\path\to\output.csv')

    # skip 10 lines
    foreach($i in 1..10) {
        $null = $reader.ReadLine()
    }

    while(-not $reader.EndOfStream) {
        $writer.WriteLine($reader.ReadLine())
    }
}
finally {
    ($reader, $writer).foreach('Dispose')
}

It's also worth noting zett42's helpful comment: using the $reader.ReadBlock(Char[], Int32, Int32) method together with $writer.Write(..) instead of $writer.WriteLine(..) could be an even faster, and still memory-friendly, alternative that reads and writes in chunks.
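
A rough sketch of that chunked variant (the 64 KB buffer size is only an assumption for illustration; tune it as needed):

try {
    $reader = [System.IO.StreamReader]::new('absolute\path\to\input.csv')
    $writer = [System.IO.StreamWriter]::new('absolute\path\to\output.csv')

    # skip 10 lines, as before
    foreach($i in 1..10) {
        $null = $reader.ReadLine()
    }

    # copy the rest in 64 KB chunks of characters instead of line by line
    $buffer = [char[]]::new(64KB)
    while(($read = $reader.ReadBlock($buffer, 0, $buffer.Length)) -gt 0) {
        $writer.Write($buffer, 0, $read)
    }
}
finally {
    ($reader, $writer).foreach('Dispose')
}

A nice side effect is that copying raw characters this way should also preserve the input's original newline format, which the WriteLine loop above does not guarantee.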

Santiago Squarzon
  • 1
    Why not use $writer.Write($reader.ReadToEnd()) instead of your 2nd loop? Looks faster to me – Brice Apr 12 '22 at 14:09
  • 1
    @Brice because that would be almost the same as mklement0's answer. This answer attempts to be a memory-friendly solution rather than reading the file as a whole into memory – Santiago Squarzon Apr 12 '22 at 14:12
  • 2
    For the 2nd loop I believe `StreamReader.ReadBlock()` / `StreamWriter.Write()` with a sufficiently large block size (say 64k) could be faster and still be memory friendly. – zett42 Apr 12 '22 at 14:24
  • 1
    That makes sense. Please correct me if I'm wrong: your solution automatically takes care of common encoding types, and with the ReadToEnd method it looks like I got better results than mklement0's answer, both in memory usage and speed. – Brice Apr 12 '22 at 14:26
  • @Brice, this solution is the fastest (line-by-line) _memory-friendly_ solution in PowerShell. The `Get-Content -Raw` solution in my answer is much faster, but requires reading the whole file into memory at once (it also avoids the newline-format problem). – mklement0 Apr 12 '22 at 14:39
  • 1
    @mklement0, as stated in my 2nd comment, I got better speed results (around 2 times) with the combination of Santiago's answer + ReadToEnd vs Get-Content -Raw on my laptop. PS 5.1, W10, text file 1GB. In any case, thanks to both of you for such detailed answers. – Brice Apr 12 '22 at 14:44
  • Wow lots of comments lol. @zett42 thanks, that's a great alternative (I have added the method to the answer for the OP to explore as homework) – Santiago Squarzon Apr 12 '22 at 15:02
  • @Brice I haven't personally tested the efficiency vs `Get-Content` tho, I believe if using `-Raw` vs `.ReadToEnd()` the speed should be almost the same. I have added your comment to the answer and explained why I haven't gone much into detail with it. Thanks for the comments – Santiago Squarzon Apr 12 '22 at 15:06
  • 1
    @Brice, good point: I missed the hybrid approach with the `.ReadToEnd()` angle. Like Santiago, however, I expected that to perform about the same as the `Get-Content -Raw` solution, and in PowerShell _Core_ (tested on 7.2.2) `Get-Content -Raw` is even _faster_, but, curiously, in _Windows PowerShell_, as you've observed, not only is `Get-Content -Raw` _slower_, but significantly so - I don't know why. – mklement0 Apr 12 '22 at 15:55
  • 1
    @mklement0 my guess is my beloved .NET Core vs awful .NET Framework :) – Santiago Squarzon Apr 12 '22 at 15:56
  • 1
    @Santiago :) You can't fully dismiss the old-timer, however: in terms of absolute performance, the faster WinPS solution beats the faster PS Core solution. Also, I suspect that (Windows) PowerShell is to blame in this case, though I'm unclear on what, specifically, causes the slow-down (it isn't the regex operation). – mklement0 Apr 12 '22 at 16:06
  • 1
    @mklement0 I was joking hehe have a bias for .NET Core over Framework (same applies for PS Core over WinPS) :) – Santiago Squarzon Apr 12 '22 at 16:08
  • 2
    @Brice The bottom line with respect to choosing between the two read-file-in-full solutions: When performance is paramount, use the `Get-Content -Raw` solution in PowerShell _Core_, and your .NET API `.ReadToEnd()` solution in _Windows PowerShell_. If you need to ensure that the newline format is preserved, you must use the `Get-Content -Raw` solution. Conversely, in Windows PowerShell you must use the .NET API solution if you're dealing with character encodings not supported by the `-Encoding` parameter (in PowerShell _Core_, `-Encoding` supports _all_ available .NET encodings). – mklement0 Apr 12 '22 at 16:10
  • @mklement0 did you test performance with `[System.IO.File]::ReadAllText(..)` vs `-Raw`? Tho I believe it's a wrapper of `StreamReader` isn't it? – Santiago Squarzon Apr 12 '22 at 16:14
  • 1
    Good question, @Santiago: I've only tested on macOS, but to my surprise `ReadAllText()` is not only slower than `Get-Content -Raw`, but also slower than the `.ReadLine()` 10 x + `.ReadToEnd()` solution. Here's the code: `try { $writer = [System.IO.StreamWriter]::new("$pwd/out.csv"); $writer.Write( [System.IO.File]::ReadAllText("$pwd/in.csv") -replace '^(?:.*\r?\n){10}' ) } finally { $writer.Dispose() } ` – mklement0 Apr 12 '22 at 16:34
  • 3
    @mklement0, what do you think of this for a good perf/memory ratio? [System.IO.File]::AppendAllLines("$pwd/out.csv", [Linq.Enumerable]::Skip([System.IO.File]::ReadLines("$pwd/in.csv"), 10)) – Brice Apr 12 '22 at 18:29
  • @Brice, nicely done: this is indeed an elegant and well-performing memory-friendly line-by-line solution that isn't much slower than the overall fastest read-file-in-full approaches. The speed gain comes from deferring the (lazy) iteration to a .NET method instead of looping (much more slowly) in PowerShell code. Note that you should use `[System.IO.File]::WriteAllLines()`, however. – mklement0 Apr 12 '22 at 18:57
3

You're essentially attempting to remove the starting bytes of the file without modifying the remaining bytes. Raymond Chen has a good read posted here about why that can't be done.

The underlying abstract model for storage of file contents is in the form of a chunk of bytes, each indexed by the file offset. The reason appending bytes and truncating bytes is so easy is that doing so doesn’t alter the file offsets of any other bytes in the file. If a file has ten bytes and you append one more, the offsets of the first ten bytes stay the same. On the other hand, deleting bytes from the front or middle of a file means that all the bytes that came after the deleted bytes need to “slide down” to close up the space. And there is no “slide down” file system function.
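
As a rough illustration of that asymmetry (the file path here is just a placeholder): .NET's FileStream.SetLength can drop bytes from the end of a file in place, but there is no counterpart for dropping bytes from the front.

# Truncating from the end is a cheap, in-place length change:
$fs = [System.IO.FileStream]::new('D:\data\huge.txt', [System.IO.FileMode]::Open)
$fs.SetLength($fs.Length - 100)   # drop the last 100 bytes
$fs.Dispose()

# There is no analogous call to drop the *first* 100 bytes;
# every remaining byte would have to "slide down", i.e. be rewritten.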

Mike Anthony
3

As Mike Anthony's helpful answer explains, there is no system-level function that efficiently implements what you're trying to do, so you have no choice but to rewrite your file.

While memory-intensive, the following solution is reasonably fast:

  • Read the file as a whole into memory, as a single string, using Get-Content's -Raw switch...

    • This is orders of magnitude faster than the line-by-line streaming that Get-Content performs by default.
  • ... then use regex processing to strip the first 10 lines ...

  • ... and save the trimmed content back to disk.

Important:

  • Since this rewrites the file in place, be sure to have a backup copy of your file.

  • Use -Encoding with Get-Content / Set-Content to correctly interpret the input / control the output character encoding (PowerShell fundamentally doesn't preserve the information about the character encoding of a file that was read with Get-Content). Without -Encoding, the default encoding is the system's active ANSI code page in Windows PowerShell, and, more sensibly, BOM-less UTF-8 in PowerShell (Core) 7+.

# Use -Encoding as needed.
(Get-Content -Raw in.csv) -replace '^(?:.*\r?\n){10}' | 
  Set-Content -NoNewLine in.csv
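
For instance, if you know the files are UTF-8 (an assumption about your data; substitute the actual encoding), the same command with explicit encodings might look like this:

(Get-Content -Raw -Encoding utf8 in.csv) -replace '^(?:.*\r?\n){10}' |
  Set-Content -NoNewLine -Encoding utf8 in.csv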

If the file is too large to fit into memory:

If you happen to have WSL installed, an efficient, streaming tail solution is possible:

Note:

  • Your input file must use a character encoding in which a LF character is represented as a single 0xA byte - which is true of most single-byte encodings and also of the variable-width UTF-8 encoding, but not of, say, UTF-16.

  • You must output to a different file (which you can later replace the input file with).

bash.exe -c 'tail -n +11 in.csv > out.csv'

Otherwise, line-by-line processing is required.

Note: I'm leaving aside other viable approaches, namely those that either read and write the file in large blocks, as zett42 recommends, or an approach that collects (large) groups of output lines before writing them to the output file in a single operation, as shown in Theo's helpful answer.

Caveat:

  • All line-by-line processing approaches risk inadvertently changing the newline format of the original file: on writing the lines back to a file, it is invariably the platform-native newline format that is used (CRLF on Windows, LF on Unix-like platforms).

  • Also, the information as to whether the input file had a trailing newline or not is lost.

Santiago's helpful answer shows a solution based on .NET APIs, which performs well by PowerShell standards.

  • Brice came up with an elegant and significant optimization that lets a .NET method perform the (lazy) iteration over the file's lines, which is much faster than looping in PowerShell code:

    [System.IO.File]::WriteAllLines(
      "$pwd/out.csv",         
      [Linq.Enumerable]::Skip(
         [System.IO.File]::ReadLines("$pwd/in.csv"),
         10
      )
    )
    

For the sake of completeness, here's a comparatively slower, PowerShell-native solution using a switch statement with the -File parameter for fast line-by-line reading (much faster than Get-Content):

  & {
    $i = 0
    switch -File in.csv {
      default { if (++$i -ge 11) { $_ } }
    }
  } | Set-Content out.csv  # use -Encoding as needed

Note:

  • Since switch doesn't allow specifying a character encoding for the input file, this approach only works if the character encoding is correctly detected / assumed by default. While BOM-based files will be read correctly, note that switch makes different assumptions about BOM-less files based on the PowerShell edition: in Windows PowerShell, the system's active ANSI code page is assumed; in PowerShell (Core) 7+, it is UTF-8.

  • Because language statements cannot directly serve as pipeline input, the switch statement must be called via a script block (& { ... }).

  • Streaming the resulting lines to Set-Content via the pipeline is what slows the solution down. Passing the new file content as an argument to Set-Content's -Value parameter would drastically speed up the operation - but that would again require that the file fit into memory as a whole:

    # Faster reformulation, but *the input file must fit into memory as a whole*.
    # `switch` offers a lot of flexibility. If that isn't needed
    # and reading the file in full is acceptable, the
    # Get-Content -Raw solution at the top is the fastest PowerShell solution.
    Set-Content out.csv $(
      $i = 0
      switch -File in.csv {
        default { if (++$i -ge 11) { $_ } }
      }
    )
    
mklement0
1

There may be another alternative: use switch to read the files line by line, buffering up to a certain maximum number of lines in a List. This is lean on memory consumption and at the same time limits the number of disk writes, speeding up the process.

Something like this, perhaps:

$maxBuffer   = 10000  # the maximum number of lines to buffer
$linesBuffer = [System.Collections.Generic.List[string]]::new()

# get an array of the files you need to process
$files = Get-ChildItem -Path 'X:\path\to\the\input\files' -Filter '*.txt' -File
foreach ($file in $files) {
    # initialize a counter for omitting the first 10 lines and clear the buffer
    $omitCounter = 0 
    $linesBuffer.Clear()
    # create a new file path by appending '_New' to the input file's basename
    $outFile = '{0}\{1}_New{2}' -f $file.DirectoryName, $file.BaseName, $file.Extension

    switch -File $file.FullName {
        default {
            if ($omitCounter -ge 10) {
                if ($linesBuffer.Count -eq $maxBuffer) {
                    # write out the buffer to the new file and clear it for the next batch
                    Add-Content -Path $outFile -Value $linesBuffer
                    $linesBuffer.Clear()
                }
                $linesBuffer.Add($_)
            }
            else { $omitCounter++ }  # no output, just increment the counter
        }
    }
    # here, check if there is still some data left in the buffer
    if ($linesBuffer.Count) { Add-Content -Path $outFile -Value $linesBuffer }
}

Theo