
What is the most efficient way of replacing the first line of a HUMONGOUS CSV file? The files I am working with are nearly 40 GB in size. I am currently reading in the header row, cleaning it up, and trying to write it back to the file. This is what I am currently doing in PowerShell when writing back to the file:

Import-CSV $file -Header $newHeader | Select -Skip 1 | Export-CSV -Path $newFile -NoTypeInformation

I am wondering if there is a way for me to replace that first line, without reading the entire file into memory, then writing it to a new file.

UPDATE

I have modified my script to read/write using the following:

$sr = [System.IO.StreamReader] $inputFile
$sw = [System.IO.StreamWriter] $outputFile

#Get rid of the first row
$null = $sr.ReadLine()

$Header = #my new header
$sw.WriteLine($Header)
        
#Write the remainder of the old file to the new report file
#(compare against $null so blank lines in the data don't end the loop early)
while($null -ne ($line = $sr.ReadLine())) {
    $sw.WriteLine($line)
}

#Close the streams so everything is flushed to disk
$sr.Dispose()
$sw.Dispose()
Johnrad
  • Well, 40gb makes it pretty hard to read the whole file in memory :) so you'll likely have to look into StreamReader and StreamWriter – Santiago Squarzon Apr 29 '22 at 19:18
  • @SantiagoSquarzon In my understanding, OP's code should be a streaming approach already (albeit not the fastest one). Have I overlooked something? – zett42 Apr 29 '22 at 19:37
  • @zett42 you should propose as answer what you proposed in [this comment](https://stackoverflow.com/questions/71841990/powershell-efficient-way-to-delete-first-10-rows-of-a-huge-textfile/71843670#comment126957719_71843670), first write the new headers to the file then ignore the first line then write in chunks. In my eyes, this is the best way to do this. – Santiago Squarzon Apr 29 '22 at 20:08
    I have updated my question with a new method for reading/writing the file, using the StreamReader/StreamWriter classes. However, I am fairly unfamiliar with them, so any suggestions to be more time efficient with my code are greatly appreciated. – Johnrad Apr 29 '22 at 20:20
  • @SantiagoSquarzon I would, if I could actually get it working. Maybe with a fresh head tomorrow. Feel free to give it a try too. – zett42 Apr 29 '22 at 21:29
  • @zett42 I got you :) – Santiago Squarzon Apr 29 '22 at 21:39

2 Answers


This was initially proposed by zett42 in his helpful comment some time ago, and I think it is the best balance between speed and memory efficiency in this case. The concept is to read and write in chunks.

It is important to understand that this may bring encoding issues, as I explained in the same linked answer. You need to test for yourself whether the code works correctly as-is or whether you need to change the encoding (how to read and write using a specific encoding is also explained in that answer, and a small variation is shown after the snippet below).


Using 4kb as the recommended buffer size based on these excellent answers:


using namespace System.IO
using namespace System.Text

try {
    $sr = [File]::OpenRead('absolute\path\to\input.csv')
    $sw = [File]::OpenWrite('absolute\path\to\output.csv')
    $buffer = [char[]]::new(4kb)
    $reader = [StreamReader]::new($sr, [Encoding]::UTF8, $true, $buffer.Length)
    $writer = [StreamWriter]::new($sw, [Encoding]::UTF8, $buffer.Length)

    # ignore the first line (headers)
    $null = $reader.ReadLine()
    # define the new headers
    $header = '"ColumnA","ColumnB","ColumnC"'
    # and write them to the first line of our output file
    $writer.WriteLine($Header)

    # read the file in chunks of `$buffer` size
    while($len = $reader.Read($buffer, 0, $buffer.Length)) {
        # write the output in chunks of `$buffer` size
        $writer.Write($buffer, 0, $len)
    }
}
finally {
    # flush and dispose the streams
    $writer.ForEach('Flush')
    ($reader, $writer, $sr, $sw).ForEach('Dispose')
}
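
If the data turns out not to be UTF-8, the only change needed is the encoding passed to the reader and writer. A minimal sketch of that variation, replacing the two constructor lines in the snippet above and using Windows-1252 purely as an example encoding (pick whichever one actually matches your file):

# assumption: Windows-1252 is only an example, substitute the encoding your data uses
$encoding = [Encoding]::GetEncoding('Windows-1252')
$reader = [StreamReader]::new($sr, $encoding, $true, $buffer.Length)
$writer = [StreamWriter]::new($sw, $encoding, $buffer.Length)
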
Santiago Squarzon

You could use a very simple script. I'm not sure about efficiency, but it will work. It isn't proven to be the fastest (though I've had much better performance with it than with other commands), but it reads the file one line at a time, which drastically reduces memory usage and lets the task actually complete.

$count = 1
# the input path is a placeholder; point it at your 40 GB file
foreach ($line in [System.IO.File]::ReadLines('c:\in.csv')) {
    if ($count -eq 1) {
        # skip the original header row
        $count++
        continue
    }
    $line | Out-File c:\out.csv -Append
}

You could keep incrementing the count and only write out every 1000 lines or so, to prevent excessive Out-File calls. This might optimize it a bit.
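
A rough sketch of that batching idea, assuming hypothetical paths c:\in.csv and c:\out.csv and a 1000-line buffer (tune the number as you like):

# sketch only: collect lines and write them in batches instead of one Out-File call per line
$batch = [System.Collections.Generic.List[string]]::new()
$count = 0
foreach ($line in [System.IO.File]::ReadLines('c:\in.csv')) {
    $count++
    if ($count -eq 1) { continue }   # skip the original header row
    $batch.Add($line)
    if ($batch.Count -ge 1000) {     # flush the buffer every 1000 lines
        $batch | Out-File c:\out.csv -Append
        $batch.Clear()
    }
}
if ($batch.Count) { $batch | Out-File c:\out.csv -Append }   # write whatever is left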

If you're just trying to delete the first line, something like Notepad++ might be able to partially open the file so you can edit that line and save it. (It doesn't open the entire file, only the part you're looking at plus a few pages.)

Another option is to use Linux; WSL might work. Its text-processing capabilities blow PowerShell out of the water.

Use tools like awk or sed, or maybe just get a line count and tail the last (length - 1) lines of the file. Trust me, it's fast. Let's put it this way: PowerShell took 15 minutes to deal with a 150 MB file, and Linux did it in 2 or 3 seconds.

Robert Cotterman