
I've seen the answer elsewhere for text files, but I need to do this for a compressed file.

I've got a 6G binary file which needs to be split into 100M chunks. Am I missing the analog for unix's "head" somewhere?

djsadinoff

3 Answers


Never mind. Here you go:

function split($inFile, $outPrefix, [Int32] $bufSize){

  $stream = [System.IO.File]::OpenRead($inFile)
  $chunkNum = 1
  $barr = New-Object byte[] $bufSize

  # Read returns 0 at end of file, which ends the loop
  while( $bytesRead = $stream.Read($barr, 0, $bufSize)){
    $outFile = "$outPrefix$chunkNum"
    $ostream = [System.IO.File]::OpenWrite($outFile)
    # write only the bytes actually read, so the last chunk isn't padded
    $ostream.Write($barr, 0, $bytesRead)
    $ostream.Close()
    Write-Host "wrote $outFile"
    $chunkNum += 1
  }
  $stream.Close()
}

Assumption: a buffer of $bufSize bytes fits in memory.
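Assuming the function above has been loaded into the session (e.g. dot-sourced), splitting a 6G file into 100M chunks might look like this — the paths are hypothetical, and 100MB is PowerShell's numeric-literal suffix for 104857600:

split "C:\temp\bigfile.bin" "C:\temp\chunk" 100MB

This should produce C:\temp\chunk1, C:\temp\chunk2, and so on, with the last chunk holding whatever bytes remain.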

  • why do we need `$stream.seek`? The Read method automatically sets the current position, right? – Samik Apr 24 '14 at 12:14
  • You're probably right, @Samik. If you can test it to ensure that it works, I'll remove the line of code. – djsadinoff Apr 25 '14 at 07:00
  • Yes, I commented out the three lines involving $curOffset and it worked just as well. As I am using this script to split a text file, I had to add a few lines of code, so that it does not break in the middle of a line. Anyway, thanks for the code. – Samik Apr 26 '14 at 02:20

The answer to the corollary question: How do you put them back together?

function stitch($infilePrefix, $outFile) {

    $ostream = [System.IO.File]::OpenWrite($outFile)
    $chunkNum = 1
    $infileName = "$infilePrefix$chunkNum"

    # keep appending chunks until the next numbered file doesn't exist
    while(Test-Path $infileName) {
        $bytes = [System.IO.File]::ReadAllBytes($infileName)
        $ostream.Write($bytes, 0, $bytes.Count)
        Write-Host "read $infileName"
        $chunkNum += 1
        $infileName = "$infilePrefix$chunkNum"
    }

    $ostream.Close()
}
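A hypothetical round trip with the split function from the accepted answer (paths are examples only):

split "C:\temp\bigfile.bin" "C:\temp\chunk" 100MB
stitch "C:\temp\chunk" "C:\temp\bigfile-copy.bin"

Since stitch probes each numbered file with Test-Path, it stops at the first missing chunk number, and the prefix passed to stitch must match the one passed to split.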
DrewDouglas

I answered the question alluded to in this question's comments by bernd_k, but in this case I would use -ReadCount instead of -TotalCount, e.g.

Get-Content bigfile.bin -ReadCount 100MB -Encoding byte

This causes Get-Content to read the file a chunk at a time, where the unit of a chunk is a line for text encodings or a byte for byte encoding. Keep in mind that when it does this, you get an array passed down the pipeline, not individual bytes or lines of text.
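A complete split along these lines might look like the following sketch (untested here; note that -Encoding Byte is Windows PowerShell syntax — PowerShell 6+ replaces it with -AsByteStream on both cmdlets):

$chunkNum = 1
Get-Content bigfile.bin -ReadCount 100MB -Encoding Byte | ForEach-Object {
    # each $_ is an array of up to 100MB bytes; write it to its own numbered file
    Set-Content "chunk$chunkNum" -Value $_ -Encoding Byte
    $chunkNum++
}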

Keith Hill
  • ...right, and then you need to figure out a way to get each chunk into a different file. The Jason Fossen link above recommends against manipulating large sets of data with get-content: "performance of get-content is horrible with large files. Unless you are reading less than 200KB, don’t use get-content..." Is that your experience? – djsadinoff Dec 28 '10 at 08:28
  • Also, can you express this as a complete solution akin to mine above? – djsadinoff Dec 28 '10 at 08:30
  • 1
    Got a chance to try this on a huge file and yeah, unless you've got a 64-bit PowerShell, forget about it. :-) I've had pretty good luck with read counts of 1KB but getting Get-Content to parcel it up into chunks of 100MB just doesn't scale. Too bad PowerShell can't handle this a bit more directly. – Keith Hill Dec 31 '10 at 01:51