
I have a large CSV file, approximately 9 GB in size. In PowerShell, I need to split it into 10 MB chunks. The catch is that I need to keep the lines intact, so each split file ends at the end of a line and starts at the start of the next line.

Since the file is so large, I need a way of splitting it that doesn't force the code to load the entire file at once (or it will crash).

I've been at this for two weeks and haven't found a workable solution. The worst part is that it takes over a day for the code to reach its memory breaking point, so I can only test, debug and restart the code a few times per week. Very slow going.

Without fail, every version I've tried has triggered a memory-related error, such as "Get-Content : Insufficient memory to continue the execution of the program."

I've tried the following methods.

If you have a PowerShell-related link to a working version of these approaches, or you know a better approach yourself, please weigh in.

I'm going to refrain from giving you complete copies of all three scripts, 'cause that would get really long, but at least this way you'll know what methods I've already tried and the insufficient memory errors that keep popping up.

Any help would be appreciated. If I keep pulling out my hair on this issue, I'll be bald before my time!

Previous Attempts (Each one ran for over a day before failing due to a memory error)

  1. Get-Content (I was under the impression that Get-Content didn't try to load all the data at once, and was therefore the best choice for this.)
Get-Content $sourceAddress |
    ForEach-Object {

        $var1, $var2, $var3 ... = $_ -split ","

        # Push these variables into a new line in a csv file
        # If you've reached 100,000 lines, export a split file
    }
  2. Import-Csv (Here, I tried extracting data between a range of lines)
for ($i = 0; $i -le $MaxSplits; $i++)
{

    (Import-Csv $InCsv)[$Min..$Max] |
        Export-Csv -Path $outfile -NoTypeInformation

}
  3. Stream Reader (ChatGPT says this is the best way. Yes, I'm desperate enough to ask an AI for coding advice)
$reader = New-Object System.IO.StreamReader($filePath)

while (!$reader.EndOfStream) {

    # Loop to retrieve the specified number of lines
    for ($i = 0; $i -lt $linesPerFile; $i++) {

        # Read the next line from the file
        # Check if the end of the file has been reached
        # If not, Add the line to the lines array
    }

    # Write the current batch of lines to the output file
    # Increment the file counter
}

Again, any help would be appreciated.

  • This may help: https://github.com/santisq/PSCsvUtil, which has split and merge functions. The function from this answer can also work if you don't want to install a module (it uses the same logic): https://stackoverflow.com/a/73547526/15339544 – Santiago Squarzon Feb 07 '23 at 22:48
  • Having a huge CSV file is already wrong by design; with that much data you need a binary format or a proper database instead. Anyway, the 3rd method looks OK, although `If not, Add the line to the lines array` will definitely trigger out-of-memory errors; instead, write each line directly to the current file or open a new one – phuclv Feb 08 '23 at 00:18
  • What if you pipe it and do something like `get-content file.txt -ReadCount 2 | % { $i = 1 } { $_ | set-content "$i.txt"; $i++ }`, but instead of 2 it's 10mb. – js2010 Feb 08 '23 at 01:01
  • Do any of the fields in your CSV file contain newlines? Does the large CSV file have headers? Are the field values all quoted? – Theo Feb 08 '23 at 13:36
  • I'll check with my boss and see if installing a new module can be approved. – S. B. Bennett Feb 08 '23 at 16:33
  • None of the fields contain newlines. They are all small strings: ID numbers, names, etc. The large CSV does have headers, but the client doesn't care if each split has those headers, because later it will all be chronologically uploaded into a processing program. I'm not sure what you mean by quoting the field values. Do you mean each individual cell's content in the large CSV? – S. B. Bennett Feb 08 '23 at 16:35

4 Answers


You might use the steppable pipeline for this, as it prevents you from (re)opening and closing the destination file (via Export-Csv) on every iteration, which is obviously very expensive:

$FileSize = 10Mb
Import-Csv .\Input.csv |
    ForEach-Object -Begin {
        $Index = 0
        $Path = $Null
    } -Process {
        if (!$Path) {
            $Index++
            $Path = ".\Batch$Index.csv"
            $Pipeline = { Export-Csv -notype -Path $Path }.GetSteppablePipeline()
            $Pipeline.Begin($True)            
        }
        $Pipeline.Process($_)
        if ((Get-Item $Path).Length -gt $FileSize) {
            $Pipeline.End()
            $Path = $Null
        }
    } -End {
        $Pipeline.End()
    }

For a full explanation, see: Mastering the (steppable) pipeline
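
If the pattern itself is unfamiliar, here is a minimal standalone sketch of what a steppable pipeline does (demo.csv and the Number property are just placeholders for this illustration):

$sp = { Export-Csv -NoTypeInformation -Path .\demo.csv }.GetSteppablePipeline()
$sp.Begin($true)                                   # run Export-Csv's begin block: the file is opened once
1..3 | ForEach-Object {
    $sp.Process([pscustomobject]@{ Number = $_ })  # push one object at a time into the open pipeline
}
$sp.End()                                          # run the end block: flush and close demo.csv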

iRon

Try something like this:

$reader = New-Object System.IO.StreamReader($filePath + ".csv")
$writer = $null
$lineCount = 0
$fileCount = 1
while (($line = $reader.ReadLine()) -ne $null)
{
   $line = $line.Trim()
   if ($line.Length -gt 0)
   {
      if ($lineCount % 10000000 -eq 0)
      {
         if ($writer -ne $null)
         {
            $writer.Flush()
            $writer.Close()
         }
         $writer = New-Object System.IO.StreamWriter($filePath + $fileCount + ".csv")
         $fileCount++
      }
      $writer.WriteLine($line)
      $lineCount++
   }
}
$writer.Flush()
$writer.Close()
$reader.Close()
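
If the 10 MB target matters more than a fixed line count, the same loop can switch files based on the bytes written so far instead; here is a rough sketch, assuming single-byte characters and CR+LF line endings:

$reader = New-Object System.IO.StreamReader($filePath + ".csv")
$writer = $null
$bytesWritten = 0
$fileCount = 0
while (($line = $reader.ReadLine()) -ne $null)
{
    # start a new chunk when this line would push the current one past 10 MB
    if ($writer -eq $null -or $bytesWritten + $line.Length + 2 -gt 10MB)
    {
        if ($writer) { $writer.Flush(); $writer.Close() }
        $fileCount++
        $writer = New-Object System.IO.StreamWriter($filePath + $fileCount + ".csv")
        $bytesWritten = 0
    }
    $writer.WriteLine($line)
    $bytesWritten += $line.Length + 2   # +2 for CR+LF; assumes one byte per character
}
if ($writer) { $writer.Flush(); $writer.Close() }
$reader.Close()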
jdweng
  • I'm almost there. I used this code to successfully split a 56 GB text file. However, the initial csv file I set out to split (which is only 9 GB in size) is triggering the following error: "Exception calling 'ReadLine' with '0' argument(s): 'Insufficient memory to continue the execution of the program.' It then highlights the "$line = $reader.ReadLine()" section of the code. Seriously, how can I have enough memory to cut up a 56 GB file, but not enough memory to cut up a 9 GB file?! – S. B. Bennett Feb 09 '23 at 23:07
  • This is a PS issue not releasing memory. Try dispose or garbage collect : https://stackoverflow.com/questions/4136066/powershell-does-not-release-the-memory – jdweng Feb 10 '23 at 11:06
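
For reference, that Dispose/garbage-collect suggestion amounts to a few lines run after each split job finishes (a sketch, using the variable names from the answer above):

$reader.Dispose()                          # release the StreamReader and its file handle
if ($writer) { $writer.Dispose() }         # likewise for any open StreamWriter
[System.GC]::Collect()                     # ask .NET to reclaim memory now
[System.GC]::WaitForPendingFinalizers()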

I wouldn't mind multiple sets of eyes on this code. I think I thought of most everything, but I'm not sure. This was done in PS 5.1; I'm not sure how/if it works in PS 7.x.

This function is used to save each chunk of data. The decision to pass $FileLines as a parameter instead of through the pipeline could be a mistake - I don't like needing [AllowEmptyString()]. I'm lacking experience passing string arrays to functions.

One issue with this task is determining the source file's encoding. I found this, but didn't use it in the code. I would suggest inspecting the source file to determine what encoding it uses, and then setting the -Encoding of the Get-Content and/or Out-File commands in the function to match it. Also, I used a hex file viewer called HxD to inspect the input and output files.
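
For what it's worth, a quick way to check for a byte order mark is to peek at the first few bytes of the file (a sketch; the path is a placeholder, and a file without a BOM needs closer inspection):

# PowerShell 5.1 syntax; on PowerShell 7 use -AsByteStream instead of -Encoding Byte
$bom = Get-Content -Path .\Input.csv -Encoding Byte -TotalCount 3
if     ($bom[0] -eq 0xEF -and $bom[1] -eq 0xBB -and $bom[2] -eq 0xBF) { 'UTF-8 with BOM' }
elseif ($bom[0] -eq 0xFF -and $bom[1] -eq 0xFE) { 'UTF-16 LE (Unicode)' }
elseif ($bom[0] -eq 0xFE -and $bom[1] -eq 0xFF) { 'UTF-16 BE (BigEndianUnicode)' }
else   { 'No BOM: likely ASCII, ANSI, or UTF-8 without BOM' }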

function SaveFile {
    param (
        [Parameter(Mandatory = $true, Position = 0)]
        [int]$FileIndex,
        [Parameter(Mandatory = $true, Position = 1)]
        [AllowEmptyString()][string[]]$FileLines
    )
    $FilePath = Join-Path -Path $PSScriptRoot -ChildPath ("Chunk{0:d4}.CSV" -f $FileIndex)
    # Encoding options: ASCII, BigEndianUnicode, Default, OEM, String, Unicode, Unknown, UTF7, UTF8, UTF32
    $FileLines | Out-File -FilePath $FilePath -Encoding ascii
}
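
For example, calling it with a chunk index and an array of lines writes that chunk next to the script (the sample lines here are made up):

SaveFile 0 @('id,name', '1,Example')   # hypothetical call: writes Chunk0000.CSV into $PSScriptRoot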

The file I used is some random file on my computer; replace it with your file's path and name.

Get-Content pipes the file's lines to Select-Object, which creates Size and Text properties that become members of $_ in the following Foreach-Object. The idea of calculating the string size came from this answer; since ASCII is one byte per char, it should work faithfully. Each line's size is increased by 2, assuming each line ends with CR+LF.

The Foreach-Object then splits into a begin code block, a process code block and an end code block.

In the Begin block, commenting out the currently active $OutFileSize assignment and uncommenting the line just above it sets the output file size to 10 MB.

The process block keeps a running total of the size; if it goes above $OutFileSize, it dumps the currently stored lines into the next file and resets the running total to the current line's size.

The end block saves the last file chunk still in $LineList buffer before quitting.

Get-Content 'C:\WINDOWS\system32\VmChipset Third-Party Notices.txt' |
    Select-Object -Property @{Name = 'Size'; Expression = {[System.Text.Encoding]::ASCII.GetByteCount($_)+2}},@{Name = 'Text'; Expression = {$_}} |
    Foreach-Object -Begin {
        #[int]$OutFileSize = 10*1024*1024       # 10 MB size
        [int]$OutFileSize = 1024
        [int]$Index = 0
        $CurrentTotal = 0
        $LineList = [System.Collections.ArrayList]@()
    } -Process {
        if(($CurrentTotal += $_.Size) -ge $OutFileSize) {
            if($LineList.Count -gt 0) {
                SaveFile $Index $LineList.ToArray()
                $LineList.Clear()
            }
            $CurrentTotal = $_.Size
            $Index += 1
        }
        $null = $LineList.Add($_.Text)
    } -End {
        SaveFile $Index $LineList.ToArray()
    }

The file I used in my experiment had 6,658 bytes:

6,658 VmChipset Third-Party Notices.txt

And the sum of the sizes of the resulting files also comes out to 6,658.

02/07/2023  11:16 PM               967 Chunk0000.CSV
02/07/2023  11:16 PM             1,020 Chunk0001.CSV
02/07/2023  11:16 PM               990 Chunk0002.CSV
02/07/2023  11:16 PM               974 Chunk0003.CSV
02/07/2023  11:16 PM               965 Chunk0004.CSV
02/07/2023  11:16 PM               980 Chunk0005.CSV
02/07/2023  11:16 PM               762 Chunk0006.CSV
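
The total can be verified with, for example:

(Get-ChildItem -Path $PSScriptRoot -Filter 'Chunk*.CSV' | Measure-Object -Property Length -Sum).Sum   # sums the chunk file sizes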

NOTE: I do NOT expect this to be fast, but if you only need to run it once, it may work for you.

Darin

Tried splitting a 100 MB test file into roughly 10 MB chunks, using about half a gig of memory, in about 9 seconds. At that rate, 9 GB might take 14 minutes. (I did it for real in 12 minutes.)

$content = foreach ($i in 1..11mb) { 'abcdefg' }
set-content -path file.csv -value $content
get-content file.csv -ReadCount 1mb | 
  % { $i = 1 } { set-content -path file$i.csv -value $_; $i++ }


history | select -last 1 | fl

Id                 : 348
CommandLine        : get-content file.csv -ReadCount 1mb | 
  % { $i = 1 } { set-content -path file$i.csv -value $_; $i++ }
ExecutionStatus    : Completed
StartExecutionTime : 2/10/2023 11:11:38 AM
EndExecutionTime   : 2/10/2023 11:11:47 AM


dir

    Directory: C:\Users\admin\foo

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a----         2/10/2023  10:56 AM      103809024 file.csv
-a----         2/10/2023  11:02 AM        9437184 file1.csv
-a----         2/10/2023  11:03 AM        9437184 file10.csv
-a----         2/10/2023  11:03 AM        9437184 file11.csv
-a----         2/10/2023  11:02 AM        9437184 file2.csv
-a----         2/10/2023  11:02 AM        9437184 file3.csv
-a----         2/10/2023  11:02 AM        9437184 file4.csv
-a----         2/10/2023  11:02 AM        9437184 file5.csv
-a----         2/10/2023  11:03 AM        9437184 file6.csv
-a----         2/10/2023  11:03 AM        9437184 file7.csv
-a----         2/10/2023  11:03 AM        9437184 file8.csv
-a----         2/10/2023  11:03 AM        9437184 file9.csv

Make a real 9gb test file (7 min):

foreach ($i in 1..90) {
  $content = foreach ($j in 1..11mb) { 'abcdefg' }
  add-content -path file.csv -value $content
}
js2010