
The CSV is about a gig and has over 1 million rows. I'm attempting to split it into 5. The first 4 take about 18 hours - the 5th never happens (or I haven't been patient enough); my PC will just sit in a sleep mode because it won't unlock.

function Split-Csv{
    [CmdletBinding()]
    Param(
        [Parameter(Mandatory)]
        [string]$FilePath,

        [Parameter()]
        [string]$Delimiter=',',

        [Parameter(Mandatory)]
        [string]$TargetFolder,

        [Parameter(Mandatory)]
        [string]$Name,

        [Parameter(Mandatory)]
        [int]$NumberOfFiles
    )

    try{
        if(-not (Test-Path $TargetFolder)){
            New-Item -Path $TargetFolder -ItemType Directory
        }

        $csvData=Import-Csv -Path $FilePath -Delimiter $Delimiter

        $startRow=0
        $numberOfRowsPerFile=[Math]::Ceiling($csvData.count/$NumberOfFiles)
        $counter=1

        while($startRow -lt $csvData.Count){
            $csvData | Select-Object -Skip $startRow -First $numberOfRowsPerFile | Export-Csv -Path "$TargetFolder\$Name-$counter.csv" -NoTypeInformation -NoClobber
            $startRow+=$numberOfRowsPerFile
            $counter++
        }
    }catch{
        Write-Error $_.Exception.Message
    }
}
Split-Csv -FilePath "C:\temp\2018.csv" -Delimiter ',' -TargetFolder "C:\temp\Split" -Name "Split" -NumberOfFiles 5
  • This would not be so hard to do, and would be very efficient using .NET APIs, if you treat your CSV as plain text instead of objects; chunking it into __X pieces__ is problematic, though - it would be easier if you chunk the file into pieces __of X length__ – Santiago Squarzon Aug 30 '22 at 18:44
  • Problem with that is the file is too large to open. It crashes my pc – MrMeester Aug 30 '22 at 19:07
  • @SantiagoSquarzon - it relies on the data not containing line breaks in the literal data - e.g. ```"aaa, bbb`r`nxxx, `"yyy`r`nzzz`""``` - so you don't split a record across files, but that's probably a reasonable assumption in *most* cases (see the small demo below these comments). – mclayton Aug 30 '22 at 19:08
  • See also: [Read CSV sheet in batches using Powershell](https://stackoverflow.com/a/73774733/1701026) – iRon Oct 12 '22 at 07:35
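
To illustrate mclayton's caveat: a quoted CSV field can legally contain a line break, so any purely line-based splitter can cut such a record in half. A small demonstration with made-up data (the file path is just an example):

# A quoted field with an embedded newline: Import-Csv treats it as one record,
# but Get-Content (and any line-based splitter) sees it as separate lines.
@'
name,notes
alice,"line one
line two"
'@ | Set-Content C:\temp\embedded.csv

(Import-Csv C:\temp\embedded.csv).Count   # 1 record
(Get-Content C:\temp\embedded.csv).Count  # 3 lines (header + 2)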

2 Answers


You can give this function a try; as I stated in the comments, this will be much faster and more memory friendly if you use the .NET APIs instead of treating the CSV as objects.

This function uses the StreamReader and StreamWriter classes to read and write line by line.

Usage examples, as well as its Merge-Csv counterpart, are available through the PS Gallery as well as in the official GitHub repo.
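
The core of the approach, stripped of the encoding helpers, is a read/write loop along these lines - a minimal sketch only, with placeholder paths and an assumed chunk size (the complete function follows):

using namespace System.IO

# Sketch: stream the source line by line, starting a new chunk file
# (with the header repeated) once the current chunk reaches the target size.
# Assumes C:\temp\Split already exists and ignores quoted multi-line fields.
$reader    = [StreamReader]::new('C:\temp\2018.csv')
$headers   = $reader.ReadLine()
$chunkSize = 200mb    # assumed target size per chunk
$index     = 0

$writer = [StreamWriter]::new("C:\temp\Split\chunk-$index.csv")
$writer.AutoFlush = $true
$writer.WriteLine($headers)

while (-not $reader.EndOfStream) {
    if ($writer.BaseStream.Length -ge $chunkSize) {
        $writer.Dispose()
        $index++
        $writer = [StreamWriter]::new("C:\temp\Split\chunk-$index.csv")
        $writer.AutoFlush = $true
        $writer.WriteLine($headers)
    }
    $writer.WriteLine($reader.ReadLine())
}

$writer.Dispose()
$reader.Dispose()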

using namespace System.IO
using namespace System.Text
using namespace System.Management.Automation
using namespace System.Management.Automation.Language
using namespace System.Collections
using namespace System.Collections.Generic

# All Credits to jborean93 on the EncodingTransformation Class
# Source: https://gist.github.com/jborean93/50a517a8105338b28256ff0ea27ab2c8#file-get-extendedattribute-ps1

class EncodingTransformation : ArgumentTransformationAttribute {
    [object] Transform([EngineIntrinsics] $EngineIntrinsics, [object] $InputData) {
        $outputData = switch($InputData) {
            { $_ -is [Encoding] } { $_ }

            { $_ -is [string] } {
                switch ($_) {
                    ASCII { [ASCIIEncoding]::new() }
                    BigEndianUnicode { [UnicodeEncoding]::new($true, $true) }
                    BigEndianUTF32 { [UTF32Encoding]::new($true, $true) }
                    ANSI {
                        $raw = Add-Type -Namespace Encoding -Name Native -PassThru -MemberDefinition '
                            [DllImport("Kernel32.dll")]
                            public static extern Int32 GetACP();
                        '
                        [Encoding]::GetEncoding($raw::GetACP())
                    }
                    OEM { [Console]::OutputEncoding }
                    Unicode { [UnicodeEncoding]::new() }
                    UTF8 { [UTF8Encoding]::new($false) }
                    UTF8BOM { [UTF8Encoding]::new($true) }
                    UTF8NoBOM { [UTF8Encoding]::new($false) }
                    UTF32 { [UTF32Encoding]::new() }
                    default { [Encoding]::GetEncoding($_) }
                }
            }

            { $_ -is [int] } { [Encoding]::GetEncoding($_) }

            default {
                throw [ArgumentTransformationMetadataException]::new(
                    "Could not convert input '$_' to a valid Encoding object."
                )
            }
        }

        return $outputData
    }
}

class EncodingCompleter : IArgumentCompleter {
    [string[]] $EncodingSet = @(
        'ascii'
        'bigendianutf32'
        'unicode'
        'utf8'
        'utf8NoBOM'
        'bigendianunicode'
        'oem'
        'utf7'
        'utf8BOM'
        'utf32'
        'ansi'
    )

    [IEnumerable[CompletionResult]] CompleteArgument (
        [string] $commandName,
        [string] $parameterName,
        [string] $wordToComplete,
        [CommandAst] $commandAst,
        [IDictionary] $fakeBoundParameters
    ) {
        [CompletionResult[]] $arguments = foreach($enc in $this.EncodingSet) {
            if($enc.StartsWith($wordToComplete)) {
                [CompletionResult]::new($enc)
            }
        }
        return $arguments
    }
}

function Split-Csv {
    [CmdletBinding(DefaultParameterSetName = 'ByChunks')]
    param(
        [Parameter(Position = 0, Mandatory, ValueFromPipeline, ValueFromPipelineByPropertyName)]
        [alias('FullName')]
        [string] $Path,

        [Parameter(Mandatory)]
        [string] $DestinationFolder,

        [Parameter(ParameterSetName = 'BySize')]
        [int64] $Size = 1kb,

        [Parameter(ParameterSetName = 'ByChunks')]
        [int32] $Chunks = 3,

        [Parameter()]
        [EncodingTransformation()]
        [ArgumentCompleter([EncodingCompleter])]
        [Encoding] $Encoding = 'utf8',

        [Parameter()]
        [switch] $PassThru
    )

    begin {
        $Destination = $PSCmdlet.GetUnresolvedProviderPathFromPSPath($DestinationFolder)

        class ChunkWriter {
            [FileInfo] $Source
            [string] $Destination
            [string] $Headers
            [string] $Format
            [Encoding] $Encoding

            [StreamWriter] GetNewWriter([int32] $Index) {
                $name     = [string]::Format(
                    '{0} - Part {1}{2}',
                    $this.Source.BaseName,
                    $Index.ToString($this.Format),
                    $this.Source.Extension
                )
                $newChunk = Join-Path $this.Destination -ChildPath $name
                $writer   = [StreamWriter]::new($newChunk, $false, $this.Encoding)
                $writer.AutoFlush = $true
                $writer.WriteLine($this.Headers)
                return $writer
            }
        }
    }
    process {
        try {
            [FileInfo] $Path = $PSCmdlet.GetUnresolvedProviderPathFromPSPath($Path)
            $null    = [Directory]::CreateDirectory($Destination)
            $reader  = [StreamReader]::new($Path.FullName, $Encoding, $true)
            $headers = $reader.ReadLine()
            $Index   = 0

            if($PSCmdlet.ParameterSetName -eq 'ByChunks') {
                $chunkSize = ($Path.Length - $headers.Length) / $Chunks + ($headers.Length * $Chunks)
                $format    = 'D{0}' -f $Chunks.ToString().Length
            }
            else {
                $chunkSize = $Size - $headers.Length
                $format    = 'D{0}' -f [math]::Ceiling($Path.Length / $Size).ToString().Length
            }

            $chunkWriter = [ChunkWriter]@{
                Source      = $Path
                Destination = $Destination
                Headers     = $headers
                Format      = $format
                Encoding    = $Encoding
            }

            $writer = $chunkWriter.GetNewWriter($Index++)

            while(-not $reader.EndOfStream) {
                if($writer.BaseStream.Length -ge $chunkSize) {
                    $writer.Dispose()

                    if($PassThru.IsPresent) {
                        $writer.BaseStream.Name -as [FileInfo]
                    }

                    $writer = $chunkWriter.GetNewWriter($Index++)
                }
                $writer.WriteLine($reader.ReadLine())
            }
        }
        catch {
            $PSCmdlet.ThrowTerminatingError($_)
        }
        finally {
            $writer, $reader | ForEach-Object Dispose

            if($PassThru.IsPresent) {
                $writer.BaseStream.Name -as [FileInfo]
            }
        }
    }
}
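
For example, splitting the file from the question into 5 chunks could look like this (paths taken from the question); the `-Size` parameter set is the alternative if you prefer chunks of a fixed size:

# Split into 5 roughly equal chunks and emit the resulting FileInfo objects
Split-Csv -Path C:\temp\2018.csv -DestinationFolder C:\temp\Split -Chunks 5 -PassThru

# Or split into chunks of roughly 200 MB each
Split-Csv -Path C:\temp\2018.csv -DestinationFolder C:\temp\Split -Size 200mb -PassThru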
Santiago Squarzon

I tested this code on various sizes of CSVs:

$incsv = Import-Csv .\BigCsv.csv

$incr = [int]($incsv.Count / 5)
$last = $incsv.Count - 1

$idx = @()

for ($i = 0; $i -le $last; $i += $incr)
{
    $end = $i + ($incr - 1)
    if ($end -gt $last) { $end = $last }
    $idx += @{ 'Start' = $i ; 'End' = $end }
}

for ($i = 0; $i -lt 5; $i++)
{
    $start = $idx[$i].Start
    $end = $idx[$i].End
    $incsv[$start..$end] | Export-Csv -NoType ".\Chunk$($i+1).csv"
}

I timed Import-Csv and each chunk export. Import-Csv/Export-Csv don't scale very well, but should be OK for the size/rows you're working with. My final test CSV was 1.68 GB and ~1.1 million rows, on a 7200 rpm drive in a 5-year-old laptop - this should run much faster on better hardware:

Import-Csv : 00:05:51.2411580
Chunk 1    : 00:02:12.3754368
Chunk 2    : 00:01:16.5562393
Chunk 3    : 00:01:13.3434148
Chunk 4    : 00:01:14.5231756
Chunk 5    : 00:01:14.6382049

Total time : 00:13:02.6859795

The extra time for chunk 1 might be a disk-related anomaly - in all the previous tests, the chunk times were near-identical, and the combined time for the exports was roughly the same as the import time.
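
For reference, timings like these can be collected with Measure-Command - a rough sketch reusing the `$incsv` and `$idx` variables from the code above, not the exact harness used here:

# Time the import and each chunk export separately (sketch only)
$importTime = Measure-Command { $incsv = Import-Csv .\BigCsv.csv }
"Import-Csv : $importTime"

for ($i = 0; $i -lt 5; $i++) {
    $chunkTime = Measure-Command {
        $incsv[$idx[$i].Start..$idx[$i].End] | Export-Csv -NoType ".\Chunk$($i+1).csv"
    }
    "Chunk $($i+1)    : $chunkTime"
}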

tkc
  • #Plan A: Using pipes keeps memory use low: `Get-Content -Path C:\temp\mybigfile.txt | Select-Object -First 200000 | Out-File -FilePath C:\temp\1file.txt`; `Get-Content -Path C:\temp\mybigfile.txt | Select-Object -Skip 200000 -First 200000 | Out-File -FilePath C:\temp\2file.txt` #Plan B: Have heard of people loading in a foreach: `foreach ($data in Get-Content -Path C:\temp\mybigfile.txt) { }` – Aaron Aug 31 '22 at 02:11