I am intrigued by this question How to sort 30Million csv records in Powershell and came up with a solution which builds temporary files.
Now I am trying to come up with another approach which comes down to first building an sorted index list ([int[]]
) and than pick a bulk of those indices (e.g. 1e6
) from the source file and drop them onto the pipeline:
Function Sort-BigCsv {
[CmdletBinding()] param(
[string]$FilePath,
[String]$Property,
[Int]$BulkSize = 1e6,
[System.Text.Encoding]$Encoding = [System.Text.Encoding]::Default
)
Begin {
if ($FilePath.StartsWith('.\')) { $FilePath = Join-Path (Get-Location) $FilePath }
$Index = 0
$Dictionary = [System.Collections.Generic.SortedDictionary[string, int]]::new()
Import-Csv $FilePath -Encoding $Encoding | Foreach-Object { $Dictionary[$_.$Property] = $Index++ }
$IndexList = [int[]]($Dictionary.Values)
$Dictionary = $Null # we only need the sorted index list
}
Process {
$Start = 0
While ($Start -lt $IndexList.Count) {
[System.GC]::Collect()
$End = $Start + $BulkSize - 1
if ($End -ge $IndexList.Count) { $End = $IndexList.Count - 1 }
Import-Csv $FilePath -Encoding $Encoding |
Select-Object -Index $IndexList[$Start..$End] | # Note that the -Index parameter reorders the list
Sort-Object $Property | # Consider smarter sort as this has already be done before
$Start = $End + 1
}
}
}
Example:
Sort-BigCsv .\Input.Csv Id -BulkSize 100 # | Export-Csv .\Output.Csv
I think that the general idea behind this should work, but I have second guesses what PowerShell is actually doing in terms of passing on the objects to the next cmdlet(/display), and questions arise like:
- Will every single item (including multiple items created within one
Process
block cycle) always immediately be picked up and processed by next cmdlet? - Will there be any difference for this function if I put everything in the
Process
block into theEnd
block? - What if the next process block is slower than the current one?
- Will it stall the current one?
- Or will the items be buffered?
- If they are buffered, can I force them to be taken by the next cmdlet, or wait till they are consumed?
Maybe it is just working as supposed (it is hard to tell from e.g. the memory size in the task manager), but I would like to confirm this... Is there any check and/or control whether an item is passed on (or is this just simply always the case after a Write-Output`? Meaning, if the last cmdlet stalls, the first cmdlet will also needs to stall...)