0

I am intrigued by this question How to sort 30Million csv records in Powershell and came up with a solution which builds temporary files. Now I am trying to come up with another approach which comes down to first building an sorted index list ([int[]]) and than pick a bulk of those indices (e.g. 1e6) from the source file and drop them onto the pipeline:

Function Sort-BigCsv {
    [CmdletBinding()] param(
        [string]$FilePath,
        [String]$Property,
        [Int]$BulkSize = 1e6,
        [System.Text.Encoding]$Encoding = [System.Text.Encoding]::Default
    )
    Begin {
        if ($FilePath.StartsWith('.\')) { $FilePath = Join-Path (Get-Location) $FilePath }
        $Index = 0
        $Dictionary = [System.Collections.Generic.SortedDictionary[string, int]]::new()
        Import-Csv $FilePath -Encoding $Encoding | Foreach-Object { $Dictionary[$_.$Property] = $Index++ }
        $IndexList = [int[]]($Dictionary.Values)
        $Dictionary = $Null                                     # we only need the sorted index list
    }
    Process {
        $Start = 0
        While ($Start -lt $IndexList.Count) {
            [System.GC]::Collect()
            $End = $Start + $BulkSize - 1
            if ($End -ge $IndexList.Count) { $End = $IndexList.Count - 1 }
            Import-Csv $FilePath -Encoding $Encoding |
                Select-Object -Index $IndexList[$Start..$End] | # Note that the -Index parameter reorders the list
                    Sort-Object $Property |                     # Consider smarter sort as this has already be done before
            $Start = $End + 1
        }
    }
}

Example:

Sort-BigCsv .\Input.Csv Id -BulkSize 100 # | Export-Csv .\Output.Csv

I think that the general idea behind this should work, but I have second guesses what PowerShell is actually doing in terms of passing on the objects to the next cmdlet(/display), and questions arise like:

  • Will every single item (including multiple items created within one Process block cycle) always immediately be picked up and processed by next cmdlet?
  • Will there be any difference for this function if I put everything in the Process block into the End block?
  • What if the next process block is slower than the current one?
    • Will it stall the current one?
    • Or will the items be buffered?
      • If they are buffered, can I force them to be taken by the next cmdlet, or wait till they are consumed?

Maybe it is just working as supposed (it is hard to tell from e.g. the memory size in the task manager), but I would like to confirm this... Is there any check and/or control whether an item is passed on (or is this just simply always the case after a Write-Output`? Meaning, if the last cmdlet stalls, the first cmdlet will also needs to stall...)

iRon
  • 20,463
  • 10
  • 53
  • 79
  • 1
    Common parameter [-OutBuffer](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_commonparameters?view=powershell-7.1#outbuffer) might be useful for this. – zett42 Feb 11 '21 at 17:36
  • @zett42, thanks for reminding me on this parameter, I think that the answer is actually in the description: **If you omit this parameter, objects are sent as they're generated.** – iRon Feb 11 '21 at 17:44
  • Also interesting find: https://powershell.one/tricks/performance/pipeline I wasn't aware that you could pipe directly into a script block as an alternative to `ForEach-Object`. On my system the script block is ~4x as fast as `ForEach-Object`. – zett42 Feb 11 '21 at 17:56
  • It's even a greater difference when I simplify the output: `(Measure-Command { 1..1000000 | ForEach { $_ } } ).TotalSeconds` --> 2.35 s // `(Measure-Command { 1..1000000 | & { process { $_ } } } ).TotalSeconds` --> 0.34 s – zett42 Feb 11 '21 at 18:02
  • Good to know, but the issue is more preserving memory than performance. How can I sort an `csv` file with doesn't fit the available memory? I think you need sort the list in chunks. But it is not completely clear (to me, looking to the memory usage) whether a chunk has left the pipeline or not. – iRon Feb 11 '21 at 18:13

0 Answers0