2

I have a giant array with 1M+ names in it; some are alphanumeric, some are just alpha.

CSV:
id,firstname,lastname,email,email2,profession
100,Andeee,Michella,Andeee.Michella@yopmail.com,Andeee.Michella@gmail.com,police officer
101,Tybie,1Grobe,Tybie.Grobe@yopmail.com,Tybie.Grobe@gmail.com,worker
102,Fernande,Azeria,Fernande.Azeria@yopmail.com,Fernande.Azeria@gmail.com,developer
103,Lenna,Schenck,Lenna.Schenck@yopmail.com,Lenna.Schenck@gmail.com,police officer
104,4Marti,Brittani,Marti.Brittani@yopmail.com,Marti.Brittani@gmail.com,worker
105,Riannon,Aldric,Riannon.Aldric@yopmail.com,Riannon.Aldric@gmail.com,doctor
106,Corry,Nikaniki,Corry.Nikaniki@yopmail.com,Corry.Nikaniki@gmail.com,worker
107,Correy,Shama,Correy.Shama@yopmail.com,Correy.Shama@gmail.com,police officer
108,Marcy,Drus,Marcy.Drus@yopmail.com,Marcy.Drus@gmail.com,worker
109,Bill,Valerio,Bill.Valerio@yopmail.com,Bill.Valerio@gmail.com,worker

I don't want to use Sort-Object (or any full sort over the entire array) because it's taking way too long. This needs to be done in PowerShell because of restrictions in my environment.

The array comes from a CSV that I exported from another PowerShell job (I don't have access to the job's code, just the results).

Here is the example that I created from the Java method I found. It failed with the following error: The script failed due to call depth overflow.

$array = @("Ryan", "Kelly", "Alex", "Kyle", "Riley")
mergeSort $array

write-host $array

function mergeSort
{

   param([string[]] $arr)

      if($arr.length -ge 2){
         
         #find mid-point
         $left_len = [math]::floor([int32]$arr.length/2)                                              
         
         #declare array size of left of mid-point
         $left = [string[]]::new($left_len);                                                                 

         #find mid-point
         $right_len = [math]::ceiling([int32]($arr.Length - $arr.Length /2))                
     
         #declare array size right of mid-point
         $right = [string[]]::new($right_len);                                                               

         for ($i = 0; $i -lt $left_len.Length; $i++){
            $left= $arr[$i]
         }#for loop

         for ($i = 0; $i -lt $right_len; $i++){
            $right = $arr[$i +$arr.Length/2]
         }
     }#if($arr.length -ge 2)

   mergeSort $left

   mergeSort $right

   merge ($arr, $left, $right)
}

function merge
{
    param ([string[]] $result,[string[]] $left, [string[]] $right)

    $i1 = 0

    $12 = 0

    for ($i = 0; $i -le $result.Length; $i++) {
      if($i2 -gt $right.Length -or ($i1 -lt $left.Length -and $left[$i1].CompareTo($right[$i2]) -lt 0)) {
         $result[$i] = $left[$i1]
          $i1++
       }
       else {
          $result[$i] = $right[$i2]
          $i2++
       }

   }

   $result.legnth

 }

This is the latest solution I came up with following everyone's suggestions. I want to make this work in parallel, but it throws errors:

$array = @('Ryan', 'Kelly', 'Alex', 'Kyle', 'Riley', '4test', 'test4', 'why', 'you', 'me', 'where', 'hello', 'jose', 'test', 
'Jelly', 'Plex', 'Cyle', 'Miley', '5test', '3test4', 'who', 'Bou', 'We', 'There', 'Yellow', 'Pose', 'West')

$type = ("System.Collections.Generic.SortedSet"+'`'+"2") -as "Type"
$type = $type.MakeGenericType( @( ("System.string" -as "Type"), ("system.string" -as "Type") ) )
$sortedArray = [Activator]::CreateInstance($type, 10000)

$a, $b = ($array | Split-Collection -Count ([int]$array.length/2) | %{ $_ -join ',' })

$firstCollection = $a.Split(",")
$secondCollection = $b.Split(",")

$counter = 0
$counterHalf = $array.Length/2

1..$counterHalf| ForEach {
    try {   
        $col1 = $firstCollection[$counter]
        $sortedArray.Add($col1, $counter)
    }
    catch { "Out of bound col 1" }

    try {    
        $col2 = $secondCollection[$counter]
        $sortedArray.Add($col2, $counter)
    }
    catch { "Out of bound col 2" }
    
    $counter++
}

$sortedArray


function Split-Collection {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline=$true)] $Collection,
        [Parameter(Mandatory=$true)][ValidateRange(1, 247483647)][int] $Count)
    begin {
        $Ctr = 0
        $Arrays = @()
        $TempArray = @()
    }
    process {
        if (++$Ctr -eq $Count) {
            $Ctr = 0
            $Arrays += , @($TempArray + $_)
            $TempArray = @()
            return
        }
        $TempArray += $_
    }
    end {
        if ($TempArray) { $Arrays += , $TempArray }
        $Arrays
    }
}
JEuvin
  • 866
  • 1
  • 12
  • 31
  • I also saw something like this in Java but I can't figure out how to convert it to powershell. https://stackoverflow.com/questions/22925361/sorting-names-alphabetically-with-merge-sort-recursive – JEuvin Apr 28 '23 at 22:35
  • 2
    This doesn't make sense, if you split your array into chunks and then sort each chunk, how can you then ensure that when merged into 1 array it will be sorted? Use a `SortedSet` if you want efficiency. – Santiago Squarzon Apr 28 '23 at 22:35
  • Where does this array come from? Is it from a file? What do you want to do with this array? Keep in memory or export it? Please add this information to your question. You also need to clarify whats the sorting condition, is it lexicographically? is it by length? – Santiago Squarzon Apr 28 '23 at 22:38
  • @SantiagoSquarzon I answered your question in the main post. I also added code that I was trying to put together. I tried searching for SortedSet and only found .net code for it. Not sure what you mean about this. The array comes from a CSV generated by another PS script. I can pass it directly, I just made up a sample array for the sake of this example. – JEuvin Apr 28 '23 at 23:12
  • 1
    Yes `SortedSet` is a .NET class that you can use in pwsh, [here](https://stackoverflow.com/a/74657999/15339544) you have an example of it. what im trying to say is that by splitting your array and then merging it youre creating more overhead making it be more inefficient than it should. If this data comes for a CSV could you please add to your question an example of your CSV and also, does the PS script that generates this CSV allows you to stream the information instead of saving it to a file? if so please add this to your question – Santiago Squarzon Apr 28 '23 at 23:30
  • I updated the post again. Trying to get this to work in native powershell. Not sure if I can use that .net library. I am trying it, but its not looking good. The environment this is running in is very restrictive. – JEuvin Apr 28 '23 at 23:43
  • 1
    Use a sorted list. [https://stackoverflow.com/questions/74656156/powershell-single-line-huge-memory-usage/74657999#74657999](https://stackoverflow.com/questions/74656156/powershell-single-line-huge-memory-usage/74657999#74657999) – Acrometis Apr 29 '23 at 00:42
  • 1
    Regardless of any other issues in your code, this… ```merge ($arr, $left, $right)``` should be ```merge -result $arr -left $left -right $right``` - you don’t put brackets around parameters for Powershell cmdlets – mclayton Apr 29 '23 at 10:38
  • Unix sort is good for large files if it can do what you want. – js2010 Apr 29 '23 at 18:43

3 Answers

3

FWIW, this is an answer to the original question about getting your Merge Sort code working. Unfortunately it's not very performant, so I don't know if it will actually help you with the wider problem of sorting your 1M+ rows...

The Good News

I made some tweaks to your original mergeSort that appear to fix it, at least for the sample array at the top of your question.

The fixes were mostly typos - see a screenshot from BeyondCompare to see the changes I made:

[screenshot: BeyondCompare diff showing the changes]

The Bad News

It's sloooooow.

PS> $array = [string[]] (1..10000 | % { $_.ToString() }) 
PS> measure-command {
    mergeSort $array
}

...
TotalMilliseconds : 11511.74

compared to

PS> $array = [string[]] (1..10000 | % { $_.ToString() }) 
PS> measure-command {
    $array = $array | sort-object
}

...
TotalMilliseconds : 36.8607

Maybe it performs better with the scale of data you're talking about, but I didn't have the patience to test it :-)

The Ugly

I also tweaked the code slightly so it sorts the left side before doing anything with the right side, which means it doesn't need to use as much memory.

Here's the updated sample, for posterity.

$ErrorActionPreference = "Stop";
Set-StrictMode -Version "Latest";

function mergeSort
{

    param([string[]] $arr)

    if( $arr.length -gt 1 )
    {

        # sort the left side
        $left_len = [Math]::Floor([int32]$arr.length / 2)
        $left = [string[]]::new($left_len);                                                                 
        for( $i = 0; $i -lt $left_len; $i++ )
        {
            $left[$i] = $arr[$i]
        }
        mergeSort -arr $left

        # sort the right side
        $right_len = $arr.Length - $left_len
        $right = [string[]]::new($right_len);
        for( $i = 0; $i -lt $right_len; $i++ )
        {
            $right[$i] = $arr[$left_len + $i]
        }
        mergeSort -arr $right

        # merge the two sides
        merge -result $arr -left $left -right $right

    }

}

function merge
{
    param ([string[]] $result,[string[]] $left, [string[]] $right)

    $i1 = 0
    $i2 = 0

    for ($i = 0; $i -lt $result.Length; $i++)
    {
        if( ($i1 -lt $left.Length) -and (($i2 -ge $right.Length) -or $left[$i1].CompareTo($right[$i2]) -lt 0) )
        {
            $result[$i] = $left[$i1]
            $i1++
        }
        else
        {
            $result[$i] = $right[$i2]
            $i2++
        }
    }

}

$array = [string[]] @("Ryan", "Kelly", "Alex", "Kyle", "Riley")
mergeSort $array

write-host $array

One thing to specifically call out is casting the input array to [string[]]:

$array = [string[]] @("Ryan", "Kelly", "Alex", "Kyle", "Riley")

Without the cast, $array is of type [System.Object[]]. When you call mergeSort, PowerShell creates a new temporary [string[]] array internally, copies the values into it, and sorts that internal array - but it never assigns the internal array back to $array.

Try it without the cast to see that in action.
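A minimal repro sketch of that behavior (assuming the mergeSort function above is loaded):

```powershell
# mergeSort sorts in place, so the caller only sees the result
# when PowerShell passes the very same array instance.
$typed = [string[]] @("Ryan", "Kelly", "Alex")   # matches the [string[]] parameter
$plain = @("Ryan", "Kelly", "Alex")              # System.Object[]

mergeSort $typed   # same instance is modified -> $typed is now sorted
mergeSort $plain   # a temporary [string[]] copy is sorted and thrown away

Write-Host $typed  # Alex Kelly Ryan
Write-Host $plain  # Ryan Kelly Alex (unchanged)
```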

mclayton
  • 8,025
  • 2
  • 21
  • 26
  • When I try to import a CSV and pass into the $array the code does not sort. What I am missing? $array = [string[]] @() import-csv -path C:\Users\User\Downloads\majestic_million.csv| ForEach-Object{ $array += $_.Domain } mergeSort $array write-host $array – user716255 Apr 30 '23 at 20:38
  • 1
    (i) Avoid using the ```+=``` operator for performance-critical array appends - see the "note" sidebar here https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_arrays?view=powershell-7.3#manipulating-an-array – mclayton Apr 30 '23 at 20:52
  • 1
    (ii) ```+=``` is *also* turning your ```string[]]``` array back into a ```[system.object[]]``` array - try ```$x = [string[]] @("aaa"); write-host $x.GetType(); $x += "bbb"; write-host $x.GetType()``` - the first output is ```System.String[]``` and the second is ```System.Object[]```, so PowerShell does the "temporary internal copy" thing to make a new ```[string[]]``` to pass to ```mergeSort``` which is thrown away when ```mergeSort``` returns – mclayton Apr 30 '23 at 20:54
  • 1
    (iii) tbh, you might be better off turning all the ```[string[]]``` type references in the ```mergeSort``` and ```merge``` code into ```[object[]]```, then you don't need to cast your array to ```[string[]]```. You'll need to remember to case anything that's not already a ```[object[]]``` to an ```[object[]]``` before you pass it to ```mergeSort```, but that's less likely to trip you up as PowerShell generally prefers ```[object[]]```. – mclayton Apr 30 '23 at 20:59
  • I implemented you suggestion. If I removed "+= $_.domain" the it only shows the last value. With {$array = @([object[]]) import-csv -path C:\Users\User\Downloads\majestic_million.csv| ForEach-Object{ $array += $_.Domain } mergeSort $array write-host $array} Code does not sort and it returns. [System.Object[] facebook.com google.com youtube.com twitter.com instagram.com linkedin.com apple.co m microsoft.com wikipedia.org] – user716255 Apr 30 '23 at 21:11
  • Maybe open a separate question and reference this one. Your problem is that you’re passing the wrong type into the function so Powershell is casting it to a temporary variable to call the cmdlet, and throwing the result away afterwards. – mclayton Apr 30 '23 at 21:20
1

Use a SortedDictionary, keyed on the first column (id), so the rows are sorted as they are read:

$filename = 'c:\temp\test.csv'
$dict = [System.Collections.Generic.SortedDictionary[string,string]]::new()

$reader = [System.IO.StreamReader]::new($filename)
#skip header
$reader.ReadLine()
while( ($line = $reader.ReadLine()) -ne $null )
{
   if($line.Trim().Length -gt 0)
   {
      $firstComma = $line.IndexOf(',')
      $id = $line.Substring(0, $firstComma)
      $dict.Add($id, $line)
   }
}
$reader.Dispose()
$dict
jdweng
  • 33,250
  • 2
  • 15
  • 20
  • 1
    It’s worth noting that this might work for the OP’s specific dataset, but general CSV data can contain line breaks and commas that would break using this approach. It might also give unexpected results if some strings are quoted and some are not - the quotes themselves would be sorted as part of the data. The OP also appears to want the data sorted by FirstName rather than I’d… – mclayton Apr 29 '23 at 11:14
  • 1
    @jdweng your solution is linear. Exactly what I'm trying to avoid. – JEuvin Apr 29 '23 at 16:54
  • @mclayton Yes I'm aware of all those possibilities. The final CSV will be sanitized by the previous process. I have not worried about it as of right now because I'm trying to find a way to process it via Powershell because of the environment. Otherwise, I would have done this in DotNet in 20 mins. – JEuvin Apr 29 '23 at 16:56
  • It is not linear. It is a dictionary which is a hash and runs at LOG(N). You can also split line on the comma and then use any column as the key of the dictionary. – jdweng Apr 29 '23 at 19:19
  • @jdweng generates an error message. Exception calling "Substring" with "2" argument(s): "Length cannot be less than zero. Parameter name: length" At line:10 char:4 + $id = $line.Substring(0, $firstComma) + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + CategoryInfo : NotSpecified: (:) [], ParentContainsErrorRecordException + FullyQualifiedErrorId : ArgumentOutOfRangeException – user716255 Apr 30 '23 at 20:44
  • Usually this error occurs if you have blank lines in the file. I added code to test for these lines. If you still get an error please post line that is causing error. – jdweng Apr 30 '23 at 23:09
1

If you do have performance issues with PowerShell, you might start by reading PowerShell scripting performance considerations.

Top-down implementation

With regards to your first attempt, one thing you might want to avoid is a recursive function, simply because calling a function in PowerShell is pretty expensive, see: What is the best way to reuse static code
The specific error The script failed due to call depth overflow is most likely due to an incorrect base-case check: in your code the recursive calls sit outside the if($arr.length -ge 2) guard, so the function keeps calling itself forever...
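To illustrate with a stripped-down sketch of the control flow (not the full sort): the overflow disappears once the recursive calls can only run while the array is still splittable.

```powershell
# Sketch: the question's shape - recursion OUTSIDE the guard never terminates
function brokenSort {
    param([string[]] $arr)
    if ($arr.Length -ge 2) { <# split into $left / $right #> }
    brokenSort $left    # still executes for 0- and 1-element arrays
}

# Guarded shape: the implicit base case ($arr.Length -lt 2) stops the recursion
function guardedSort {
    param([string[]] $arr)
    if ($arr.Length -ge 2) {
        <# split into $left / $right #>
        guardedSort $left
        guardedSort $right
        <# merge $left and $right back into $arr #>
    }
}
```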

Bottom-up implementation

With regards to your second attempt, using the increase assignment operator (+=) to create a collection is a quite common PowerShell performance gotcha, see: Why should I avoid using the increase assignment operator (+=) to create a collection

Merge Sort Prototype

Taking these two issues in consideration, you might come up with a function like:

function MergeSort($List, $By) {
    $Count = @($List).get_Count()
    if ($Count -le 1) { return $List }
    $Temp = New-Object PSCustomObject[] $Count
    for ($Width = 1; $Width -lt $Count; $Width = 2 * $Width) {
        for ($Start = 0; $Start -lt $Count; $Start = $Start + 2 * $Width) {
            $Left = $Right = $To = 0
            do {
                if (
                  $Right -ge $Width -or $Start + $Width + $Right -ge $Count -or (
                    $Left -lt $Width -and $Start + $Left -lt $Count -and 
                    $List[$Start + $Left].$By -lt $List[$Start + $Width + $Right].$By
                  )
                )    { $Temp[$Start + $To++] = $List[$Start + $Left++] }
                else { $Temp[$Start + $To++] = $List[$Start + $Width + $Right++] }
            } while ($To -lt 2 * $Width -and $Start + $To -lt $Count)
        }
        $List, $Temp = $Temp, $List # Swap (the references of) the lists
    }
    $List
}

Demo

$Data = ConvertFrom-Csv @'
id,firstname,lastname,email,email2,profession
100,Andeee,Michella,Andeee.Michella@yopmail.com,Andeee.Michella@gmail.com,police officer
101,Tybie,1Grobe,Tybie.Grobe@yopmail.com,Tybie.Grobe@gmail.com,worker
102,Fernande,Azeria,Fernande.Azeria@yopmail.com,Fernande.Azeria@gmail.com,developer
103,Lenna,Schenck,Lenna.Schenck@yopmail.com,Lenna.Schenck@gmail.com,police officer
104,4Marti,Brittani,Marti.Brittani@yopmail.com,Marti.Brittani@gmail.com,worker
105,Riannon,Aldric,Riannon.Aldric@yopmail.com,Riannon.Aldric@gmail.com,doctor
106,Corry,Nikaniki,Corry.Nikaniki@yopmail.com,Corry.Nikaniki@gmail.com,worker
107,Correy,Shama,Correy.Shama@yopmail.com,Correy.Shama@gmail.com,police officer
108,Marcy,Drus,Marcy.Drus@yopmail.com,Marcy.Drus@gmail.com,worker
109,Bill,Valerio,Bill.Valerio@yopmail.com,Bill.Valerio@gmail.com,worker
'@

MergeSort $Data -by lastname |Format-Table

id  firstname lastname email                       email2                    profession
--  --------- -------- -----                       ------                    ----------
101 Tybie     1Grobe   Tybie.Grobe@yopmail.com     Tybie.Grobe@gmail.com     worker
105 Riannon   Aldric   Riannon.Aldric@yopmail.com  Riannon.Aldric@gmail.com  doctor
102 Fernande  Azeria   Fernande.Azeria@yopmail.com Fernande.Azeria@gmail.com developer
104 4Marti    Brittani Marti.Brittani@yopmail.com  Marti.Brittani@gmail.com  worker
108 Marcy     Drus     Marcy.Drus@yopmail.com      Marcy.Drus@gmail.com      worker
100 Andeee    Michella Andeee.Michella@yopmail.com Andeee.Michella@gmail.com police officer
106 Corry     Nikaniki Corry.Nikaniki@yopmail.com  Corry.Nikaniki@gmail.com  worker
103 Lenna     Schenck  Lenna.Schenck@yopmail.com   Lenna.Schenck@gmail.com   police officer
107 Correy    Shama    Correy.Shama@yopmail.com    Correy.Shama@gmail.com    police officer
109 Bill      Valerio  Bill.Valerio@yopmail.com    Bill.Valerio@gmail.com    worker

But as already concluded in @mclayton's helpful answer, it is unlikely that you will be able to beat the native Sort-Object with a hand-rolled PowerShell function.

SortedDictionary

Anyhow, the suggestion mentioned in @jdweng's helpful answer of using the SortedDictionary<TKey,TValue> class is a better candidate to beat the native Sort-Object command.

$Dictionary = [System.Collections.Generic.SortedDictionary[string,object]]::new()
$Data.foreach{ $Dictionary[$_.lastname] = $_ }
$Dictionary.Values |Format-Table
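One caveat with the index-assignment above: SortedDictionary keys must be unique, so rows sharing a lastname silently overwrite each other. A sketch of one workaround - making the key unique by appending the id (this assumes id itself is unique):

```powershell
$Dictionary = [System.Collections.Generic.SortedDictionary[string,object]]::new()
# '{0},{1}' keeps the primary sort on lastname while disambiguating duplicates
$Data.foreach{ $Dictionary[('{0},{1}' -f $_.lastname, $_.id)] = $_ }
$Dictionary.Values |Format-Table
```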

Benchmark

I question your comment that the SortedDictionary "solution is linear", though I can't prove otherwise. Anyway, I guess in the end it is the actual performance that counts here.

$Data = 1..10000 |Foreach-Object { [PSCustomObject]@{ Id = $_; Name = (New-Guid).Guid } }
(Measure-Command { 
    $Dict = [System.Collections.Generic.SortedDictionary[string,object]]::new()
    $Data.foreach{ $Dict[$_.Name] = $_ }
    $Null = $Dict.Values
}).TotalMilliseconds

(Measure-Command { 
    $SortedList = [System.Collections.Generic.SortedList[string,object]]::new()
    $Data.foreach{ $SortedList.Add($_.Name, $_ ) }
    $Null = $SortedList.Values
}).TotalMilliseconds

(Measure-Command { 
    $Null = $Data |Sort-Object -Property Name
}).TotalMilliseconds

(Measure-Command { 
    $Null = MergeSort $Data -By Name
}).TotalMilliseconds

Results

SortedDictionary:   110.8943
SortedList:         133.1337
Sort-Object:        141.4514
MergeSort Function: 550.4682

Divide and conquer

So, now that we know that a SortedDictionary and a SortedList perform the best, let's see if we can divide these .NET tasks over multiple parallel threads (CPUs).

Parallel Sort Prototype

For this I will need the SortedList (which is a little slower than the SortedDictionary), as I will need to be able to index into the returned lists.

function ParallelSort($List, $By, $ThrottleLimit = 3) {
    $Count = $List.get_Count()
    $Size = [Math]::Ceiling($Count / $ThrottleLimit)
    $Lists = 
        for ($c = 0; $c -lt $ThrottleLimit; $c++) {
            $Start = $c * $Size
            $End = [Math]::Min(($Start + $Size), $Count) - 1
            ,$List[$Start..$End]
        }
    $Lists = $Lists |Foreach-Object -ThrottleLimit $ThrottleLimit -Parallel {
        $By = $using:By # bring the property name into the parallel runspace
        $Sorted = [System.Collections.Generic.SortedList[string,object]]::new()
        foreach ($Item in $_) { $Sorted.Add($Item.$By, $Item) }
        ,$Sorted.Values
    }

    $Indices = [UInt[]]::new($ThrottleLimit)
    Do {
        $lt = $Null
        for ($l = 0; $l -lt $ThrottleLimit; $l++) {
            if ($Indices[$l] -lt $Lists[$l].Count -and (
                $Null -eq $lt -or
                $Lists[$l][$Indices[$l]].$By -lt $Lists[$lt][$Indices[$lt]].$By)
            ) { $lt = $l }
        }
        if ($Null -ne $lt) { $Lists[$lt][$Indices[$lt]++] }
    } While ($Null -ne $lt)
}

Usage (sanity check)

$Data = 1..50 |Foreach-Object { [PSCustomObject]@{ Id = $_; Name = (New-Guid).Guid } }
ParallelSort $Data -By Name

With a large amount of data (e.g. 200,000 entries), the merged-parallel sort routine indeed appears to be faster than the single-process SortedDictionary:

Merged-parallel:  4324.5299
SortedDictionary: 5335.0125
iRon
  • 20,463
  • 10
  • 53
  • 79
  • The "faster sort" question intrigued me, so I spent some more time creating a parallel sort function, which indeed appears somewhat faster (see updated answer). – iRon May 04 '23 at 19:18