
I am using an OleDbConnection to sort a CSV file by one of its columns. The OleDb query completes successfully for up to 9 million records within about 6 minutes, but when I execute it against 10 million records I get the following error message.

Exception calling "ExecuteReader" with "0" argument(s): "The query cannot be completed. Either the size of the query result is larger than the maximum size of a database (2 GB), or there is not enough temporary storage space on the disk to store the query result."

Is there any other solution to sort 30 million records using PowerShell?

Here is my script:

$OutputFile = "D:\Performance_test_data\output1.csv"
$stream = [System.IO.StreamWriter]::new( $OutputFile )

$sb = [System.Text.StringBuilder]::new()
$sw = [Diagnostics.Stopwatch]::StartNew()

$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source='D:\Performance_test_data\';Extended Properties='Text;HDR=Yes;CharacterSet=65001;FMT=Delimited';")
$cmd=$conn.CreateCommand()
$cmd.CommandText="Select * from 1crores.csv order by col6"

$conn.open()

$data = $cmd.ExecuteReader()

echo "Query has been completed!"
$stream.WriteLine( "col1,col2,col3,col4,col5,col6")

while ($data.read()) 
{ 
  $stream.WriteLine( $data.GetValue(0) +',' + $data.GetValue(1)+',' + $data.GetValue(2)+',' + $data.GetValue(3)+',' + $data.GetValue(4)+',' + $data.GetValue(5))

}
echo "data written successfully!!!"

$stream.close()
$sw.Stop()
$sw.Elapsed

$cmd.Dispose()
$conn.Dispose()
Ritthisha S
  • I think this is a limitation of the MS OLE DB provider. Have you tried just importing the CSV file using Import-Csv and then sorting with Sort-Object? – Daniel Feb 05 '21 at 05:29
  • @Daniel, thanks, I tried Import-Csv with Sort-Object, but it is too slow and the system went into a "not responding" state – Ritthisha S Feb 05 '21 at 05:32
  • Right. I guess you wouldn't have gone through the trouble of using OLE otherwise! Maybe SQLite might be worth a try? – Daniel Feb 05 '21 at 05:46
  • So the actual question is how to sort large files efficiently in PowerShell? – James Z Feb 05 '21 at 07:07
  • What is the slow part? `Import-Csv` or `Sort-Object`? Which version of PowerShell are you using? Last, please include some sample data in your post. – zett42 Feb 05 '21 at 08:05
  • To sort a stream you need to stall it, as the last record might potentially become the first... – iRon Feb 05 '21 at 08:22
  • @Daniel, could you share sample code for SQLite? – Ritthisha S Feb 05 '21 at 09:39
  • If memory for `Import-Csv` is not a problem with such a large file, then [this](https://stackoverflow.com/a/66011619/9898643) can help sorting faster. – Theo Feb 05 '21 at 10:22
  • @Theo, when we run with this (array) approach, the system hangs and PowerShell stops responding – Ritthisha S Feb 05 '21 at 11:18
  • Well.. it was worth a try, but apparently the file is simply too large to do all this in the available memory.. In that case, go for the SQLite option in [Steven's answer](https://stackoverflow.com/a/66061603/9898643) – Theo Feb 05 '21 at 11:23
  • What you can do (basic flow): Use a streamreader to read the CSV file. Then for each line you read build a PSCustomObject and stick this in a collection (list, array etc). Then sort this using Linq. Afterwards you can output the data again as csv, json or write it directly to a database – bluuf Feb 05 '21 at 14:05
  • Interesting question. I wonder if someone has an "external sort" PowerShell module: sort pieces, then merge. I think the Linux sort can do it. In my tests, 1 million objects take about 400 MB to load, sort, and save, so maybe 11 GB of RAM required at least. – js2010 Feb 08 '21 at 20:53

5 Answers


You could try SQLite:

$OutputFile = "D:\Performance_test_data\output1.csv"

$sw = [Diagnostics.Stopwatch]::StartNew()

sqlite3 output1.db '.mode csv' '.import 1crores.csv 1crores' '.headers on' ".output $OutputFile" 'Select * from 1crores order by 最終アクセス日時'

echo "data written successfully!!!"

$sw.Stop()
$sw.Elapsed
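
If sqlite3 is not on your PATH (see the comments below), the same one-liner can be run with an explicit path to the executable, from the folder that holds the CSV. This is only a sketch: the sqlite3.exe location is an assumption, and the ORDER BY column has been switched to col6 to match the question.

$OutputFile = "D:\Performance_test_data\output1.csv"

# run from the folder that contains 1crores.csv (or use a full path in the .import command)
Set-Location 'D:\Performance_test_data'

# hypothetical install location for sqlite3.exe; adjust to wherever you unpacked it
& 'C:\tools\sqlite\sqlite3.exe' output1.db '.mode csv' '.import 1crores.csv 1crores' '.headers on' ".output $OutputFile" 'Select * from 1crores order by col6'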
Eric Eskildsen
  • Where do we need to mention the input folder path? I am getting: The term 'sqlite3' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again. – Ritthisha S Feb 05 '21 at 11:20
  • Also, I installed SQLite using the following command: Install-Module -Name SQLite – Ritthisha S Feb 05 '21 at 11:31
  • sqlite3 is an exe. You can download it from [here](https://www.sqlite.org/download.html) – Daniel Feb 05 '21 at 15:52
  • @RitthishaS, as Daniel's comment explains, sqlite3 is an executable. You can add its directory to your PATH environment variable to call it by its unqualified name as I do in my answer, or you can qualify it with the path of the directory where it lives. – Eric Eskildsen Feb 08 '21 at 00:11
  • I have implemented the logic and it works fine, but the problem is that it takes up to 98% of memory when I execute it with 20 million records (file size: 1.76 GB), and execution completed in 15 minutes. Is there any other way to optimize the memory usage? – Ritthisha S Feb 08 '21 at 08:32
  • @Eric Eskildsen - It is working perfectly, except for the memory issue. One more doubt: how do I set the encoding for this CSV before reading it? – Ritthisha S Feb 08 '21 at 12:05
  • @RitthishaS Great to hear! For the memory issues and encoding, those are SQLite questions. You should find or create separate StackOverflow questions for them. – Eric Eskildsen Feb 08 '21 at 13:51
  • @RitthishaS That said, one thing you could try is using SQLite's `.output` instead of piping it back to PowerShell. I'll edit my answer to show how to do that. – Eric Eskildsen Feb 08 '21 at 13:54

You can try using this:

$CSVPath = 'C:\test\CSVTest.csv'
$Delimiter = ';'

# list we use to hold the results
$ResultList = [System.Collections.Generic.List[Object]]::new()

# Create a stream (I use OpenText because it returns a streamreader)
$File = [System.IO.File]::OpenText($CSVPath)

# Read and parse the header
$HeaderString = $File.ReadLine()

# Get the properties from the string, replace quotes
$Properties = $HeaderString.Split($Delimiter).Replace('"',$null)
$PropertyCount = $Properties.Count

# now read the rest of the data, parse it, build an object and add it to a list
while ($File.EndOfStream -ne $true)
{
    # Read the line
    $Line = $File.ReadLine()
    # split the fields and replace the quotes
    $LineData = $Line.Split($Delimiter).Replace('"',$null)
    # Create a hashtable with the properties (we convert this to a PSCustomObject later on). I use an ordered hashtable to keep the order
    $PropHash = [System.Collections.Specialized.OrderedDictionary]@{}
    # for loop to add the properties and values
    for ($i = 0; $i -lt $PropertyCount; $i++)
    { 
        $PropHash.Add($Properties[$i],$LineData[$i])
    }
    # Now convert the data to a PSCustomObject and add it to the list
    $ResultList.Add($([PSCustomObject]$PropHash))
}

# Now you can sort this list using Linq:
Add-Type -AssemblyName System.Linq
# Sort using propertyname (my sample data had a prop called "Name")
$Sorted = [Linq.Enumerable]::OrderBy($ResultList, [Func[object,string]] { $args[0].Name })

Instead of using Import-Csv I've written a quick parser which uses a StreamReader, parses the CSV data on the fly, puts each row in a PSCustomObject and adds it to a list.

edit: fixed the linq sample
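
To write the sorted rows back to disk, the (lazy) LINQ result can simply be piped to Export-Csv. A minimal follow-up sketch; the output path is just an example, and for the question's data the OrderBy selector would use col6 instead of Name:

# enumerate the LINQ result and write it out as CSV
$Sorted | Export-Csv -Path 'C:\test\CSVTest_sorted.csv' -NoTypeInformation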

bluuf
  • This is working, but when I execute it with 10 million records, memory usage reaches up to 97%. Is there any other way that does not store the data in memory? Thanks for the help. – Ritthisha S Feb 08 '21 at 06:25
  • Since each record takes up a certain amount of memory, you need enough memory to hold the records, or use something like SQL. The problem with 10 million records is: if a single record is 1 KB and you multiply it by 10 million, you end up with roughly 10 GB. – bluuf Feb 08 '21 at 07:16
  • Can we try with an ODBC connection? Do you have any sample code for ODBC? Thanks in advance. (Note: I already tried with OleDb and got an error like "reached database limit 2GB".) – Ritthisha S Feb 08 '21 at 07:19

Putting performance aside, and to at least come to a solution that works (meaning one that doesn't hang due to memory shortage), I would rely on the PowerShell pipeline. The issue, though, is that for sorting objects you need to stall the pipeline, as the last object might potentially become the first object.
To resolve this, I would first do a coarse division on the first character(s) of the property concerned. Once that is done, fine-sort each coarse division and append the results:

Function Sort-BigObject {
    [CmdletBinding()] param(
        [Parameter(ValueFromPipeLine = $True)]$InputObject,
        [Parameter(Position = 0)][String]$Property,
        [ValidateRange(1,9)]$Coarse = 1,
        [System.Text.Encoding]$Encoding = [System.Text.Encoding]::Default
    )
    Begin {
        $TemporaryFiles = [System.Collections.SortedList]::new()
    }
    Process {
        if ($InputObject.$Property) {
            $Grain = $InputObject.$Property.SubString(0, $Coarse)
            if (!$TemporaryFiles.Contains($Grain)) { $TemporaryFiles[$Grain] = New-TemporaryFile }
            $InputObject | Export-Csv $TemporaryFiles[$Grain] -Encoding $Encoding -Append
        } else { $InputObject }   # objects with an empty sort property are output immediately (see caveats below)
    }
    End {
        Foreach ($TemporaryFile in $TemporaryFiles.Values) {
            Import-Csv $TemporaryFile -Encoding $Encoding | Sort-Object $Property
            Remove-Item -LiteralPath $TemporaryFile
        }
    }
}

Usage
(Don't assign the stream to a variable and don't use parentheses.)

Import-Csv .\1crores.csv | Sort-BigObject <PropertyName> | Export-Csv .\output.csv
  • If the temporary files still get too big to handle, you might need to increase the -Coarse parameter
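
For the file from the question, a call might look like the sketch below. Since col6 holds date strings, a larger -Coarse value is needed for the grouping to be useful; the value 7 is only illustrative:

# coarse-divide on the first 7 characters of col6 (roughly year + month), then fine-sort each division
Import-Csv .\1crores.csv | Sort-BigObject col6 -Coarse 7 | Export-Csv .\output.csv -NoTypeInformation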

Caveats (improvement considerations)

  • Objects with an empty sort property will be immediately outputted
  • The sort column is presumed to be a (single) string column
  • I presume the performance is poor (I didn't do a full test on 30 million records, but 10,000 records take about 8 seconds, which suggests about 8 hours). Consider replacing native PowerShell cmdlets with .NET streaming methods, buffering/caching file input and output, and parallel processing.
iRon

I have added a new answer as this is a completely different approach to tackling this issue.
Instead of creating temporary files (which presumably causes a lot of file opens and closes), you might consider creating an ordered list of indices and then going over the input file (-InputFile) multiple times, each time processing a selective number of lines (-BufferSize = 1Gb; you might have to tweak this "memory usage vs. performance" parameter):

Function Sort-Csv {
    [CmdletBinding()] param(
        [string]$InputFile,
        [String]$Property,
        [string]$OutputFile,
        [Char]$Delimiter = ',',
        [System.Text.Encoding]$Encoding = [System.Text.Encoding]::Default,
        [Int]$BufferSize = 1Gb
    )
    Begin {
        if ($InputFile.StartsWith('.\')) { $InputFile = Join-Path (Get-Location) $InputFile }
        $Index = 0
        $Dictionary = [System.Collections.Generic.SortedDictionary[string, [Collections.Generic.List[Int]]]]::new()
        Import-Csv $InputFile -Delimiter $Delimiter -Encoding $Encoding | Foreach-Object { 
            if (!$Dictionary.ContainsKey($_.$Property)) { $Dictionary[$_.$Property] = [Collections.Generic.List[Int]]::new() }
            $Dictionary[$_.$Property].Add($Index++)
        }
        $Indices = [int[]]($Dictionary.Values | ForEach-Object { $_ })
        $Dictionary = $Null                                     # we only need the sorted index list
    }
    Process {
        $Start = 0
        $ChunkSize = [int]($BufferSize / (Get-Item $InputFile).Length * $Indices.Count / 2.2)
        While ($Start -lt $Indices.Count) {
            [System.GC]::Collect()
            $End = $Start + $ChunkSize - 1
            if ($End -ge $Indices.Count) { $End = $Indices.Count - 1 }
            $Chunk = @{}
            For ($i = $Start; $i -le $End; $i++) { $Chunk[$Indices[$i]] = $i }
            $Reader = [System.IO.StreamReader]::new($InputFile, $Encoding)
            $Header = $Reader.ReadLine()
            $i = $Start
            $Count = 0
            For ($i = 0; ($Line = $Reader.ReadLine()) -and $Count -lt $ChunkSize; $i++) {
                if ($Chunk.Contains($i)) { $Chunk[$i] = $Line }
            }
            $Reader.Dispose()
            if ($OutputFile) {
                if ($OutputFile.StartsWith('.\')) { $OutputFile = Join-Path (Get-Location) $OutputFile }
                $Writer = [System.IO.StreamWriter]::new($OutputFile, ($Start -ne 0), $Encoding)
                if ($Start -eq 0) { $Writer.WriteLine($Header) }
                For ($i = $Start; $i -le $End; $i++) { $Writer.WriteLine($Chunk[$Indices[$i]]) }
                $Writer.Dispose()
            } else {
                $Start..$End | ForEach-Object { $Header } { $Chunk[$Indices[$_]] } | ConvertFrom-Csv -Delimiter $Delimiter
            }
            $Chunk = $Null
            $Start = $End + 1
        }
    }
}

Basic usage

Sort-Csv .\Input.csv <PropertyName> -Output .\Output.csv
Sort-Csv .\Input.csv <PropertyName> | ... | Export-Csv .\Output.csv

Note that for 1crores.csv it will probably just export the full file at once unless you set -BufferSize to a lower amount, e.g. 500Kb.
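
Tying this to the question, a concrete call might look like the line below; the small buffer is only there to force chunking, as mentioned above:

# sort 1crores.csv on col6 and write the result; a smaller -BufferSize forces several passes
Sort-Csv .\1crores.csv col6 -OutputFile .\output.csv -BufferSize 500Kb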

iRon
  • Many thanks @iRon. When passing an input file with 6 rows, the result contains only two rows. When I put Write-Host in the loop below, it writes all 6 rows, but the final result has 2 rows: For ($i = 0; ($Line = $Reader.ReadLine()) -and $Count -lt $ChunkSize; $i++) { if ($Chunk.Contains($i)) { $Chunk[$i] = $Line } Write-Host $Line } – Ritthisha S Feb 15 '21 at 10:35
  • Actually it is removing duplicate rows for the property (the passed column) value, but I want the duplicate values too. – Ritthisha S Feb 15 '21 at 10:50
  • Thanks @iRon. The following CSV file has five rows, but the output file has only four rows; it seems one row (the first row) is missing. The partial-duplicate scenario works: "Col1","Col2","Col3","Col4","Col5","Col6" "a",".txt","1","2014/10/29 1:00","2018/4/1 1:00","2017/4/1 1:00" "b",".txt","1","2014/10/29 1:00","2018/4/1 1:00","2012/4/1 1:00" "c",".txt","1","2014/10/29 1:00","2018/4/1 1:00","2013/4/1 1:00" "e",".txt","1","2014/10/29 1:00","2018/4/1 1:00","2013/4/1 1:00" "aa",".txt","1","2014/10/29 1:00","2018/4/1 1:00","2017/4/1 1:00" – Ritthisha S Feb 15 '21 at 13:32
  • When the col6 value of the first row (2017/4/1 1:00) is repeated, the first row automatically gets removed. – Ritthisha S Feb 15 '21 at 13:37
  • Sorry for not mentioning the sort column name; I am sorting on col6 (the last column). – Ritthisha S Feb 15 '21 at 13:38
  • Thank you so much @iRon. It is working perfectly, and performance-wise it is good. Anyhow, I went with Python pandas; there it completed in less than 30 minutes for 3 million rows. – Ritthisha S Feb 16 '21 at 18:21

I downloaded GNU sort.exe from here: http://gnuwin32.sourceforge.net/packages/coreutils.htm It also requires libiconv2.dll and libintl3.dll from the dependency zip. I basically did this within cmd.exe, and it used a little less than a gig of RAM and took about 5 minutes on a 500 MB file of about 30 million random numbers. This command can also merge sorted files with --merge, and you can specify begin and end key positions for sorting with --key. It automatically uses temp files.

.\sort.exe < file1.csv > file2.csv

The Windows sort from the cmd prompt actually works in a similar way. It also has a /+n option to specify which character column to start the sort at.

sort.exe < file1.csv > file2.csv
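
As a variation (not from the original answer), GNU sort can also sort on a specific comma-separated field with -t and -k, which maps more directly onto "order by col6". This sketch assumes no quoted commas inside fields and ignores the fact that the header line gets sorted along with the data:

.\sort.exe -t "," -k 6,6 -o file2.csv file1.csv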
js2010