
I have a performance issue with the PowerShell script below. I want to convert information from a CSV file to a JSON file. The conversion performs well when the CSV file is small, but the script hangs and never completes when I run it against a CSV file of around 200 MB or larger.

import-csv -path "F:\csvs\stack.csv" -UseCulture -Encoding UTF8 | 
select "Name", "Mobile","DOB","Email",@{n='Father_Name'; e={$($_."Father Name")}}, @{n = 'Other_mobile_no'; e = { [long]$_."Other mobile no." } },"Pincode","State" | 
ConvertTo-Json -Compress -Depth 100 | Add-Content -Encoding ASCII -Path "F:\csvs\stack.json"

Could you please let me know if you see any possible improvements to the script, or any changes I could make?

sridharnetha
  • Just curious.. Why import a csv with `-Encoding UTF8` and then write the json result to a file using `-Encoding ASCII`? – Theo Aug 28 '22 at 11:10
  • I added `-Encoding` for the editor's sake. I noticed no performance improvement with or without `-Encoding` – sridharnetha Aug 28 '22 at 11:25
  • This wasn't about performance, but about the possible loss of UTF-8 characters – Theo Aug 28 '22 at 11:34
  • As an aside, in addition to the `-Encoding` problem: Unless you really want to _append_ to a _preexisting_ file, use [`Set-Content`](https://learn.microsoft.com/powershell/module/microsoft.powershell.management/set-content) rather than [`Add-Content`](https://learn.microsoft.com/powershell/module/microsoft.powershell.management/add-content). – mklement0 Aug 28 '22 at 12:37
  • The pipeline is very flexible but it comes with an overhead that becomes significant when doing large amounts of low-level data crunching. You will probably get better performance using .NET methods. Do you need to support PS 5 or could you live with a PS 7+ solution? – zett42 Aug 28 '22 at 13:16
  • Unless you're willing to re-write `Import-Csv` I don't see how you can improve this... Have you tried loading all the CSV into memory and then looping over each item? – Santiago Squarzon Aug 28 '22 at 16:29
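
For illustration, here is a rough sketch of what that last comment suggests (loading the whole CSV into memory first, then transforming it with a foreach loop). This is not the commenter's code, and the [long] conversion of 'Other mobile no.' simply mirrors the original script:

# Hypothetical sketch of the "load everything into memory, then loop" idea.
$rows = Import-Csv -Path 'F:\csvs\stack.csv' -UseCulture -Encoding UTF8

$out = foreach( $row in $rows ) {
    [pscustomobject]@{
        Name            = $row.Name
        Mobile          = $row.Mobile
        DOB             = $row.DOB
        Email           = $row.Email
        Father_Name     = $row.'Father Name'
        Other_mobile_no = [long]$row.'Other mobile no.'
        Pincode         = $row.Pincode
        State           = $row.State
    }
}

# A single ConvertTo-Json call over the in-memory array, written out once.
ConvertTo-Json -InputObject $out -Compress | Set-Content -Encoding UTF8 -Path 'F:\csvs\stack.json'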

3 Answers


A simple way to improve performance is to avoid pipeline commands as much as possible, at the expense of requiring more memory. The pipeline is very flexible, but for several reasons it carries an overhead that becomes significant when doing large amounts of low-level data crunching.

  • Select-Object creates PSCustomObject instances, which have some inherent overhead due to their dynamic nature. As Santiago Squarzon points out, a faster way is to use a class instead, which also uses less memory.
  • For instantiating such a class, avoid ForEach-Object though. As pointed out by mklement0, due to the way its script block is invoked, it is much slower than a foreach(x in y) loop. What turned out to be even faster (and more memory-efficient) than a foreach loop is piping directly to a simple script block | & {…} (which is able to do pipeline processing by using a process section). Piping to a simple (non-advanced) function without parameters that contains only a process section works as well (a minimal standalone example follows this list).
  • Finally, by explicitly passing the data to the -InputObject parameter of ConvertTo-Json (instead of piping it), another small performance gain can be achieved, because this prevents unrolling of the input array.
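
As a tiny, self-contained sketch of that last variant (not part of the original code; the function name is made up for illustration), a simple function whose body is only a process section streams pipeline input much like ForEach-Object does, but without its per-item script-block invocation overhead:

# Hypothetical example: a simple (non-advanced) function containing only a
# process section; it streams pipeline input one item at a time.
function ConvertTo-UpperCase {
    process { $_.ToUpper() }
}

'alice', 'bob' | ConvertTo-UpperCase   # -> ALICE, BOB

The full rewrite of the original script, using the script-block variant of the same idea, follows:
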
$inputPath  = 'F:\csvs\stack.csv'
$outputPath = 'F:\csvs\stack.json'

# Defines the CSV columns
class MyCsvData {
    [string] $Name
    [string] $Mobile
    [string] $DOB
    [string] $Email
    [string] $Father_Name
    [object] $Other_mobile_no
    [string] $Pincode
    [string] $State
}

# Parse the whole CSV file into memory and transform the columns.
# Piping to a script block instead of ForEach-Object is much faster.
$csv = Import-Csv -path $inputPath -Encoding UTF8 | & { 
    process {
        # Try to convert 'Other mobile no.' field to long
        $Other_mobile_no = 0l
        if( -not [long]::TryParse( $_.'Other mobile no.', [ref] $Other_mobile_no ) ) {
            $Other_mobile_no = '--'
        }

        # Implicit output, gets captured in $csv.
        [MyCsvData] @{ 
            Name            = $_.Name
            Mobile          = $_.Mobile
            DOB             = $_.DOB
            Email           = $_.Email
            Father_Name     = $_.'Father Name'
            Other_mobile_no = $Other_mobile_no
            Pincode         = $_.Pincode
            State           = $_.State
        }
    }
}
 
ConvertTo-Json -InputObject $csv -Compress | Set-Content -Encoding UTF8 -Path $outputPath

Using my own sample data this code runs about 5 times faster than your original code.

Bigger performance improvements could be possible by rewriting the inner loop in (inline) C# and using a .NET CSV parser. This way we could also get around the memory overhead of the current solution.

I did a preliminary test using the TextFieldParser class, which cut the runtime down by another half. I had expected more from a C# solution, and comments here also suggest it isn't the fastest parser. It could be worth trying another one (see CSV parser benchmark).
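
To make that idea more concrete, here is a rough, hedged sketch of what such a TextFieldParser loop could look like. It is not the code of the preliminary test; it reuses the $inputPath/$outputPath variables and the MyCsvData class defined above, and assumes the Microsoft.VisualBasic assembly with TextFieldParser is available (as it is on Windows PowerShell):

# Hypothetical sketch only: a TextFieldParser-based replacement for Import-Csv.
Add-Type -AssemblyName Microsoft.VisualBasic

$parser = [Microsoft.VisualBasic.FileIO.TextFieldParser]::new( $inputPath )
$parser.TextFieldType = [Microsoft.VisualBasic.FileIO.FieldType]::Delimited
$parser.SetDelimiters( ',' )
$parser.HasFieldsEnclosedInQuotes = $true

try {
    # Build a column-name -> index lookup from the header row, so the sketch
    # doesn't have to assume a fixed column order.
    $header = $parser.ReadFields()
    $col = @{}
    for( $i = 0; $i -lt $header.Count; $i++ ) { $col[ $header[ $i ] ] = $i }

    $rows = [System.Collections.Generic.List[object]]::new()
    while( -not $parser.EndOfData ) {
        $f = $parser.ReadFields()

        # Same long conversion as in the main code above.
        $other = 0L
        if( -not [long]::TryParse( $f[ $col[ 'Other mobile no.' ] ], [ref] $other ) ) {
            $other = '--'
        }

        $rows.Add( [MyCsvData] @{
            Name            = $f[ $col[ 'Name' ] ]
            Mobile          = $f[ $col[ 'Mobile' ] ]
            DOB             = $f[ $col[ 'DOB' ] ]
            Email           = $f[ $col[ 'Email' ] ]
            Father_Name     = $f[ $col[ 'Father Name' ] ]
            Other_mobile_no = $other
            Pincode         = $f[ $col[ 'Pincode' ] ]
            State           = $f[ $col[ 'State' ] ]
        })
    }
}
finally {
    $parser.Close()
}

ConvertTo-Json -InputObject $rows -Compress | Set-Content -Encoding UTF8 -Path $outputPath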

zett42
  • Pipeline processing is actually one of the most efficient and balanced ways of processing data; there is no overhead with the pipeline itself, the overhead is with parameter binding. – Santiago Squarzon Aug 28 '22 at 19:00
  • @SantiagoSquarzon Thanks for the correction, hopefully the wording is better now. – zett42 Aug 28 '22 at 19:25
  • Another improvement would be a simple class instead of `pscustomobject`; it would cut the runtime by half, more or less, and consume less memory – Santiago Squarzon Aug 28 '22 at 19:33
  • @SantiagoSquarzon I was just thinking about that. :) – zett42 Aug 28 '22 at 19:33
  • [here](https://gist.github.com/santysq/99a155f95b6838b9902e55e3b657a4e0) is a simple example of what I meant by pipeline processing being one of the most efficient and balanced ways of processing, it's almost as fast as the fastest loop available in pwsh. – Santiago Squarzon Aug 28 '22 at 19:37
  • @SantiagoSquarzon can you give an example of a simple class instead of `pscustomobject`, using the @zett42 answer? – sridharnetha Aug 28 '22 at 19:42
  • @sridharnetha look in the gist I linked above – Santiago Squarzon Aug 28 '22 at 19:43
  • @sridharnetha I've updated the answer to use a class. Indeed it cuts down the time by half compared to `PSCustomObject`. – zett42 Aug 28 '22 at 19:53
  • @SantiagoSquarzon, @zett42 I changed the line to `Other_mobile_no = If([bool]($row."Other mobile no." -as [long] -is [long])){[long]$row."Other mobile no."}Else{"--"}` but the performance is very poor. I tested with a CSV file of `~190MB`. – sridharnetha Aug 28 '22 at 20:14
  • @sridharnetha Ok, so we need to arm the .NET/C# cannon for further improvements. In a quick test using `TextFieldParser` and `Utf8JsonWriter` I converted a 200 MB CSV in about 20 s, which is a massive improvement over the current solution, which took about 140 s. I need to clean up the code and will then post it as an alternative solution. – zett42 Aug 28 '22 at 22:10
  • Indeed, the pipeline itself isn't to blame for poor performance, but rather the implementation of `ForEach-Object` (ditto for `Where-Object`). @Santiago has avoided that in the linked Gist. I wouldn't call that a performance problem due to parameter binding per se, however: it is primarily a problem of how the script block passed to `ForEach-Object` is invoked - see [GitHub issue #10982](https://github.com/PowerShell/PowerShell/issues/10982) for the - complicated - backstory. – mklement0 Aug 29 '22 at 01:55
  • Thus, it seems that, assuming a `foreach` loop or a speed-optimized pipeline that avoids `ForEach-Object`, the speedup comes primarily from avoiding `Select-Object` with calculated properties in favor of constructing instances of the custom class. – mklement0 Aug 29 '22 at 02:29
  • @mklement0 I've updated my gist; it still shows that a class outperforms `pscustomobject` when filtering properties from objects – Santiago Squarzon Aug 29 '22 at 02:35
  • @Santiago, with `Import-Csv` in the picture, you're _invariably_ getting `[pscustomobject]`s _first_, and any conversion to custom-class instances amounts to _additional_ effort. However, that effort appears to be lower than an additional `Select-Object` call with calculated properties (which makes sense). – mklement0 Aug 29 '22 at 02:57
  • @mklement0 the use of `Select-Object` by the OP hints that he needs to filter properties from the imported CSV objects, hence new objects have to be created, which is what the test in my gist is doing. Filtering properties from an object with a class is faster than with `pscustomobject` or `Select-Object` – Santiago Squarzon Aug 29 '22 at 03:03
  • @Santiago, yes, but my point is that the speedup is _solely_ related to replacing the costly `Select-Object` call with better-performing construction of custom-class instances (constructing _new_ `[pscustomobject]`s from what `Import-Csv` returns was never really in the picture). The cost of constructing the initial `[pscustomobject]`s from the CSV source cannot be avoided, if `Import-Csv` is used. If the custom-class optimization alone isn't enough and the bottleneck is in the `Import-Csv` call, plain-text parsing is needed, or use of compiled code. – mklement0 Aug 29 '22 at 03:10
  • @mklement0 I agree; in that case the most sensible thing to do would be to use `TextFieldParser`, as zett42 pointed out in his answer, though we can't help with that unless we have a plain-text representation of the CSV – Santiago Squarzon Aug 29 '22 at 03:12
  • @sridharnetha I've added code for more efficient conversion to `long`, using `[long]::TryParse()`. On my system the current code runs about 45 s for a 200 MB file; that's probably as fast as it can get using a pure PowerShell solution. I haven't found time to develop the .NET/C# solution further yet. – zett42 Aug 29 '22 at 10:08
  • Kudos for all the optimizations, but I do think `Import-Csv` itself has a problem, both in terms of execution speed and memory, and I think the proposal in [GitHub issue #8862](https://github.com/PowerShell/PowerShell/issues/8862) would make a big difference. If you're up for it, I suggest adding it to the answer. – mklement0 Sep 06 '22 at 17:23
  • @Santiago, because I came later to the discussion, I didn't originally fully appreciate the work you did in your benchmarks. Given that that is a worthy discussion in its own right - comparing the performance of `Select-Object` with `ForEach-Object` / `. { process { ... } }` + `[pscustomobject] @{ ... }` literals vs. custom `class` instances (with field-individual assignments vs. initialization by hashtable) - I invite you to create a self-answered question that highlights your findings, if you're up for it. – mklement0 Sep 06 '22 at 17:24
  • @mklement0 I was working on a solution using [`CsvHelper`](https://joshclose.github.io/CsvHelper/) and Newtonsoft.Json, which indeed gave a big performance improvement. Then I got a bit demotivated when the OP already posted an answer. ;-) Might have a look at this again. – zett42 Sep 06 '22 at 17:46
  • @mklement0 thanks for the comment, appreciated. I might post an answer later but, same as zett42, I don't feel much motivation to post one when the OP has posted his self-answer using Python. - @zett42 Using .NET to parse the CSV would definitely bring better results. As for JSON, if you want a Windows PowerShell compatible answer you also need to worry about `System.Web.Script.Serialization.JavaScriptSerializer`, since that's what is used to handle JSON there – Santiago Squarzon Sep 06 '22 at 17:56
  • Understood re motivation with respect to _this_ question, @Santiago, but I was thinking that a _separate_, "canonical" q & a that compares the speed of object-construction techniques in PowerShell _in general_ may be beneficial. You've done much of the work already. – mklement0 Sep 06 '22 at 18:01
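
To make the point from this comment discussion concrete, here is a small, self-contained benchmark sketch (the input size and the trivial doubling operation are arbitrary choices for illustration): piping through ForEach-Object is typically the slowest of the three, while piping to a plain script block with a process section gets close to a foreach loop:

# Illustrative micro-benchmark; absolute numbers depend on the machine.
$data = 1..500000

(Measure-Command { $data | ForEach-Object { $_ * 2 } }).TotalSeconds   # ForEach-Object
(Measure-Command { $data | & { process { $_ * 2 } } }).TotalSeconds    # script block with process section
(Measure-Command { foreach( $x in $data ) { $x * 2 } }).TotalSeconds   # plain foreach loop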

Answering my own question: this answer may be helpful to others who are interested in using Python.

I used a Python script to read the CSV file and save it as JSON. Dask is a Python library for parallel computing, and in my situation Dask was faster than PowerShell.

Here is my Python script using the Dask library. Note that Dask chunks the data automatically and saves the output files with a .part extension. I tested this script on a 250 MB CSV file and it converted to JSON file(s) in only about 16 seconds.

import dask.dataframe as dd

def e2f(string):
    # Normalize numeric-looking strings; non-numeric values are returned as-is.
    try:
        result = float(string)
    except ValueError:
        result = string
    return str(result).replace('.0', '').replace('nan', '')

cols = ['Name', 'Mobile', 'DOB', 'Email', 'Father Name', 'Other mobile no.', 'Pincode', 'State']

df = dd.read_csv(r'F:\csvs\stack.csv', low_memory=False, usecols=cols, sep=',', header=0,
                 converters={'Name': e2f, 'Mobile': e2f, 'DOB': e2f, 'Email': e2f, 'Father Name': e2f,
                             'Other mobile no.': e2f, 'Pincode': e2f, 'State': e2f},
                 # dtype={'Name': str, 'Mobile': str, 'DOB': str, 'Email': str, 'Father Name': str,
                 #        'Other mobile no.': str, 'Pincode': str, 'State': str},
                 ).rename(columns=lambda x: x.replace(' ', '_').replace('.', ''))

df.to_json(r'F:\csvs\myfiles', orient='records', compression='infer', date_format='epoch',
           double_precision=10, force_ascii=True, date_unit='ms', default_handler=None, lines=False)
sridharnetha

Select-Object seems expensive, especially with the calculated (custom) properties.

if (test-path stack.json) { rm stack.json }
import-csv stack.csv -UseCulture -Encoding UTF8 | 
select Name,Mobile,DOB,Email,
@{n = 'Father_Name';     e = { $_.'Father Name' } },
@{n = 'Other_mobile_no'; e = { [long]$_.'Other mobile no.' } },
Pincode,State | 
ConvertTo-Json -Compress -Depth 100 |
add-Content -Encoding ASCII stack.json

# create example
# 'name,mobile,dob,email,father name, other mobile no.,pincode,state' | set-content stack.csv
# 1..224kb | % { 'joe,5555555,5555555,me@home,joe,6666666,1234,ny' } | add-content stack.csv # 11mb csv, 1 min 11 sec
# measure-command { .\script.ps1 }

# default 1:11
# set-content 1:56
# $out = :59
# no select :24
# select but no @{} :25
# with "Father Name" :37
# with [long]"Other mobile no." 1:02
# same without [long]
js2010