2

Context

In this post:

ConvertFrom-Json with large file

I ask about deserializing a 1.2GB JSON file.

This answer posted there:

https://stackoverflow.com/a/76791900/268581

does work, but it's extremely slow.

Sample data

So that you don't have to use a 1.2GB file, here's a small data example for use with this question. It's just the first few items from the original large JSON file.

example.json:

[{"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:AMD230728C00115000", "exchange": 304, "id": null, "tape": null, "price": 0.38, "size": 1, "conditions": [227], "timestamp": 1690471217275, "sequence_number": 1477738810, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:AFRM230728C00019500", "exchange": 302, "id": null, "tape": null, "price": 0.07, "size": 10, "conditions": [209], "timestamp": 1690471217278, "sequence_number": 1477739110, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 325, "id": null, "tape": null, "price": 4.8, "size": 7, "conditions": [219], "timestamp": 1690471217282, "sequence_number": 341519150, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 1, "conditions": [209], "timestamp": 1690471217282, "sequence_number": 341519166, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 1, "conditions": [209], "timestamp": 1690471217282, "sequence_number": 341519167, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 319, "id": null, "tape": null, "price": 4.8, "size": 5, "conditions": [219], "timestamp": 1690471217282, "sequence_number": 341519170, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 19, "conditions": [209], "timestamp": 1690471217284, "sequence_number": 341519682, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 301, "id": null, "tape": null, "price": 4.8, "size": 2, "conditions": [219], "timestamp": 1690471217290, "sequence_number": 341519926, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 301, "id": null, "tape": null, "price": 4.8, "size": 15, "conditions": [219], "timestamp": 1690471217290, "sequence_number": 341519927, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:META230728C00315000", "exchange": 302, "id": null, "tape": null, "price": 4.76, "size": 1, "conditions": [227], "timestamp": 1690471217323, "sequence_number": 1290750877, "trf_id": null, "trf_timestamp": null}]

Code

Here's (slow) code that works. It takes hours to run on the 1.2GB file.

$path = ".\example.json"

$stream = [System.IO.File]::Open($path, [System.IO.FileMode]::Open)

$i = 0
$stream.ReadByte() # read '['
$i++

$json = ''

$data = @()

while ($i -lt $stream.Length)
{
    $byte = $stream.ReadByte(); $i++

    $char = [Convert]::ToChar($byte)
            
    if ($char -eq '}')
    {
        $json = $json + [Convert]::ToChar($byte)
        
        $data = $data + ($json | ConvertFrom-Json)

        $json = ''

        $stream.ReadByte() | Out-Null # read comma;
        $i++

        if ($data.Count % 100 -eq 0)
        {
            Write-Host $data.Count
        }
    }
    else
    {
        $json = $json + [Convert]::ToChar($byte)
    }
}

$stream.Close()

After running it, you should have the records in $data:

PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> $data | ft *

py/object                                   event_type symbol                exchange id tape price size conditions     timestamp sequence_number trf_id trf_timestamp
---------                                   ---------- ------                -------- -- ---- ----- ---- ----------     --------- --------------- ------ -------------
polygon.websocket.models.models.EquityTrade T          O:AMD230728C00115000       304          0.38    1 {227}      1690471217275      1477738810
polygon.websocket.models.models.EquityTrade T          O:AFRM230728C00019500      302          0.07   10 {209}      1690471217278      1477739110
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      325           4.8    7 {219}      1690471217282       341519150
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312           4.8    1 {209}      1690471217282       341519166
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312           4.8    1 {209}      1690471217282       341519167
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      319           4.8    5 {219}      1690471217282       341519170
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312           4.8   19 {209}      1690471217284       341519682
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      301           4.8    2 {219}      1690471217290       341519926
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      301           4.8   15 {219}      1690471217290       341519927
polygon.websocket.models.models.EquityTrade T          O:META230728C00315000      302          4.76    1 {227}      1690471217323      1290750877

Question

What's a good way to make this more efficient?

Notes

This answer:

https://stackoverflow.com/a/43747641/268581

does illustrate an approach for C# using Newtonsoft Json.NET.

Here's the code for it:

JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        // deserialize only when there's "{" character in the stream
        if (reader.TokenType == JsonToken.StartObject)
        {
            o = serializer.Deserialize<MyObject>(reader);
        }
    }
}

One approach would be to download the Newtonsoft Json.NET DLL, and convert the above to PowerShell. One challenge is this line:

o = serializer.Deserialize<MyObject>(reader);

As you can see, it's making a generic method call. It's not clear to me how this would be translated to Windows PowerShell 5.1.

A solution that only depends on native JSON deserialization libraries would be preferred, but the Newtonsoft approach would be acceptable if necessary.

dharmatech
  • 8,979
  • 8
  • 42
  • 88
  • 1
    PowerShell is designed for convenience, not performance. If you need performance, then you need to use a different language, like Python or C#. PowerShell usually translates pretty easily to C#. – Tim Roberts Jul 29 '23 at 04:50
  • 1
    A 1.2GB Json will be translated more than likely into an array of objects > 2GB and since you're using PowerShell 5.1 you should note that arrays [are limited to 2GB in .NET Framework](https://learn.microsoft.com/en-us/dotnet/api/system.array?view=net-7.0#remarks) which will lead to the exact same OutOfMemory Exception you had in previous question except that your current code is so slow its taking much longer for you to see that exception happening – Santiago Squarzon Jul 29 '23 at 04:57
  • @SantiagoSquarzon Is that 2GB limitation lifted in PowerShell 7? I would consider using that version for this application if so. – dharmatech Jul 29 '23 at 04:59
  • 1
    It might be possible that this works in pwsh 7+ mainly because it uses a much better Json parser, you might face the same array limitation there too but at least you have access to [`JsonObject.ConvertFromJson`](https://docs.microsoft.com/en-us/dotnet/api/microsoft.powershell.commands.jsonobject.convertfromjson?view=powershellsdk-7.3.0) to combine with a stream reading approach. source is open too: https://github.com/PowerShell/PowerShell/blob/91447e62dec9fdbd5d24822eb8fefcc2b216d1b1/src/Microsoft.PowerShell.Commands.Utility/commands/utility/WebCmdlet/ConvertFromJsonCommand.cs#L16 – Santiago Squarzon Jul 29 '23 at 05:18
  • @SantiagoSquarzon !!! Here's something that works! https://stackoverflow.com/a/76792191/268581 – dharmatech Jul 29 '23 at 05:34
  • @TimRoberts I've posted a solution. Runs plenty fast. – dharmatech Jul 29 '23 at 05:35

1 Answers1

4

Here's a version which works on PowerShell 7.3.6.

It only takes a few minutes and this is acceptable for my case.

Add-Type -Path 'C:\Users\dharm\OneDrive\Documents\WindowsPowerShell\Modules\newtonsoft.json\1.0.2.201\libs\Newtonsoft.Json.dll'

class Row {
    [string]$symbol
    [Int64]$timestamp
}

$serializer = [Newtonsoft.Json.JsonSerializer]::new()

# $stream = [System.IO.File]::Open("..\polygon-io.py\data-2023-07-27-13-33-26.json", [System.IO.FileMode]::Open)

$stream = [System.IO.File]::Open("C:\Users\dharm\Dropbox\Documents\polygon-io.py\data-2023-07-27-13-33-26.json", [System.IO.FileMode]::Open)

$reader = [System.IO.StreamReader]::new($stream)

$json_reader = [Newtonsoft.Json.JsonTextReader]::new($reader)

$ls = New-Object System.Collections.Generic.List[Row]

$i = 0

while ($json_reader.Read())
{
    if ($i % 1000 -eq 0) { Write-Host $i }
    
    if ($json_reader.TokenType -eq [Newtonsoft.Json.JsonToken]::StartObject)
    {        
        $obj = $serializer.Deserialize[Row]($json_reader)
        
        $ls.Add($obj)
    }

    $i++
}

$stream.Close()
$reader.Close()
$json_reader.Close()

# ----------------------------------------------------------------------
# examine result
# ----------------------------------------------------------------------

$ls.Count

$ls | Select-Object -First 10 | ft *
dharmatech
  • 8,979
  • 8
  • 42
  • 88