Context
In an earlier post, I asked about deserializing a 1.2GB JSON file.
An answer posted there does work, but it's extremely slow.
Sample data
So that you don't have to use a 1.2GB file, here's a small data example for use with this question. It's just the first few items from the original large JSON file.
example.json:
[{"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:AMD230728C00115000", "exchange": 304, "id": null, "tape": null, "price": 0.38, "size": 1, "conditions": [227], "timestamp": 1690471217275, "sequence_number": 1477738810, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:AFRM230728C00019500", "exchange": 302, "id": null, "tape": null, "price": 0.07, "size": 10, "conditions": [209], "timestamp": 1690471217278, "sequence_number": 1477739110, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 325, "id": null, "tape": null, "price": 4.8, "size": 7, "conditions": [219], "timestamp": 1690471217282, "sequence_number": 341519150, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 1, "conditions": [209], "timestamp": 1690471217282, "sequence_number": 341519166, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 1, "conditions": [209], "timestamp": 1690471217282, "sequence_number": 341519167, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 319, "id": null, "tape": null, "price": 4.8, "size": 5, "conditions": [219], "timestamp": 1690471217282, "sequence_number": 341519170, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 
19, "conditions": [209], "timestamp": 1690471217284, "sequence_number": 341519682, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 301, "id": null, "tape": null, "price": 4.8, "size": 2, "conditions": [219], "timestamp": 1690471217290, "sequence_number": 341519926, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 301, "id": null, "tape": null, "price": 4.8, "size": 15, "conditions": [219], "timestamp": 1690471217290, "sequence_number": 341519927, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:META230728C00315000", "exchange": 302, "id": null, "tape": null, "price": 4.76, "size": 1, "conditions": [227], "timestamp": 1690471217323, "sequence_number": 1290750877, "trf_id": null, "trf_timestamp": null}]
Code
Here's (slow) code that works. It takes hours to run on the 1.2GB file.
$path = ".\example.json"
$stream = [System.IO.File]::Open($path, [System.IO.FileMode]::Open)
$i = 0
$stream.ReadByte() | Out-Null # read '['
$i++
$json = ''
$data = @()
while ($i -lt $stream.Length)
{
    $byte = $stream.ReadByte(); $i++
    $char = [Convert]::ToChar($byte)
    if ($char -eq '}')
    {
        # End of one object: parse the accumulated text and append the result
        $json = $json + [Convert]::ToChar($byte)
        $data = $data + ($json | ConvertFrom-Json)
        $json = ''
        $stream.ReadByte() | Out-Null # read comma;
        $i++
        if ($data.Count % 100 -eq 0)
        {
            Write-Host $data.Count
        }
    }
    else
    {
        $json = $json + [Convert]::ToChar($byte)
    }
}
$stream.Close()
After running it, you should have the records in $data:
PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> $data | ft *
py/object                                   event_type symbol                exchange id tape price size conditions     timestamp sequence_number trf_id trf_timestamp
---------                                   ---------- ------                -------- -- ---- ----- ---- ----------     --------- --------------- ------ -------------
polygon.websocket.models.models.EquityTrade T          O:AMD230728C00115000       304          0.38    1 {227}      1690471217275      1477738810
polygon.websocket.models.models.EquityTrade T          O:AFRM230728C00019500      302          0.07   10 {209}      1690471217278      1477739110
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      325           4.8    7 {219}      1690471217282       341519150
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312           4.8    1 {209}      1690471217282       341519166
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312           4.8    1 {209}      1690471217282       341519167
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      319           4.8    5 {219}      1690471217282       341519170
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312           4.8   19 {209}      1690471217284       341519682
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      301           4.8    2 {219}      1690471217290       341519926
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      301           4.8   15 {219}      1690471217290       341519927
polygon.websocket.models.models.EquityTrade T          O:META230728C00315000      302          4.76    1 {227}      1690471217323      1290750877
Question
What's a good way to make this more efficient?
Notes
Another answer does illustrate an approach for C# using Newtonsoft Json.NET.
Here's the code for it:
JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        // deserialize only when there's "{" character in the stream
        if (reader.TokenType == JsonToken.StartObject)
        {
            o = serializer.Deserialize<MyObject>(reader);
        }
    }
}
One approach would be to download the Newtonsoft Json.NET DLL, and convert the above to PowerShell. One challenge is this line:
o = serializer.Deserialize<MyObject>(reader);
As you can see, it's making a generic method call. It's not clear to me how this would be translated to Windows PowerShell 5.1.
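One possible way around the generic call, as a minimal sketch rather than a verified solution: Json.NET's JsonSerializer also has a non-generic overload, Deserialize(JsonReader, Type), which PowerShell can call directly. The function name Read-JsonArrayItems, its parameters, and the choice of Dictionary[string, object] as the target type are illustrative assumptions, not part of the linked C# answer. It has not been timed against the 1.2GB file.

```powershell
# Hypothetical helper: the name and both parameters are illustrative.
function Read-JsonArrayItems {
    param(
        [Parameter(Mandatory)] [string] $Path,
        # Assumed location of a downloaded Newtonsoft.Json.dll.
        # Omit this if the assembly is already loaded in the session.
        [string] $JsonNetDll
    )

    if ($JsonNetDll) { Add-Type -Path $JsonNetDll }

    # Passing the target type as an argument avoids Deserialize<T>.
    # Nested arrays such as "conditions" arrive as Json.NET JArray values.
    $itemType   = [System.Collections.Generic.Dictionary[string, object]]
    $serializer = [Newtonsoft.Json.JsonSerializer]::new()
    # A List grows in place, unlike "$data = $data + ...", which copies the array.
    $items      = [System.Collections.Generic.List[object]]::new()

    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $sr       = [System.IO.StreamReader]::new($fullPath)
    $reader   = [Newtonsoft.Json.JsonTextReader]::new($sr)
    try {
        while ($reader.Read()) {
            # Deserialize one array element each time an object starts.
            if ($reader.TokenType -eq [Newtonsoft.Json.JsonToken]::StartObject) {
                $items.Add($serializer.Deserialize($reader, $itemType))
            }
        }
    }
    finally {
        $reader.Close()
        $sr.Dispose()
    }

    # The leading comma returns the list itself instead of enumerating it.
    , $items
}
```

Assuming the DLL sits next to the script, usage might look like `$data = Read-JsonArrayItems -Path .\example.json -JsonNetDll .\Newtonsoft.Json.dll`, after which a field is read with `$data[0]['symbol']`.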
A solution that only depends on native JSON deserialization libraries would be preferred, but the Newtonsoft approach would be acceptable if necessary.