
I have a JSON file that is 1.19GB in size:

PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> dir ..\polygon-io.py\data-2023-07-27-13-33-26.json | Format-Table Name, @{ E = 'Length'; F = 'N' } 

Name                                    Length
----                                    ------
data-2023-07-27-13-33-26.json 1,191,614,699.00

If I run the following to deserialize the JSON data, I get a System.OutOfMemoryException:

PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> $data_python = Get-Content ..\polygon-io.py\data-2023-07-27-13-33-26.json | ConvertFrom-Json
Get-Content : Exception of type 'System.OutOfMemoryException' was thrown.
At line:1 char:16
+ ... ta_python = Get-Content ..\polygon-io.py\data-2023-07-27-13-33-26.jso ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [Get-Content], OutOfMemoryException
    + FullyQualifiedErrorId : ProviderContentReadError,Microsoft.PowerShell.Commands.GetContentCommand

At the time I ran that, the system was using 31GB of its 64GB of RAM.

Is there a way to import this large file using PowerShell?

Update : Get-Content

The issue appears to have nothing to do with ConvertFrom-Json.

If I simply call Get-Content on the file, the same exception occurs:

PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> $json = Get-Content ..\polygon-io.py\data-2023-07-27-13-33-26.json
Get-Content : Exception of type 'System.OutOfMemoryException' was thrown.
At line:1 char:9
+ $json = Get-Content ..\polygon-io.py\data-2023-07-27-13-33-26.json
+         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [Get-Content], OutOfMemoryException
    + FullyQualifiedErrorId : ProviderContentReadError,Microsoft.PowerShell.Commands.GetContentCommand

Update : Read

If I read the file in 100MB chunks using `FileStream.Read` and concatenate the chunks into a string:

$from = "..\polygon-io.py\data-2023-07-27-13-33-26.json"
$upperBound = 100MB
$result = ""

$fromFile = [io.file]::OpenRead($from)
$buff = new-object byte[] $upperBound
$count = 0
try {
    do {
        "Reading $upperBound"
        $count = $fromFile.Read($buff, 0, $buff.Length)
        if ($count -gt 0) {
            "Writing $count to result"
            $result = $result + [System.Text.Encoding]::UTF8.GetString($buff, 0, $count)
        }
    } while ($count -gt 0)
}
finally {
    $fromFile.Close()
}

it runs into an exception after 11 iterations:

Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Exception of type 'System.OutOfMemoryException' was thrown.
At line:14 char:13
+             $result = $result + [System.Text.Encoding]::UTF8.GetStrin ...
+             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (:) [], OutOfMemoryException
    + FullyQualifiedErrorId : System.OutOfMemoryException

This approach is a modification of this answer.

Update : StringBuilder

Appending to a StringBuilder:

$from = "..\polygon-io.py\data-2023-07-27-13-33-26.json"
$upperBound = 100MB

$sb = [System.Text.StringBuilder]::new()

$fromFile = [io.file]::OpenRead($from)
$buff = new-object byte[] $upperBound
$count = 0
try {
    do {
        "Reading $upperBound"
        $count = $fromFile.Read($buff, 0, $buff.Length)
        if ($count -gt 0) {
            "Writing $count to result"
            $sb.Append([System.Text.Encoding]::UTF8.GetString($buff, 0, $count))
        }
    } while ($count -gt 0)
}
finally {
    $fromFile.Close()
}

works:

Reading 104857600
Writing 104857600 to result

 Capacity MaxCapacity    Length
 -------- -----------    ------
104857600  2147483647 104857600
Reading 104857600
Writing 104857600 to result
209715200  2147483647 209715200
Reading 104857600
Writing 104857600 to result
314572800  2147483647 314572800
Reading 104857600
Writing 104857600 to result
419430400  2147483647 419430400
Reading 104857600
Writing 104857600 to result
524288000  2147483647 524288000
Reading 104857600
Writing 104857600 to result
629145600  2147483647 629145600
Reading 104857600
Writing 104857600 to result
734003200  2147483647 734003200
Reading 104857600
Writing 104857600 to result
838860800  2147483647 838860800
Reading 104857600
Writing 104857600 to result
943718400  2147483647 943718400
Reading 104857600
Writing 104857600 to result
...576000  2147483647 ...576000
Reading 104857600
Writing 104857600 to result
...433600  2147483647 ...433600
Reading 104857600
Writing 38181099 to result
...614699  2147483647 ...614699
Reading 104857600

But converting it to a string runs into the same exception, presumably because `ToString` must allocate one contiguous string: at two bytes per UTF-16 character, ~1.19 billion characters is roughly 2.4GB, which exceeds the CLR's 2GB limit for a single object:

PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> $result = $sb.ToString()
Exception calling "ToString" with "0" argument(s): "Exception of type 'System.OutOfMemoryException' was thrown."
At line:1 char:1
+ $result = $sb.ToString()
+ ~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [], MethodInvocationException
    + FullyQualifiedErrorId : OutOfMemoryException
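
One possible workaround (a sketch, untested against the original file; the slice boundaries are arbitrary and will cut JSON tokens in half, so it only helps if the consumer can cope with that) is to skip calling `ToString()` on the whole builder and instead pull out bounded slices with the `ToString(startIndex, length)` overload:

```powershell
# Sketch: read the StringBuilder back out in 100MB slices instead of one
# contiguous string. $sb is the StringBuilder filled in the loop above.
$sliceSize = 100MB
for ($offset = 0; $offset -lt $sb.Length; $offset += $sliceSize) {
    $length = [Math]::Min($sliceSize, $sb.Length - $offset)
    $slice  = $sb.ToString($offset, $length)  # ToString(startIndex, length)
    # hand $slice to a streaming consumer here; don't accumulate the slices
}
```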

Creating a large file to test this

You can create a 1.2GB file (filled with zero bytes) on Windows via the following command:

fsutil file createNew large.txt 1200000000

Get-Content throws an exception on this file on my system:

PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> $large = Get-Content .\large.txt
Get-Content : Exception of type 'System.OutOfMemoryException' was thrown.
At line:1 char:10
+ $large = Get-Content .\large.txt
+          ~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (:) [Get-Content], OutOfMemoryException
    + FullyQualifiedErrorId : ProviderContentReadError,Microsoft.PowerShell.Commands.GetContentCommand
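
For what it's worth, reading with a `StreamReader` in bounded chunks (along the lines iRon suggested in the comments) avoids ever materializing one huge string. A sketch, with the per-chunk processing left open:

```powershell
# Sketch: read the file in 10MB character chunks and process each chunk,
# without accumulating the chunks into a single string.
$reader = [System.IO.StreamReader]::new('.\large.txt')
$buffer = [char[]]::new(10MB)
try {
    while (($read = $reader.ReadBlock($buffer, 0, $buffer.Length)) -gt 0) {
        $chunk = [string]::new($buffer, 0, $read)
        # process (or write out) $chunk here
    }
}
finally {
    $reader.Dispose()
}
```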

Update : Comparison with Python

This Python code can load the file, deserialize it, and load it into pandas (presumably because CPython has no 2GB single-object limit and stores ASCII text at one byte per character):

from polygon import WebSocketClient
from polygon.websocket.models import WebSocketMessage

import pandas as pd
import jsonpickle

with open(r'c:\users\dharm\dropbox\documents\polygon-io.py\data-2023-07-27-13-33-26.json') as file:
    json = file.read()

data = jsonpickle.decode(json)

df = pd.DataFrame(data)

Update : fast solution

Here's an approach for PowerShell 7.3.6 which runs in a few minutes:

https://stackoverflow.com/a/76792191/268581

dharmatech
  • you should look into json stream reading and splitting it into smaller chunk files. powershell might not be the best language to handle this problem – Santiago Squarzon Jul 28 '23 at 02:40
  • big file with json is bad choice. You need whole object in memory in most cases. – Deadooshka Jul 28 '23 at 05:55
  • @SantiagoSquarzon Looks like the issue isn't due to `ConvertFrom-Json`. The exception also happens with `Get-Content` alone. I updated the question to show this. – dharmatech Jul 28 '23 at 06:19
  • I wonder whether there are any line breaks in the file considering that it also fails on `Get-Content`. – iRon Jul 28 '23 at 06:30
  • @iRon I updated the question with a couple of more approaches (`Read` and `StringBuilder`). Both also run into exceptions. – dharmatech Jul 28 '23 at 06:59
  • @iRon I added a section to the question which shows how to create a 1.2GB file to test this out. I'd assume that file has no newlines. – dharmatech Jul 28 '23 at 07:06
  • provide a json structure. Maybe you need to split this json into array items. if you have an array like [{...}, {...}, ....] then you can parse those single json objects. – Deadooshka Jul 28 '23 at 08:42
  • You should at least be able to `ReadBlock` from a [`StreamReader`](https://learn.microsoft.com/dotnet/api/system.io.streamreader) and then try something like @Deadooshka suggests. – iRon Jul 28 '23 at 10:06
  • you can try `[System.IO.File]::ReadLines(path\to\file) | ConvertFrom-Json`, but be aware `ConvertFrom-Json` needs to consume all input beforehand, so it's very likely you will encounter the same out-of-memory exception, this time thrown by the json cmdlet – Santiago Squarzon Jul 28 '23 at 12:33
  • If you are more clear on the structure of the concerned Json file, it might be resolved by streaming it in the way answered here: [How to parse huge JSON file as stream in Json.NET?](https://stackoverflow.com/q/43747477/1701026). Again: apparently you are now able to read it (partly). Are there any newlines? Is it a (large) array (starting with a square bracket "`[`")? – iRon Jul 28 '23 at 15:38
  • @iRon Thank you very much for the link to that post. That approach does appear promising. I've added an answer including an approach based on that. It does seem to work, however, it's very slow. It seems it can be optimized however. [Link to answer I posted](https://stackoverflow.com/a/76791900/268581). – dharmatech Jul 29 '23 at 03:06

2 Answers


EDIT: You answered your own question!

System.Text.StringBuilder -ge System.String

Since you can load it using StringBuilder, there's no advantage to converting it back to a string. StringBuilder also offers better performance and gives you mutable strings, instead of immutable `System.String` instances.

Old answer: I believe your best option would be to use the `newtonsoft.json` module. Once the module is installed, all you need to do is run

Import-Module newtonsoft.json

Then slightly modify the last part of your one-liner to look like this:

$data_python = Get-Content ..\polygon-io.py\data-2023-07-27-13-33-26.json | ConvertFrom-JsonNewtonsoft

This module uses Json.NET, a high-performance .NET library, to deserialize JSON. It should be able to handle your large JSON file without issue.

nevarDeath
  • Thanks for the suggestion! That approach ran into the same issue, though. I updated the question to show that the issue occurs with just `Get-Content`. – dharmatech Jul 28 '23 at 06:18

User iRon suggested in a comment streaming the file in the way described in [How to parse huge JSON file as stream in Json.NET?](https://stackoverflow.com/q/43747477/1701026).

Here's a link to an answer in that post:

https://stackoverflow.com/a/43747641/268581

Here's an approach based on that answer:

$stream = [System.IO.File]::Open("..\polygon-io.py\data-2023-07-27-13-33-26.json", [System.IO.FileMode]::Open)

$stream.ReadByte() | Out-Null # consume the opening '['

$json = ''

$data = @()

for ($i = 1; $i -lt $stream.Length; $i++)
{
    $char = [Convert]::ToChar($stream.ReadByte())

    $json = $json + $char

    if ($char -eq '}') # end of one object in the top-level array
    {
        $data = $data + ($json | ConvertFrom-Json)

        $json = ''

        $stream.ReadByte() | Out-Null # consume the separating comma

        if ($data.Count % 100 -eq 0)
        {
            Write-Host $data.Count
        }
    }
}

$stream.Close()

The only drawback is that it is excruciatingly slow. However, it seems this can be optimized.
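
One likely optimization (a sketch, under the assumption that the Newtonsoft.Json assembly is loadable — e.g. the copy that ships with PowerShell 7 — and that the file is a single top-level array of objects) is to let Json.NET's `JsonTextReader` do the tokenizing instead of reading byte-by-byte:

```powershell
# Sketch: stream the array with Json.NET instead of ReadByte/ConvertFrom-Json.
# Assumes Newtonsoft.Json is loaded and the file is one top-level JSON array.
$reader = [Newtonsoft.Json.JsonTextReader]::new(
    [System.IO.StreamReader]::new('..\polygon-io.py\data-2023-07-27-13-33-26.json'))
$serializer = [Newtonsoft.Json.JsonSerializer]::new()
$data = [System.Collections.Generic.List[object]]::new()  # avoids O(n^2) array appends

while ($reader.Read()) {
    if ($reader.TokenType -eq [Newtonsoft.Json.JsonToken]::StartObject) {
        # Deserialize exactly one object from the stream, then continue.
        $data.Add($serializer.Deserialize($reader))
    }
}
$reader.Close()
```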

dharmatech
  • Apart from the fact that there are quite some pitfalls in this prototype (e.g. the "`}`" might be used at several levels, or simply included in a string value), this looks like the correct way to go. Note that `$json = $json + ...` is similar to `$json += ...`, which is pretty expensive; see: [Is there a string concatenation shortcut in PowerShell?](https://stackoverflow.com/a/70093215/1701026) and [PowerShell Performance Considerations/String Addition](https://learn.microsoft.com/powershell/scripting/dev-cross-plat/performance/script-authoring-considerations#string-addition) – iRon Jul 29 '23 at 10:39