I have a JSON file that is 1.19GB in size:
PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> dir ..\polygon-io.py\data-2023-07-27-13-33-26.json | Format-Table Name, @{ E = 'Length'; F = 'N' }
Name Length
---- ------
data-2023-07-27-13-33-26.json 1,191,614,699.00
If I run the following to deserialize the JSON data, I get a System.OutOfMemoryException:
PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> $data_python = Get-Content ..\polygon-io.py\data-2023-07-27-13-33-26.json | ConvertFrom-Json
Get-Content : Exception of type 'System.OutOfMemoryException' was thrown.
At line:1 char:16
+ ... ta_python = Get-Content ..\polygon-io.py\data-2023-07-27-13-33-26.jso ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (:) [Get-Content], OutOfMemoryException
+ FullyQualifiedErrorId : ProviderContentReadError,Microsoft.PowerShell.Commands.GetContentCommand
At the time I ran that, the system was using 31GB of RAM. It has 64GB total:
Is there a way to import this large file using PowerShell?
Update: Get-Content
The issue appears to have nothing to do with ConvertFrom-Json. If I simply call Get-Content on the file, the same exception occurs:
PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> $json = Get-Content ..\polygon-io.py\data-2023-07-27-13-33-26.json
Get-Content : Exception of type 'System.OutOfMemoryException' was thrown.
At line:1 char:9
+ $json = Get-Content ..\polygon-io.py\data-2023-07-27-13-33-26.json
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (:) [Get-Content], OutOfMemoryException
+ FullyQualifiedErrorId : ProviderContentReadError,Microsoft.PowerShell.Commands.GetContentCommand
Update: Read
If I read the file in 100MB chunks with FileStream.Read and accumulate the result by string concatenation:
$from = "..\polygon-io.py\data-2023-07-27-13-33-26.json"
$upperBound = 100MB
$result = ""
$fromFile = [io.file]::OpenRead($from)
$buff = new-object byte[] $upperBound
$count = 0

try {
    do {
        "Reading $upperBound"
        $count = $fromFile.Read($buff, 0, $buff.Length)
        if ($count -gt 0) {
            "Writing $count to result"
            $result = $result + [System.Text.Encoding]::UTF8.GetString($buff, 0, $count)
        }
    } while ($count -gt 0)
}
finally {
    $fromFile.Close()
}
It runs into an exception after 11 iterations:
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Reading 104857600
Writing 104857600 to result
Exception of type 'System.OutOfMemoryException' was thrown.
At line:14 char:13
+ $result = $result + [System.Text.Encoding]::UTF8.GetStrin ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : OperationStopped: (:) [], OutOfMemoryException
+ FullyQualifiedErrorId : System.OutOfMemoryException
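The `$result = $result + ...` pattern builds a brand-new string on every iteration, copying everything accumulated so far, so the total bytes copied grow quadratically with the number of chunks. The failure at iteration 11 is also consistent with the intermediate string crossing the .NET Framework's default 2GB single-object cap, since .NET strings store characters as UTF-16 (2 bytes each). A minimal sketch of the arithmetic (chunk size and count match the log above; the formula is general):

```python
# Appending n chunks of size s by repeated concatenation copies roughly
# s * (1 + 2 + ... + n) = s * n * (n + 1) / 2 bytes in total.
chunk = 100 * 1024 * 1024                 # 100MB, matching $upperBound above
n = 12                                    # chunks needed for a ~1.19GB file
total_copied = sum(i * chunk for i in range(1, n + 1))

# At iteration 11 the accumulated string holds ~11 chunks of ASCII text;
# as UTF-16 that is 2 bytes per character, pushing it past 2**31 - 1 bytes.
eleventh_utf16 = 11 * chunk * 2
print(total_copied, eleventh_utf16 > 2**31 - 1)
```

So the loop copies several times the file's size in total and, around the 11th append, asks for a single allocation larger than the default object-size limit, regardless of free RAM.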
This approach is a modification of this answer.
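Separately, decoding each chunk independently with UTF8.GetString is only safe if the file is pure ASCII: a multi-byte UTF-8 sequence that straddles a chunk boundary gets mangled. A small Python illustration of the hazard (the same applies in .NET, where a stateful System.Text.Decoder is the incremental equivalent):

```python
import codecs

text = "é"                      # U+00E9: two bytes in UTF-8
data = text.encode("utf-8")     # b'\xc3\xa9'

# Naive per-chunk decoding, as in the loop above, breaks when the
# two bytes land in different chunks:
naive = (data[:1].decode("utf-8", errors="replace")
         + data[1:].decode("utf-8", errors="replace"))

# An incremental decoder carries the partial sequence across chunks:
dec = codecs.getincrementaldecoder("utf-8")()
incremental = dec.decode(data[:1]) + dec.decode(data[1:])
print(repr(naive), repr(incremental))
```

With 100MB chunks the odds of splitting a multi-byte character are high for any non-ASCII data.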
Update: StringBuilder
Appending to a StringBuilder instead:
$from = "..\polygon-io.py\data-2023-07-27-13-33-26.json"
$upperBound = 100MB
$sb = [System.Text.StringBuilder]::new()
$fromFile = [io.file]::OpenRead($from)
$buff = new-object byte[] $upperBound
$count = 0

try {
    do {
        "Reading $upperBound"
        $count = $fromFile.Read($buff, 0, $buff.Length)
        if ($count -gt 0) {
            "Writing $count to result"
            $sb.Append([System.Text.Encoding]::UTF8.GetString($buff, 0, $count))
        }
    } while ($count -gt 0)
}
finally {
    $fromFile.Close()
}
This works:
Reading 104857600
Writing 104857600 to result
Capacity MaxCapacity Length
-------- ----------- ------
104857600 2147483647 104857600
Reading 104857600
Writing 104857600 to result
209715200 2147483647 209715200
Reading 104857600
Writing 104857600 to result
314572800 2147483647 314572800
Reading 104857600
Writing 104857600 to result
419430400 2147483647 419430400
Reading 104857600
Writing 104857600 to result
524288000 2147483647 524288000
Reading 104857600
Writing 104857600 to result
629145600 2147483647 629145600
Reading 104857600
Writing 104857600 to result
734003200 2147483647 734003200
Reading 104857600
Writing 104857600 to result
838860800 2147483647 838860800
Reading 104857600
Writing 104857600 to result
943718400 2147483647 943718400
Reading 104857600
Writing 104857600 to result
...576000 2147483647 ...576000
Reading 104857600
Writing 104857600 to result
...433600 2147483647 ...433600
Reading 104857600
Writing 38181099 to result
...614699 2147483647 ...614699
Reading 104857600
But converting it to a string runs into the same exception:
PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> $result = $sb.ToString()
Exception calling "ToString" with "0" argument(s): "Exception of type 'System.OutOfMemoryException' was thrown."
At line:1 char:1
+ $result = $sb.ToString()
+ ~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [], MethodInvocationException
+ FullyQualifiedErrorId : OutOfMemoryException
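This outcome makes sense: the StringBuilder survives because it stores its contents in internal chunks, but ToString() must allocate one contiguous string. .NET strings are UTF-16 (2 bytes per character), and by default the .NET Framework caps a single object at 2GB, so the allocation plausibly cannot succeed no matter how much RAM is free. The arithmetic, using the file's byte count from the listing at the top (one character per byte for ASCII JSON):

```python
# ToString() needs all characters in one contiguous .NET string.
length = 1_191_614_699       # characters (file size in bytes, ASCII JSON)
bytes_needed = length * 2    # UTF-16 storage: 2 bytes per code unit
max_object = 2**31 - 1       # default .NET Framework single-object cap
print(bytes_needed, bytes_needed > max_object)
```

So the character count fits comfortably in the builder (Length stays under MaxCapacity above), but the contiguous UTF-16 buffer would be about 2.38GB.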
Creating a large file to test this
You can create a 1.2GB file on Windows via the following command:
fsutil file createNew large.txt 1200000000
Get-Content throws an exception on this file on my system:
PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> $large = Get-Content .\large.txt
Get-Content : Exception of type 'System.OutOfMemoryException' was thrown.
At line:1 char:10
+ $large = Get-Content .\large.txt
+ ~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (:) [Get-Content], OutOfMemoryException
+ FullyQualifiedErrorId : ProviderContentReadError,Microsoft.PowerShell.Commands.GetContentCommand
Update: Comparison with Python
This Python code can load the file, deserialize it, and load it into pandas:
from polygon import WebSocketClient
from polygon.websocket.models import WebSocketMessage
import pandas as pd
import jsonpickle
file = open(r'c:\users\dharm\dropbox\documents\polygon-io.py\data-2023-07-27-13-33-26.json')
json = file.read()
data = jsonpickle.decode(json)
df = pd.DataFrame(data)
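Python likely succeeds because a 64-bit CPython str has no 2GB object cap, so the entire file fits in one string. If the file is ordinary JSON rather than output that specifically needs jsonpickle, the standard-library json module would also decode it; a minimal sketch with invented sample data standing in for the real file:

```python
import io
import json

# Tiny invented sample standing in for data-2023-07-27-13-33-26.json
sample = io.StringIO(
    '[{"sym": "AAPL", "price": 195.1}, {"sym": "MSFT", "price": 340.2}]'
)
data = json.load(sample)        # a list of dicts, as pd.DataFrame expects
print(data[0]["sym"], len(data))
```

pd.DataFrame(data) then works the same way on the decoded list of dicts.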
Update: Fast solution
Here's an approach for PowerShell 7.3.6 that runs in a few minutes: