I have a JSON file with the following example format:
{
    "Table1": {
        "Records": [
            {
                "Key1Tab1": "SomeVal",
                "Key2Tab1": "AnotherVal"
            },
            {
                "Key1Tab1": "SomeVal2",
                "Key2Tab1": "AnotherVal2"
            }
        ]
    },
    "Table2": {
        "Records": [
            {
                "Key1Tab1": "SomeVal",
                "Key2Tab1": "AnotherVal"
            },
            {
                "Key1Tab1": "SomeVal2",
                "Key2Tab1": "AnotherVal2"
            }
        ]
    }
}
The root keys are table names from an SQL database, and each key's value holds that table's rows.
I want to split the JSON file into separate Parquet files, each representing one table, i.e. Table1.parquet and Table2.parquet.
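To show the intended result: each output file should be readable as an ordinary table. With pandas (just an illustration of the goal, not a requirement) I'd expect:

import pandas as pd

# Intended content of Table1.parquet after the split
df = pd.read_parquet("Table1.parquet")
print(df)
#    Key1Tab1     Key2Tab1
# 0   SomeVal   AnotherVal
# 1  SomeVal2  AnotherVal2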
The big issue is the size of the file, which prevents me from loading it into memory. Hence, I tried dask.bag to accommodate the nested structure of the file:
import dask.bag as db
from dask.distributed import Client

client = Client(n_workers=4)  # local cluster with four workers
lines = db.read_text("filename.json")  # reads the file line by line
But inspecting the output with lines.take(4)
shows that dask doesn't handle the newlines correctly:
('{\n', ' "Table1": {\n', ' "Records": [\n', ' {\n')
I've searched for solutions to this specific problem, but without luck.
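For reference, the closest non-dask approach I can sketch is a streaming parse, assuming the ijson and pyarrow packages (I haven't verified this is the right direction):

import ijson
import pyarrow as pa
import pyarrow.parquet as pq

with open("filename.json", "rb") as f:
    # ijson.kvitems(f, "") yields one (table_name, value) pair per root key,
    # so only one table's records are materialised at a time
    for table_name, table in ijson.kvitems(f, ""):
        records = table["Records"]  # list of row dicts for this table
        pq.write_table(pa.Table.from_pylist(records), f"{table_name}.parquet")

This still holds one whole table in memory at a time, though, so it only helps if no single table exceeds RAM.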
Is there any chance the splitting can be solved with dask, or are there other tools that could do the job?