I was following this answer but after some discussion with it's writer, it seems it only gives a solution to orient='records'
data format.
This is the difference:
# orient='records'
[
{"Product":"Desktop Computer","Price":700},
{"Product":"Tablet","Price":250},
{"Product":"iPhone","Price":800},
{"Product":"Laptop","Price":1200}
]
# orient='index'
{
"0":{"Product":"Desktop Computer","Price":700},
"1":{"Product":"Tablet","Price":250},
"2":{"Product":"iPhone","Price":800},
"3":{"Product":"Laptop","Price":1200}
}
I have the index format because my data is from an SQL database read into a dataframe and the index field is needed to specify every records.
My json file is 2.5 GB, had been exported from the dataframe with orient='index'
format.
df.to_json('test.json', orient='index')
This means that the whole file is actually one huge string and not a list like collection of records:
{"0":{"Product":"Desktop Computer","Price":700},"1":{"Product":"Tablet","Price":250},"2":{"Product":"iPhone","Price":800},"3":{"Product":"Laptop","Price":1200}}
This means I can't use any line or chunck based iterative solution like this:
df = pd.read_json('test.json', orient='index', lines=True, chunksize=5)
According to the documentation, lines=True
can only be used if the records are in a list like format, this is why pandas.DataFrame.to_json
does not even accept this argument unless the orient is not orient='records'
. The restriction for chunksize=
comes from this as well, it says:
"This can only be passed if lines=True. If this is None, the file will be read into memory all at once."
And exactly this is the reason of the question, trying to read such a huge .json file gives back:
df = pd.read_json('test.json', orient='index')
File "C:\Users\Username\AppData\Local\Programs\Python\Python37\lib\site-
packages\pandas\io\json\_json.py", line 1100,
in _parse_no_numpy
loads(json, precise_float=self.precise_float),
MemoryError
I was thinking about adding the index values as a first column as well, this case it wouldn't be lost with the records format; or maybe even store an index list separately. Only I fear it would decrease the search performance later on.
Is there any solution to handle the situation strictly using the .json file and no other database or big-data based technology?
Update #1
For request here is the actual structure of my data. The SQL table:
Serial Date PatientID Type Gender YearWeek
0 425571118001461E 2011-06-30 20:59:30 186092 3 1.0 2011-w26
1 425571118001461E 2011-06-30 20:55:30 186092 3 1.0 2011-w26
2 425571118001461E 2013-08-28 09:29:30 186092 3 1.0 2013-w35
3 425571118001461E 2013-08-28 07:44:30 186092 3 1.0 2013-w35
4 425571118001461E 2013-08-27 20:44:30 186092 3 1.0 2013-w35
... ... ... ... ... ... ...
32290281 4183116300254921 2020-04-09 08:07:50 217553 8 2.0 2020-w15
32290282 4183116300254921 2020-04-08 10:29:50 217553 8 2.0 2020-w15
32290283 4141119420031548 2020-04-20 10:18:02 217555 12 2.0 2020-w17
32290284 4141119420043226 2020-04-20 12:33:11 217560 12 NaN 2020-w17
32290285 4141119420000825 2020-04-20 17:31:44 217568 12 1.0 2020-w17
The pandas pivot table is almost the same as in the example, but with a 50,000 rows and 4,000 columns:
df = df.pivot_table(index='PatientID', values='Serial', columns='YearWeek', aggfunc=len, fill_value=0)
YearWeek 1969-w01 1969-w02 1969-w03 1969-w04 1969-w05 ... 2138-w17 2138-w18 2138-w19 2138-w20 2138-w21
PatientID
0 0 0 0 0 0 ... 0 0 0 0 0
455 1 0 3 0 0 ... 0 0 0 0 0
40036 0 0 0 0 0 ... 0 0 0 0 0
40070 0 0 0 0 0 ... 0 0 0 0 0
40082 0 0 0 0 0 ... 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
217559 0 0 0 0 0 ... 0 0 0 0 0
217560 0 0 0 0 0 ... 0 0 0 0 0
217561 0 0 0 0 0 ... 0 0 0 0 0
217563 0 0 0 0 0 ... 0 0 0 0 0
217568 0 1 0 2 0 ... 0 0 0 0 0
And this is how it is saved with an index formatted json:
{
"0":{"1969-w01":0,"1969-w02":0,"1969-w03":0,"1969-w04":0, ...},
"455":{"1969-w01":1,"1969-w02":0,"1969-w03":3,"1969-w04":0, ...},
"40036":{"1969-w01":0,"1969-w02":0,"1969-w03":0,"1969-w04":0, ...},
...
"217568":{"1969-w01":0,"1969-w02":1,"1969-w03":0,"1969-w04":2, ...}
}
Only I could not give the line=True
arg, so it is actually cramped into one huge string making it a one-liner json:
{"0":{"1969-w01":0,"1969-w02":0,"1969-w03":0,"1969-w04":0, ...},"455":{"1969-w01":1,"1969-w02":0,"1969-w03":3,"1969-w04":0, ...},"40036":{"1969-w01":0,"1969-w02":0,"1969-w03":0,"1969-w04":0, ...}, ... "217568":{"1969-w01":0,"1969-w02":1,"1969-w03":0,"1969-w04":2, ...}}