I was looking for a method to save numpy data with json while preserving numpy's human readable pretty print format. Inspired by this answer, I opted to use pprint instead of base64 to write the data with my desired formatting, so that given:
import numpy as np
data = np.random.random((1,3,2))
The resulting file on disk should look like:
{
    "__dtype__": "float64",
    "__ndarray__": [[[0.7672818918130646 , 0.6846412220229668 ],
                     [0.7082023466738064 , 0.0896531267221291 ],
                     [0.43898454934160147, 0.9245898883694668 ]]]
}
A few hiccups appeared. While json could read back in lists of lists formatted as [[...]], it had issues with numpy's float formatting. For example, [[0., 0., 0.]] would generate an error when read back in, while [[0.0, 0.0, 0.0]] would be fine. pformat would output array([[0., 0., 0.]]), where array() has to be parsed out, otherwise json throws an error when reading the data back in.
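A minimal check reproduces the float parsing problem (the exact exception message depends on the Python/json version):

import json
json.loads('[[0.0, 0.0, 0.0]]')  # parses fine
json.loads('[[0., 0., 0.]]')     # raises ValueError: json rejects a bare trailing decimal point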
To fix these I had to do some string parsing, which led to my current code below:
import json, sys
import numpy as np
import pprint as pp

# Set numpy's printoptions to display all the data with max precision
np.set_printoptions(threshold=np.inf,
                    linewidth=sys.maxsize,
                    suppress=True,
                    nanstr='0.0',
                    infstr='0.0',
                    precision=np.finfo(np.longdouble).precision)

# Modified version of Adam Hughes's https://stackoverflow.com/a/27948073/1429402
def save_formatted(fname,data):

    class NumpyEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, np.ndarray):
                return {'__ndarray__': self.numpy_to_string(obj),
                        '__dtype__': str(obj.dtype)}
            return json.JSONEncoder.default(self, obj)

        def numpy_to_string(self, data):
            ''' Use pprint to generate a nicely formatted string
            '''
            # Get rid of array(...) and keep only [[...]]
            f = pp.pformat(data, width=sys.maxsize)
            f = f[6:-1].splitlines()

            # Remove indentation caused by printing "array("
            for i in xrange(1,len(f)):
                f[i] = f[i][6:]

            return '\n'.join(f)

    # Parse the json stream and fix formatting.
    # JSON doesn't support float arrays written as [0., 0., 0.]
    # so we look for the problematic numpy print syntax and correct
    # it to be readable natively by JSON, in this case: [0.0, 0.0, 0.0]
    with open(fname,'w') as io:
        for line in json.dumps(data, sort_keys=False, indent=4, cls=NumpyEncoder).splitlines():
            if '"__ndarray__": "' in line:
                index = line.index('"__ndarray__": "')
                lines = line.split('"__ndarray__": "')[-1][:-1]
                lines = lines.replace('. ','.0')  # convert occurrences of ". " to ".0"  ex: 3. , 2. ]
                lines = lines.replace('.,','.0,') # convert occurrences of ".," to ".0,"  ex: 3., 2.,
                lines = lines.replace('.]','.0]') # convert occurrences of ".]" to ".0]"  ex: 3., 2.]
                lines = lines.split('\\n')

                # write each line with appropriate indentation
                for i in xrange(len(lines)):
                    if i == 0:
                        indent = ' '*index
                        io.write(('%s"__ndarray__": %s\n"'%(indent,lines[i]))[:-1])
                    else:
                        indent = ' '*(index+len('"__ndarray__": "')-1)
                        io.write('%s%s\n'%(indent,lines[i]))
            else:
                io.write('%s\n'%line)


def load_formatted(fname):

    def json_numpy_obj_hook(dct):
        if isinstance(dct, dict) and '__ndarray__' in dct:
            return np.array(dct['__ndarray__']).astype(dct['__dtype__'])
        return dct

    with open(fname,'r') as io:
        return json.load(io, object_hook=json_numpy_obj_hook)
To test:
data = np.random.random((200,3,1000))
save_formatted('test.data', data)
data_ = load_formatted('test.data')
print np.allclose(data,data_) # Returns True
QUESTION
My solution suits me, but the string parsing aspect of it makes it slow with large data arrays. Would there be a better way to achieve the desired effect? Could a regular expression replace my sequence of str.replace() calls? Or maybe pprint can be used to format my string correctly in the first place? Is there a better way to make json write lists like numpy's print formatting?
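For reference, the kind of single-pass substitution I have in mind is sketched below (untested; it pads any decimal point not followed by a digit, so it keeps the alignment space that my ". " replace consumes, but the result should still be valid JSON):

import re

def pad_bare_decimals(s):
    # e.g. turns "[0., 3. , 2.]" into "[0.0, 3.0 , 2.0]"
    return re.sub(r'\.(?!\d)', '.0', s)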