
I was looking for a method to save numpy data with json while preserving numpy's human-readable pretty-print format.

Inspired by this answer, I opted to use pprint instead of base64 to write the data with my desired formatting, so that given:

import numpy as np
data = np.random.random((1,3,2))

The resulting file on disk should look like:

{
    "__dtype__": "float64", 
    "__ndarray__": [[[0.7672818918130646 , 0.6846412220229668 ],
                     [0.7082023466738064 , 0.0896531267221291 ],
                     [0.43898454934160147, 0.9245898883694668 ]]]
}

A few hiccups appeared.

  • While json could read lists of lists formatted as [[...]] back in, it chokes on numpy's float shorthand: JSON requires a digit after the decimal point, so [[0., 0., 0.]] generates an error when read back in, while [[0.0, 0.0, 0.0]] is fine.

  • pformat would output array([[0., 0., 0.]]), so the array(...) wrapper has to be stripped out, otherwise json throws an error when reading the data back in.
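The first hiccup is easy to reproduce in isolation, since JSON's grammar requires at least one digit after the decimal point:

```python
import json

# The explicit form parses fine...
print(json.loads('[[0.0, 0.0, 0.0]]'))

# ...but numpy's trailing-dot shorthand is not valid JSON.
try:
    json.loads('[[0., 0., 0.]]')
except ValueError as e:
    print('rejected:', e)
```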

To fix these I had to do some string parsing, which led to my current code below:

import json, sys
import numpy as np
import pprint as pp

# Set numpy's printoptions to display all the data with max precision
np.set_printoptions(threshold=np.inf,
                    linewidth=sys.maxsize,
                    suppress=True,
                    nanstr='0.0',
                    infstr='0.0', 
                    precision=np.finfo(np.longdouble).precision)     



# Modified version of Adam Hughes's https://stackoverflow.com/a/27948073/1429402
def save_formatted(fname,data):

    class NumpyEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, np.ndarray):
                return {'__ndarray__': self.numpy_to_string(obj),
                        '__dtype__': str(obj.dtype)}            

            return json.JSONEncoder.default(self, obj)


        def numpy_to_string(self,data):
            ''' Use pprint to generate a nicely formatted string
            '''

            # Strip the 'array(' prefix and the trailing ')' to keep only [[...]]
            f = pp.pformat(data, width=sys.maxsize)
            f = f[6:-1].splitlines()

            # Remove the indentation caused by printing "array("
            for i in range(1, len(f)):
                f[i] = f[i][6:]

            return '\n'.join(f)


    # Parse json stream and fix formatting.
    # JSON doesn't support float arrays written as [0., 0., 0.]
    # so we look for the problematic numpy print syntax and correct
    # it to be readable natively by JSON, in this case: [0.0, 0.0, 0.0]
    with open(fname,'w') as io:
        for line in json.dumps(data, sort_keys=False, indent=4, cls=NumpyEncoder).splitlines():
            if '"__ndarray__": "' in line:
                index = line.index('"__ndarray__": "')
                lines = line.split('"__ndarray__": "')[-1][:-1]
                lines = lines.replace('. ','.0')  # convert occurrences of ". " to ".0"   ex: 3. , 2. ]
                lines = lines.replace('.,','.0,') # convert occurrences of ".," to ".0,"  ex: 3., 2.,
                lines = lines.replace('.]','.0]') # convert occurrences of ".]" to ".0]"  ex: 3., 2.]
                lines = lines.split('\\n')

                # Write each line with the appropriate indentation
                for i in range(len(lines)):
                    if i == 0:
                        indent = ' '*index
                        io.write('%s"__ndarray__": %s\n'%(indent,lines[i]))
                    else:
                        indent = ' '*(index+len('"__ndarray__": "')-1)
                        io.write('%s%s\n'%(indent,lines[i]))

            else:
                io.write('%s\n'%line)



def load_formatted(fname):

    def json_numpy_obj_hook(dct):
        if isinstance(dct, dict) and '__ndarray__' in dct:
            return np.array(dct['__ndarray__']).astype(dct['__dtype__'])        
        return dct

    with open(fname,'r') as io:
        return json.load(io, object_hook=json_numpy_obj_hook)
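As an aside, numpy can produce the bracketed string directly, without the array(...) wrapper that pformat adds, via np.array2string. If that holds up, the slicing and de-indenting in numpy_to_string becomes unnecessary (the floats still come out in numpy's "0." shorthand, though, so the JSON fix-up is still needed):

```python
import sys
import numpy as np

data = np.random.random((1, 3, 2))

# array2string honors np.set_printoptions and omits the "array("
# prefix, so there is nothing to slice off afterwards.
s = np.array2string(data, separator=', ', max_line_width=sys.maxsize)
print(s)
```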

To test:

data = np.random.random((200,3,1000))
save_formatted('test.data', data)
data_ = load_formatted('test.data')

print(np.allclose(data, data_)) # Returns True

QUESTION

My solution suits me, but the string-parsing aspect of it makes it slow with large data arrays. Would there be a better way to achieve the desired effect? Could a regular expression replace my sequence of str.replace() calls? Could pprint be made to format the string correctly in the first place? Or is there a better way to make json write lists like numpy's print formatting?
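For reference, I believe the three str.replace() calls can be collapsed into a single re.sub using a lookahead, appending a 0 to any dot that is followed by whitespace, a comma, or a closing bracket (a sketch, not benchmarked against the replace() chain):

```python
import re

line = '[[[3. , 2. ],\n [1., 0.]]]'

# One pass instead of three replace() calls: a '.' followed by
# whitespace, ',' or ']' gets a '0' appended via lookahead.
# Unlike replace('. ', '.0'), this keeps the padding spaces,
# so numpy's column alignment is preserved.
fixed = re.sub(r'\.(?=[\s,\]])', '.0', line)
print(fixed)  # '[[[3.0 , 2.0 ],\n [1.0, 0.0]]]'
```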

Fnord

1 Answer


I can't give concrete pointers, but I believe your best bet is to find an open-source pretty-print library and tweak it with the rules numpy uses (numpy is also open source, so it should not be difficult to "reverse engineer" it).

One example, found via How to prettyprint a JSON file?: https://github.com/andy-gh/pygrid/blob/master/prettyjson.py (not necessarily a good example, but it illustrates that a pretty-printer doesn't need to be big).

My confidence comes from the fact that it will be much faster to emit all those elements and the gaps between them directly than to run replace() (which I see in your code) on the result of another pretty-printer.

Even better if the routine can be rewritten in Cython.

If you are interested in parsing, ijson and the library it uses provide iterative parsing of streamed JSON, which can help if your json does not fit in RAM.
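As a rough illustration of emitting the elements directly: a recursive formatter that runs each value through Python's repr() (which never prints the "0." shorthand) produces output that is both JSON-valid and numpy-like in one pass, with no replace() step. A sketch only; the helper name is made up:

```python
import json
import numpy as np

def ndarray_to_json(arr, indent=0):
    """Emit a nested-list string, one innermost row per line
    (hypothetical helper, not a drop-in for the code above)."""
    if arr.ndim == 1:
        # repr() of a Python float always keeps a digit after the
        # decimal point, so the result is valid JSON as-is.
        return '[' + ', '.join(repr(float(x)) for x in arr) + ']'
    pad = ' ' * (indent + 1)
    body = (',\n' + pad).join(ndarray_to_json(sub, indent + 1) for sub in arr)
    return '[' + body + ']'

s = ndarray_to_json(np.zeros((1, 3, 2)))
print(s)
```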

Roman Susi