55

numpy.array.tostring doesn't seem to preserve information about matrix dimensions (see this question), requiring the user to issue a call to numpy.array.reshape.

Is there a way to serialize a numpy array to JSON format while preserving this information?

Note: The arrays may contain ints, floats or bools. It's reasonable to expect a transposed array.

Note 2: this is being done with the intent of passing the numpy array through a Storm topology using streamparse, in case such information ends up being relevant.
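
For illustration, a minimal sketch of the problem (array names are arbitrary): the raw-bytes round trip drops the shape, so the caller has to remember it and reshape by hand.

import numpy as np

a = np.arange(6, dtype=np.int32).reshape(2, 3)

# tostring/tobytes emits only the raw buffer; shape and dtype are gone
raw = a.tobytes()
b = np.frombuffer(raw, dtype=np.int32)

print(b.shape)          # (6,) -- flattened
print(b.reshape(2, 3))  # only correct if the caller still knows (2, 3)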

Louis Thibault
  • Why do you downvote? My solution is correct and works for numpy arrays of any dimension and any data type. – daniel451 Jun 07 '15 at 20:45
  • @ascenator, Downvotes aren't coming from me. Somebody's having a bad day, I guess :/ – Louis Thibault Jun 07 '15 at 20:52
  • Wow... who is downvoting a solution in a thread where he himself is not the owner?^^ Then... sorry for the inconvenience. I hope you are happy with the solution :) – daniel451 Jun 07 '15 at 20:54
  • @ascenator: Maybe because it fails on [structured arrays](http://docs.scipy.org/doc/numpy/user/basics.rec.html)? It also requires that the array be C-contiguous, and I suspect it might also do the wrong thing if an array is serialized on a little-endian system and deserialized on a big-endian system or vice versa, but I don't have the equipment to check. I'm not the downvoter and don't know the downvoter's reasons, but I wouldn't upvote it. – user2357112 Jun 07 '15 at 21:18
  • 2
    Does it need to be a text format? Because `numpy.save` and `numpy.load` (which use a binary format) *do* save the shape of the array (and the type, and the order). – Roland Smith Jun 07 '15 at 21:18
  • @RolandSmith, It has to be JSON-serializable, actually. It's a bit of a strange requirement but Storm's JSON-driven multilang protocol doesn't give me much choice :/ – Louis Thibault Jun 07 '15 at 21:21
  • @blz: Well, you could `save` it to a `StringIO`, `read` the `StringIO`, and transform the bytes with base64 or something. – user2357112 Jun 07 '15 at 21:24
  • @user2357112 yeah, it will not work without some "hacks" on structured arrays but structured arrays (in my experience) are not used very often and dealing with them is always relatively complicated, as serializing shows... – daniel451 Jun 07 '15 at 21:26
  • ...and even **more important**: the question was about serializing numpy arrays with certain matrix dimensions (so floats, ints, ...). The question was not how to serialize multi-type structured arrays. – daniel451 Jun 07 '15 at 21:27
  • @ascenator, I certainly appreciate your input for this question, so nevermind those silly downvoters. It looks like the trend has reversed, anyway. – Louis Thibault Jun 07 '15 at 21:28
  • @blz I edited the question to try and make it clear that it should be STORM compatible. But you should really have included that. We cannot read minds. :-) – Roland Smith Jun 07 '15 at 21:32
  • Thanks. My intention was **not** to defend my answer or to offend anybody. I just wanted to point out that to my mind the question was about how to serialize numpy arrays (floats, ints, ...) having variable dimensions and not about multi-type structured arrays^^ – daniel451 Jun 07 '15 at 21:32
  • @RolandSmith, I'm not sure if I agree with your last batch of edits. Serializing to JSON is enough to make it storm compatible; no need to make it more complex than it is... – Louis Thibault Jun 07 '15 at 21:38
  • @blz Feel free to undo them. :-) But I think storm and/or JSON should be mentioned since they are relevant to the question. – Roland Smith Jun 07 '15 at 21:42
  • Yeah, I think the question is more about whether you want "just normal" numpy arrays to be serialized or really all scipy/numpy array objects one can think of, including multi-type structured arrays and stuff. I really thought of "just normal" arrays when I read your question and I guess this is what @RolandSmith wanted to point out with "STORM-compatible"?! – daniel451 Jun 07 '15 at 21:42
  • @RolandSmith, done. I just thought a friendly heads-up would be polite :) – Louis Thibault Jun 07 '15 at 21:46
  • @ascenator, currently, arrays of ints, floats and bools are all that's needed. It's reasonable to expect transposed arrays. I'll update the question. – Louis Thibault Jun 07 '15 at 21:47
  • @blz That's exactly what I thought of when I read your question... for all these mentioned purposes my solution works well :) – daniel451 Jun 07 '15 at 21:48
  • Have you tried jsonpickle? – Eelco Hoogendoorn Oct 06 '16 at 07:22

9 Answers

74

pickle.dumps or numpy.save encode all the information needed to reconstruct an arbitrary NumPy array, even in the presence of endianness issues, non-contiguous arrays, or weird structured dtypes. Endianness issues are probably the most important; you don't want array([1]) to suddenly become array([16777216]) because you loaded your array on a big-endian machine. pickle is probably the more convenient option, though save has its own benefits, given in the npy format rationale.

I'm giving options for serializing to JSON or a bytestring, because the original questioner needed JSON-serializable output, but most people coming here probably don't.

The pickle way:

import json
import pickle
a = # some NumPy array

# Bytestring option
serialized = pickle.dumps(a)
deserialized_a = pickle.loads(serialized)

# JSON option
# latin-1 maps byte n to unicode code point n
serialized_as_json = json.dumps(pickle.dumps(a).decode('latin-1'))
deserialized_from_json = pickle.loads(json.loads(serialized_as_json).encode('latin-1'))

numpy.save uses a binary format, and it needs to write to a file, but you can get around that with io.BytesIO:

import io
import numpy

a = # any NumPy array
memfile = io.BytesIO()
numpy.save(memfile, a)

serialized = memfile.getvalue()
serialized_as_json = json.dumps(serialized.decode('latin-1'))
# latin-1 maps byte n to unicode code point n

And to deserialize:

memfile = io.BytesIO()

# If you're deserializing from a bytestring:
memfile.write(serialized)
# Or if you're deserializing from JSON:
# memfile.write(json.loads(serialized_as_json).encode('latin-1'))
memfile.seek(0)
a = numpy.load(memfile)
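
If you'd rather embed the bytes in the JSON as base64 instead of a latin-1 string (as suggested in the comments above), a sketch of the same save/load round trip would look roughly like this:

import base64
import io
import json
import numpy

a = numpy.arange(12).reshape(3, 4)

# serialize: npy bytes -> base64 text -> JSON
memfile = io.BytesIO()
numpy.save(memfile, a)
serialized_as_json = json.dumps(base64.b64encode(memfile.getvalue()).decode('ascii'))

# deserialize: JSON -> base64 text -> npy bytes -> array
memfile = io.BytesIO(base64.b64decode(json.loads(serialized_as_json)))
a_restored = numpy.load(memfile)
assert (a == a_restored).all()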
user2357112
  • 3
    Can you explain why `json.dumps(memfile.read().decode('latin-1'))` is included? – FGreg Mar 02 '16 at 19:44
  • 2
    @FGreg: It's there to serialize the raw bytes to JSON, because the questioner asked for JSON output. I don't remember why I *didn't* put something like that for the `pickle` option; it was probably related to bytestring vs unicode string issues. – user2357112 Mar 02 '16 at 20:23
  • 6
    In python 3, I had to replace `StringIO.StringIO()` with `io.BytesIO()` as [hinted here](http://stackoverflow.com/a/36187468/777285). – Arnaud P Feb 23 '17 at 09:07
18

EDIT: As noted in the comments on the question, this solution deals with "normal" numpy arrays (floats, ints, bools, ...) and not with multi-type structured arrays.

Solution for serializing a numpy array of any dimensions and data types

As far as I know you cannot simply serialize a numpy array of arbitrary data type and dimensionality to JSON... but you can store its data type, shape and data in a list representation and then serialize that with JSON.

Imports needed:

import json
import base64
import numpy

For encoding you could use (nparray is some numpy array of any data type and any dimensionality):

json.dumps([str(nparray.dtype), base64.b64encode(nparray.tobytes()).decode('ascii'), nparray.shape])

After this you get a JSON dump (string) of your data, containing a list representation of its data type and shape as well as the array's data/contents base64-encoded.

And for decoding this does the work (encStr is the encoded JSON string, loaded from somewhere):

# get the encoded json dump
enc = json.loads(encStr)

# build the numpy data type
dataType = numpy.dtype(enc[0])

# decode the base64 encoded numpy array data and create a new numpy array with this data & type
dataArray = numpy.frombuffer(base64.b64decode(enc[1]), dataType)

# if the array had more than one dimension it has to be reshaped
# (reshape returns a new array, so assign the result)
if len(enc) > 2:
    dataArray = dataArray.reshape(enc[2])   # the numpy array restored to its original shape
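
A quick round-trip check putting the two snippets together (variable names are purely illustrative):

import base64
import json
import numpy

original = numpy.random.rand(4, 3).astype(numpy.float32)

# encode
encStr = json.dumps([str(original.dtype),
                     base64.b64encode(original.tobytes()).decode('ascii'),
                     original.shape])

# decode
enc = json.loads(encStr)
restored = numpy.frombuffer(base64.b64decode(enc[1]), numpy.dtype(enc[0])).reshape(enc[2])

assert restored.dtype == original.dtype
assert (restored == original).all()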

JSON is efficient and cross-compatible for many reasons, but naively dumping an array to JSON leads to unexpected results if you want to store and load numpy arrays of arbitrary type and dimension.

This solution stores and loads numpy arrays regardless of type or dimensionality and also restores them correctly (data type, shape, ...).

I tried several solutions myself months ago and this was the only efficient, versatile solution I came across.

daniel451
  • 2
    Upvoted because it's a usable answer. Two minor but related nitpicks. First, I'd suggest writing the array data as formatted text. That way it is human-readable and you get around possible endianness issues. Second, I would put both the dtype and the shape *before* the data, as a kind of "header". – Roland Smith Jun 07 '15 at 21:40
  • This still has the problem of requiring that the array be C-contiguous, and I highly suspect it'll produce incorrect output if the machine that serializes the array and the machine that deserializes it have different endianness. – user2357112 Jun 07 '15 at 22:32
  • I did something similar to your 'decompose the np.ndarray object into [dtype, data_buffer, shape], send, and recompose with np.frombuffer() and/or np.reshape()' approach for pushing the numpy array into shared memory for parallel processes, but I forgot about the shape, so my solution was limited to 1d only. – Trevor Boyd Smith Jan 07 '21 at 13:15
6

I found the code in Msgpack-numpy helpful. https://github.com/lebedov/msgpack-numpy/blob/master/msgpack_numpy.py

I modified the serialised dict slightly and added base64 encoding to reduce the serialised size.

By using the same interface as json (providing load(s),dump(s)), you can provide a drop-in replacement for json serialisation.

This same logic can be extended to add any automatic non-trivial serialisation, such as datetime objects (see the sketch after the usage example below).


EDIT: I've written a generic, modular parser that does this and more. https://github.com/someones/jaweson


My code is as follows:

np_json.py

# re-export the stdlib json API so this module can act as a drop-in replacement
from json import *
import json
import numpy as np
import base64

def to_json(obj):
    if isinstance(obj, (np.ndarray, np.generic)):
        if isinstance(obj, np.ndarray):
            return {
                '__ndarray__': base64.b64encode(obj.tobytes()).decode('ascii'),
                'dtype': obj.dtype.str,
                'shape': obj.shape,
            }
        elif isinstance(obj, (np.bool_, np.number)):
            return {
                '__npgeneric__': base64.b64encode(obj.tobytes()).decode('ascii'),
                'dtype': obj.dtype.str,
            }
    if isinstance(obj, set):
        return {'__set__': list(obj)}
    if isinstance(obj, tuple):
        return {'__tuple__': list(obj)}
    if isinstance(obj, complex):
        return {'__complex__': obj.__repr__()}

    # anything not handled above is not serialisable
    raise TypeError('Unable to serialise object of type {}'.format(type(obj)))


def from_json(obj):
    # check for numpy
    if isinstance(obj, dict):
        if '__ndarray__' in obj:
            return np.frombuffer(
                base64.b64decode(obj['__ndarray__']),
                dtype=np.dtype(obj['dtype'])
            ).reshape(obj['shape'])
        if '__npgeneric__' in obj:
            return np.frombuffer(
                base64.b64decode(obj['__npgeneric__']),
                dtype=np.dtype(obj['dtype'])
            )[0]
        if '__set__' in obj:
            return set(obj['__set__'])
        if '__tuple__' in obj:
            return tuple(obj['__tuple__'])
        if '__complex__' in obj:
            return complex(obj['__complex__'])

    return obj

# over-write the load(s)/dump(s) functions
def load(*args, **kwargs):
    kwargs['object_hook'] = from_json
    return json.load(*args, **kwargs)


def loads(*args, **kwargs):
    kwargs['object_hook'] = from_json
    return json.loads(*args, **kwargs)


def dump(*args, **kwargs):
    kwargs['default'] = to_json
    return json.dump(*args, **kwargs)


def dumps(*args, **kwargs):
    kwargs['default'] = to_json
    return json.dumps(*args, **kwargs)

You should then be able to do the following:

import numpy as np
import np_json as json
np_data = np.zeros((10,10), dtype=np.float32)
new_data = json.loads(json.dumps(np_data))
assert (np_data == new_data).all()
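
As a rough sketch of the datetime extension mentioned above (the __datetime__ tag and the ISO-8601 round trip are illustrative choices, not part of msgpack-numpy; datetime.fromisoformat needs Python 3.7+), wrapping the to_json/from_json functions from np_json.py:

from datetime import datetime
import json
import numpy as np
import np_json

def to_json_ext(obj):
    # serialise datetimes as tagged ISO-8601 strings, defer everything else
    if isinstance(obj, datetime):
        return {'__datetime__': obj.isoformat()}
    return np_json.to_json(obj)

def from_json_ext(obj):
    # restore tagged datetimes, defer everything else
    if isinstance(obj, dict) and '__datetime__' in obj:
        return datetime.fromisoformat(obj['__datetime__'])
    return np_json.from_json(obj)

data = {'when': datetime(2015, 6, 7, 21, 0), 'a': np.zeros((2, 2))}
payload = json.dumps(data, default=to_json_ext)
restored = json.loads(payload, object_hook=from_json_ext)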
Rebs
4

Msgpack has the best serialization performance in these benchmarks: http://www.benfrederickson.com/dont-pickle-your-data/

Use msgpack-numpy. See https://github.com/lebedov/msgpack-numpy

Install it:

pip install msgpack-numpy

Then:

import msgpack
import msgpack_numpy as m
import numpy as np

x = np.random.rand(5)
x_enc = msgpack.packb(x, default=m.encode)
x_rec = msgpack.unpackb(x_enc, object_hook=m.decode)
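
Continuing from the snippet above, a quick check that the round trip gives back a real ndarray with the same values and dtype:

assert isinstance(x_rec, np.ndarray)
assert x_rec.dtype == x.dtype
assert np.allclose(x, x_rec)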
thayne
2

If it needs to be human readable and you know that this is a numpy array:

import numpy as np
import json

a = np.random.normal(size=(50, 120, 150))
a_reconstructed = np.asarray(json.loads(json.dumps(a.tolist())))
print(np.allclose(a, a_reconstructed))
print((a == a_reconstructed).all())

Maybe not the most efficient as the array sizes grow larger, but works for smaller arrays.
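
Note that tolist() converts the elements to plain Python numbers, so the original dtype is not stored in the JSON; if the dtype matters it has to be supplied again on reconstruction, e.g.:

import numpy as np
import json

a32 = np.random.normal(size=(5, 5)).astype(np.float32)
# without dtype=..., asarray would give float64 back
roundtripped = np.asarray(json.loads(json.dumps(a32.tolist())), dtype=a32.dtype)
assert roundtripped.dtype == np.float32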

Chris.Wilson
1

This wraps the pickle-based answer by @user2357112 for easier JSON integration.

The code below encodes the pickled data as base64. It handles numpy arrays of any type/size without needing to remember what they were, as well as any other objects that can be pickled.

import numpy as np
import json
import pickle
import codecs

class PythonObjectEncoder(json.JSONEncoder):
    def default(self, obj):
        return {
            '_type': str(type(obj)),
            'value': codecs.encode(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL), "base64").decode('latin1')
            }

class PythonObjectDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        json.JSONDecoder.__init__(self, object_hook=self.object_hook, *args, **kwargs)

    def object_hook(self, obj):
        if '_type' in obj:
            try:
                return pickle.loads(codecs.decode(obj['value'].encode('latin1'), "base64"))
            except KeyError:
                return obj
        return obj


# Create arbitrary array
originalNumpyArray = np.random.normal(size=(3, 3))
print(originalNumpyArray)

# Serialization
numpyData = {
   "array": originalNumpyArray
   }
encodedNumpyData = json.dumps(numpyData, cls=PythonObjectEncoder)
print(encodedNumpyData)

# Deserialization
decodedArrays = json.loads(encodedNumpyData, cls=PythonObjectDecoder)
finalNumpyArray = decodedArrays["array"]

# Verify
print(finalNumpyArray)
print(np.allclose(originalNumpyArray, finalNumpyArray))
print((originalNumpyArray==finalNumpyArray).all())
VoteCoffee
1

Try numpy-serializer:

Download

pip install numpy-serializer

Usage

import numpy_serializer as ns
import numpy as np

a = np.random.normal(size=(50,120,150))
b = ns.to_bytes(a)
c = ns.from_bytes(b)
assert np.array_equal(a,c)
0

Try traitschema https://traitschema.readthedocs.io/en/latest/

"Create serializable, type-checked schema using traits and Numpy. A typical use case involves saving several Numpy arrays of varying shape and type."

SemanticBeeng
-5

Try using numpy.array_repr or numpy.array_str.
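
For what it's worth, these produce a display string rather than a lossless serialization; large arrays get truncated with ... under the default print options, and getting an array back requires eval'ing the string. A minimal illustration:

import numpy as np

a = np.arange(6, dtype=np.float32).reshape(2, 3)
s = np.array_repr(a)
print(s)
# array([[0., 1., 2.],
#        [3., 4., 5.]], dtype=float32)

# round-tripping means eval'ing the repr, which is fragile and unsafe for untrusted input
b = eval(s, {'array': np.array, 'float32': np.float32})
assert (a == b).all()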

Ken