
I fetch a big array of data from a server and store it in a dictionary of multi-dimensional arrays, to be used for a simple plot. It looks like:

>>> print(data)
{'intensity_b2': [array([  1.46562588e+09,   1.46562588e+09,   1.46562588e+09, ...,
     1.46566369e+09,   1.46566369e+09,   1.46566369e+09]), array([ 0.,  0.,  0., ...,  0.,  0.,  0.])]}
>>> print(len(data['intensity_b2'][0]))
37071

To avoid fetching the data every time I run the script, I want to save this data structure to a file. I try to store it as

with open("data.dat", 'w') as f:
    f.write(str(data))

and read it with

import ast

with open("data.dat", 'r') as f:
    data = ast.literal_eval(f.read())

as suggested here. However, I get an error

ValueError: malformed node or string: <_ast.Call object at 0x108fce5f8>

which I suspect is because the data gets stored with the literal ... shown in the first printout (i.e. the first print(data) above is exactly how the data looks in the file). How do I write a dictionary with a big array to a file and read it back later?

pingul

2 Answers


Your problem is that str is not a suitable way to serialise data. Typically, an object's string representation is meant to let a human understand what it is. For primitive objects it happens to be a format that you could even eval to get an equivalent object back, but this isn't true in general.
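For instance, a quick sketch of the difference (the _ast.Call in your traceback is exactly the array(...) constructor call that literal_eval refuses to evaluate, since it only accepts literals):

import ast
import numpy as np

# A list of floats round-trips: its repr is a literal.
print(repr([1.0, 2.0]))                 # [1.0, 2.0]
print(ast.literal_eval("[1.0, 2.0]"))   # [1.0, 2.0]

# A numpy array does not: its repr is a function call, not a literal.
print(repr(np.array([1.0, 2.0])))       # array([1., 2.])
ast.literal_eval("array([1., 2.])")     # ValueError: malformed node or string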

You need to decide how you want to serialise the data. You could use something like JSON, but then you'd need to figure out how to convert objects to/from primitive data types anyway, and I think it's already clear that you're not using just primitive data types.
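For example, a minimal sketch of the JSON route, assuming data is the dictionary of numpy arrays from the question (the default= hook turns each array into a plain list on the way out, and you rebuild the arrays yourself on the way back in):

import json
import numpy as np

# Write: the default hook converts each numpy array to a list so JSON can serialise it.
with open("data.json", "w") as f:
    json.dump(data, f, default=lambda obj: obj.tolist())

# Read: JSON hands back plain lists, so convert them back into arrays.
with open("data.json") as f:
    data = {key: [np.array(arr) for arr in value]
            for key, value in json.load(f).items()}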

You probably want to use pickle to create a serialised version of the data, which you will be able to unpickle later to get the same data types back.
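For instance, a minimal sketch, assuming data is the dictionary from the question:

import pickle

# Serialise the whole structure, numpy arrays included, in pickle's binary format.
with open("data.pickle", "wb") as f:
    pickle.dump(data, f)

# On later runs, read it back; the dict and the arrays come back with their types intact.
with open("data.pickle", "rb") as f:
    data = pickle.load(f)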

SpoonMeiser

You can use pickle to handle serialization properly:

In [23]: a
Out[23]: 
{'intensity_b2': [array('f', [1465625856.0, 1465625856.0, 1465625856.0]),
  array('f', [1465663744.0, 1465663744.0, 1465663744.0])]}

In [24]: pickle.dump(a, open('foo.p', 'wb'))

In [25]: aa = pickle.load(open('foo.p', 'rb'))

In [26]: aa
Out[26]: 
{'intensity_b2': [array('f', [1465625856.0, 1465625856.0, 1465625856.0]),
  array('f', [1465663744.0, 1465663744.0, 1465663744.0])]}

This does exactly what you want to do: saves your data structure to a file, and then reads it from the file.

However, it looks like you're reinventing the wheel here. You may want to have a look at numpy and pandas.
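For instance, if the values are numpy arrays, numpy.savez/numpy.load already handle the file format for you. A minimal sketch (the names timestamps and intensities are only guesses at what the two arrays hold):

import numpy as np

timestamps, intensities = data['intensity_b2']

# savez stores several named arrays in a single .npz file.
np.savez("intensity_b2.npz", timestamps=timestamps, intensities=intensities)

# load returns a dict-like object keyed by the names used above.
stored = np.load("intensity_b2.npz")
data = {'intensity_b2': [stored['timestamps'], stored['intensities']]}

pandas offers similar one-liners (DataFrame.to_csv, read_json, and so on) if you want a text format instead.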

t_tia
  • Thanks for your answer. Reinventing the wheel how so? The data will be used purely for plotting. – pingul Oct 14 '16 at 08:18
  • Guessing from the details you've provided, you receive data from a remote server, then convert it into a dictionary of nested arrays, then save it into a file. `pandas` and `numpy` provide highly efficient data structures for storing large data sets. You don't have to build a dictionary of arrays; you can use `numpy.array` or `pandas.DataFrame`. Both modules have built-in tools for serialization and saving to files, and `pandas` (which is a higher-level module than `numpy`) can convert data to and from virtually any format (JSON, CSV, ...). – t_tia Oct 14 '16 at 08:40
  • @pingul it depends on what exactly you do between receiving the data from a server, and saving it into a file (and whether you build plots in python). I'm just saying that if you do some data manipulation, it may happen that `pandas` already has a built-in tool for that. – t_tia Oct 14 '16 at 08:49
  • I actually call a Python API that already provides the information as shown. No modification of significance is made. Your input is nevertheless appreciated. – pingul Oct 14 '16 at 08:51