
I have a JSON string which contains a dictionary mapping indices to float values. It represents a vector. For example,

{
    'distro': {0: 2.42, 3: 2.56},
    'constant': 4.55,
    'size': 10000
}

represents a vector of size 10000 with 2.42 at index 0 and 2.56 at index 3. All other values in this vector are 4.55.

What's the most efficient way to represent this data structure? Will scipy.sparse help me? My main application is to create dense representations quickly, but I don't want to store them in memory beforehand (since there are many such vectors).

martianwars
  • what you have is already pretty space efficient (the json..) what sort of (mathematical?) operations do you need this datatype to support? – Aaron Mar 23 '17 at 18:01
  • I need to create its dense representation fast. Running a `for` loop is going too slow – martianwars Mar 23 '17 at 18:02
  • once you have created this array, what do you need to do with it that requires access to its elements? not all sparse array classes support all of numpy's ufuncs and other functions (numpy.dot being an especially notable one) – Aaron Mar 23 '17 at 18:05
  • Well I need to pass it to TensorFlow, which will then take care of it – martianwars Mar 23 '17 at 18:06
  • To answer in brief, scipy cannot do this naitively, but you could probably subclass it.. Knowing how TensorFlow calls your object would let you know where you need to make modifications (does it ask for an index or a memory view, etc..) – Aaron Mar 23 '17 at 18:10
  • Actually I would ideally like to have a fast function in place of my `for` loop for building the dense representation – martianwars Mar 23 '17 at 18:13
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/138858/discussion-between-aaron-and-martianwars). – Aaron Mar 23 '17 at 18:18
  • @Aaron, thank you! While I did not use what you answered, I realized that a dictionary was an overkill in my application. I've got it down to ~3 seconds per batch, which is fast enough – martianwars Mar 24 '17 at 04:26

2 Answers


I imagine your iterative way is something like this:

In [204]: dd = {
     ...:     'distro': {0: 2.42, 3: 2.56},
     ...:     'constant': 4.55,
     ...:     'size': 10,
     ...: }
In [205]: dd
Out[205]: {'constant': 4.55, 'distro': {0: 2.42, 3: 2.56}, 'size': 10}

In [207]: x = np.zeros(dd['size'])
In [208]: x[:] = dd['constant']
In [210]: for i,v in dd['distro'].items():
     ...:     x[i] = v
In [211]: x
Out[211]: array([ 2.42,  4.55,  4.55,  2.56,  4.55,  4.55,  4.55,  4.55,  4.55,  4.55])

An alternative to the x[:] assignment is x.fill(dd['constant']), but I don't think there's much difference in speed.

Here's a way of setting values from the dictionary without explicit iteration:

In [221]: ddvals = np.array(list(dd['distro'].items()),dtype='i,f')
In [222]: ddvals
Out[222]: 
array([(0,  2.42000008), (3,  2.55999994)], 
      dtype=[('f0', '<i4'), ('f1', '<f4')])
In [223]: x[ddvals['f0']]=ddvals['f1']
In [224]: x
Out[224]: 
array([ 2.42000008,  4.55      ,  4.55      ,  2.55999994,  4.55      ,
        4.55      ,  4.55      ,  4.55      ,  4.55      ,  4.55      ])

or without the structured array:

In [225]: vals = np.array(list(dd['distro'].items()))
In [226]: vals
Out[226]: 
array([[ 0.  ,  2.42],
       [ 3.  ,  2.56]])
In [227]: x[vals[:,0]] = vals[:,1]
...
IndexError: arrays used as indices must be of integer (or boolean) type
In [228]: x[vals[:,0].astype(int)] = vals[:,1]
In [229]: x
Out[229]: array([ 2.42,  4.55,  4.55,  2.56,  4.55,  4.55,  4.55,  4.55,  4.55,  4.55])

The dictionary's items() (wrapped in list() in PY3, where it returns a view) gives a list of tuples. Newer numpy versions don't allow floats as indices, so we have to add a few steps to preserve the integer key values.

This might be the simplest:

x[list(dd['distro'].keys())] = list(dd['distro'].values())

(keys, values and items are guaranteed to return entries in the same order, as long as the dict isn't modified in between.)

For this small case I suspect the plain iterative approach is faster. But for something much larger, one of the latter approaches is probably better. I can't predict where the crossover occurs.
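
If you want to see where that crossover is for your sizes, a rough timing sketch (my own choice of sizes and a 10% non-const density, both arbitrary):

import timeit
import numpy as np

def loop_fill(dd):
    x = np.full(dd['size'], dd['constant'])
    for i, v in dd['distro'].items():   # plain python iteration
        x[i] = v
    return x

def vector_fill(dd):
    x = np.full(dd['size'], dd['constant'])
    x[list(dd['distro'].keys())] = list(dd['distro'].values())   # one indexed assignment
    return x

for size in (10, 1000, 100000):
    dd = {'constant': 4.55, 'size': size,
          'distro': {i: 2.42 for i in range(0, size, 10)}}
    for fn in (loop_fill, vector_fill):
        print(size, fn.__name__, timeit.timeit(lambda: fn(dd), number=100))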

scipy.sparse makes 2d matrices. It does not implement any sort of const fill. (Pandas sparse does have such a fill). We could certainly construct a sparse matrix from dd['size'] and dd['distro']. But I don't know if it will offer any speed advantages.
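
Direct construction from the dict, skipping the dense intermediate, would be something like this sketch (dd as above):

from scipy import sparse

dd = {'distro': {0: 2.42, 3: 2.56}, 'constant': 4.55, 'size': 10}
keys = list(dd['distro'].keys())
vals = list(dd['distro'].values())
# 1 row, all entries in row 0; only the non-const values are stored
M = sparse.coo_matrix((vals, ([0] * len(keys), keys)), shape=(1, dd['size']))

Storing vals minus the constant would let M.toarray() + dd['constant'] rebuild the dense vector in one step.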

And if Tensorflow is your real target, then you may need to look more at its construction methods. Maybe you don't need to pass through numpy or sparse at all.
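
For what it's worth, tf.sparse.to_dense takes a default_value, which is exactly this kind of const fill. A minimal sketch, assuming the TF 2 tf.sparse API (the 1.x functions of the time were named differently):

import tensorflow as tf

# index/value pairs taken from the question's example
indices = [[0], [3]]            # one [index] per non-const entry of a 1-d vector
values = [2.42, 2.56]
sp = tf.SparseTensor(indices=indices, values=values, dense_shape=[10000])
dense = tf.sparse.to_dense(sp, default_value=4.55)   # TF does the const fill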


This x, without the const, can be represented as a scipy sparse matrix with:

In [247]: Xo = sparse.coo_matrix([x])
In [248]: Xo
Out[248]: 
<1x10 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in COOrdinate format>

Its key attributes are:

In [249]: Xo.data
Out[249]: array([ 2.42,  2.56])
In [250]: Xo.row
Out[250]: array([0, 0], dtype=int32)
In [251]: Xo.col
Out[251]: array([0, 3], dtype=int32)
In [252]: Xo.shape
Out[252]: (1, 10)

With Xr = Xo.tocsr(), the csr format is similar, except that the row attribute is replaced by an indptr array, which has one value per row (plus one), so it doesn't grow with the number of non-zero terms. csr is the format used for most sparse math.
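
For this 1x10 example the csr attributes would come out as:

In [253]: Xr = Xo.tocsr()
In [254]: Xr.indptr
Out[254]: array([0, 2], dtype=int32)
In [255]: Xr.indices
Out[255]: array([0, 3], dtype=int32)
In [256]: Xr.data
Out[256]: array([ 2.42,  2.56])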

There is also a dok format, which is actually a dictionary subclass:

In [258]: dict(Xo.todok())
Out[258]: {(0, 0): 2.4199999999999999, (0, 3): 2.5600000000000001}

If the input is valid json, you will need to convert the index keys to integers.

In [281]: jstr
Out[281]: '{"distro": {"0": 2.42, "3": 2.56}, "constant": 4.55, "size": 10}'
In [282]: jdd = json.loads(jstr)
In [283]: jdd
Out[283]: {'constant': 4.55, 'distro': {'0': 2.42, '3': 2.56}, 'size': 10}
In [284]: list(jdd['distro'].keys())
Out[284]: ['0', '3']
In [285]: np.array(list(jdd['distro'].keys()),int)
Out[285]: array([0, 3])
In [286]: np.array(list(jdd['distro'].values()))
Out[286]: array([ 2.42,  2.56])

My impression from SO searches is that json.loads is as fast as, if not faster than, eval. It has to parse a much simpler syntax.

python eval vs ast.literal_eval vs JSON decode
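
A quick sketch to check that on your own strings (the absolute numbers will vary):

import timeit, json, ast

jstr = '{"distro": {"0": 2.42, "3": 2.56}, "constant": 4.55, "size": 10}'
pstr = "{'distro': {0: 2.42, 3: 2.56}, 'constant': 4.55, 'size': 10}"

print('json.loads      ', timeit.timeit(lambda: json.loads(jstr), number=100000))
print('ast.literal_eval', timeit.timeit(lambda: ast.literal_eval(pstr), number=100000))
print('eval            ', timeit.timeit(lambda: eval(pstr), number=100000))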

If you can process the json strings and store them in some sort of intermediate data structure, there are several possibilities. How 'sparse' are these vectors? If the dictionary has values for nearly all of the 10000 'size' entries, it may be best to build the full numpy array and save that (e.g. with the np.save/np.load pair).
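
That dense route could look like this sketch ('vec0.npy' is just a placeholder name):

import json
import numpy as np

jdd = json.loads('{"distro": {"0": 2.42, "3": 2.56}, "constant": 4.55, "size": 10}')
x = np.full(jdd['size'], jdd['constant'])
x[np.array(list(jdd['distro'].keys()), int)] = list(jdd['distro'].values())
np.save('vec0.npy', x)        # one .npy file per vector
x2 = np.load('vec0.npy')      # fast, no parsing needed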

If it is sparse (say 10% of the values being non-const), then saving the two index and value arrays may make more sense (as in Out[285] and Out[286] above). Either keep them separate, or join them in the kind of structured array I produced earlier.
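
Saving just the non-const data would be something like this sketch (jdd parsed as above; the archive field names are arbitrary):

import numpy as np

np.savez('vec0.npz',
         keys=np.array(list(jdd['distro'].keys()), int),
         vals=np.array(list(jdd['distro'].values())),
         constant=jdd['constant'], size=jdd['size'])

with np.load('vec0.npz') as f:          # rebuild the dense vector on demand
    x = np.full(int(f['size']), float(f['constant']))
    x[f['keys']] = f['vals']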

hpaulj
  • you should see the discussion on the main comments thread, OP's problem is actually speed of parsing the json not allocation of arrays.... – Aaron Mar 23 '17 at 19:23
  • If the input is valid `json`, then `json.loads` may be as fast as `eval`, http://stackoverflow.com/a/20276991/901925. However the `distro` keys will be strings, not integers in valid `json`. – hpaulj Mar 23 '17 at 21:27

Based on our discussion, I have come up with an example of converting your json to an easy (and fast) to parse binary format. See the comments for a description. I used BytesIO for testing, but you would likely simply use a file opened in 'rb' mode (a sketch of that follows the code).

import numpy as np
import struct
import io


test_json = '''{
    'distro': {0: 2.42, 3: 2.56},
    'constant': 4.55,
    'size': 10000
}'''

#this will be slow but only have to take place once..
def json_to_bin(test_json):
    j = eval(test_json) #json provided was not well formed json, but was well formed python dict..
    const = j['constant']
    size = j['size']
    n = len(j['distro'])
    keys = j['distro'].keys()
    vals = [j['distro'][key] for key in keys]

    buf = io.BytesIO() #dummy file like object
    #struct.pack args:
        # format_string, *values_described_in_format_string
    #'<' (little-endian, standard sizes) avoids native alignment padding,
    #so bin_to_arr can read the fields back piecewise for any n
    buf.write(struct.pack("<dii{0}i{0}d".format(n), const, size, n, *keys, *vals))
    return buf

def bin_to_arr(buf):
    buf.seek(0)
    #unpack our first three values 
    #most important is n so we can unpack the right number of keys and values
    const, size, n = struct.unpack("<dii", buf.read(16)) #struct.calcsize('<dii') = 16
    #unpack keys
    fmt = "<{}i".format(n)
    nbytes = struct.calcsize(fmt)
    keys = np.array(struct.unpack(fmt, buf.read(nbytes))) #returns array (for indexing)
    #unpack vals
    fmt = "<{}d".format(n)
    nbytes = struct.calcsize(fmt)
    vals = struct.unpack(fmt, buf.read(nbytes)) #returns tuple
    #create ndarray
    arr = np.empty(size)
    arr.fill(const)
    arr[keys] = vals
    return arr

print(test_json)
print('=======binary========')
buf = json_to_bin(test_json)
buf.seek(0)
print(buf.read())
arr = bin_to_arr(buf)
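
And since bin_to_arr only uses seek and read, a real file drops in directly ('vec0.bin' is a placeholder name):

with open('vec0.bin', 'wb') as f:
    f.write(json_to_bin(test_json).getvalue())

with open('vec0.bin', 'rb') as f:       # file objects support seek/read like BytesIO
    arr = bin_to_arr(f)
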
Aaron