I imagine your iterative way is something like this:
In [204]: dd = {
...: 'distro': {0: 2.42, 3: 2.56},
...: 'constant': 4.55,
...: 'size': 10,
...: }
In [205]: dd
Out[205]: {'constant': 4.55, 'distro': {0: 2.42, 3: 2.56}, 'size': 10}
In [207]: x = np.zeros(dd['size'])
In [208]: x[:] = dd['constant']
In [210]: for i,v in dd['distro'].items():
...: x[i] = v
In [211]: x
Out[211]: array([ 2.42, 4.55, 4.55, 2.56, 4.55, 4.55, 4.55, 4.55, 4.55, 4.55])
An alternative to the x[:] assignment is x.fill(dd['constant']), but I don't think there's much difference in speed.
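Collected into a runnable function, the loop version looks something like this (the name build_dense and the use of np.full, which combines the zeros + fill steps, are my own choices):

```python
import numpy as np

def build_dense(dd):
    """Fill a vector with the constant, then overwrite the
    entries listed in the 'distro' dict (the iterative approach)."""
    x = np.full(dd['size'], dd['constant'])   # same as zeros + fill
    for i, v in dd['distro'].items():
        x[i] = v
    return x

dd = {'distro': {0: 2.42, 3: 2.56}, 'constant': 4.55, 'size': 10}
print(build_dense(dd))
```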
Here's a way of setting values from the dictionary without explicit iteration:
In [221]: ddvals = np.array(list(dd['distro'].items()),dtype='i,f')
In [222]: ddvals
Out[222]:
array([(0, 2.42000008), (3, 2.55999994)],
dtype=[('f0', '<i4'), ('f1', '<f4')])
In [223]: x[ddvals['f0']]=ddvals['f1']
In [224]: x
Out[224]:
array([ 2.42000008, 4.55 , 4.55 , 2.55999994, 4.55 ,
4.55 , 4.55 , 4.55 , 4.55 , 4.55 ])
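The structured-array steps above, pulled together into one self-contained snippet (note the 'f' field is float32, which is why 2.42 displays as 2.42000008):

```python
import numpy as np

dd = {'distro': {0: 2.42, 3: 2.56}, 'constant': 4.55, 'size': 10}

# compound dtype: an integer index field 'f0' and a float value field 'f1'
ddvals = np.array(list(dd['distro'].items()), dtype='i,f')

x = np.full(dd['size'], dd['constant'])
x[ddvals['f0']] = ddvals['f1']   # fancy indexing with the integer field
print(x)
```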
or without the structured array:
In [225]: vals = np.array(list(dd['distro'].items()))
In [226]: vals
Out[226]:
array([[ 0. , 2.42],
[ 3. , 2.56]])
In [227]: x[vals[:,0]] = vals[:,1]
...
IndexError: arrays used as indices must be of integer (or boolean) type
In [228]: x[vals[:,0].astype(int)] = vals[:,1]
In [229]: x
Out[229]: array([ 2.42, 4.55, 4.55, 2.56, 4.55, 4.55, 4.55, 4.55, 4.55, 4.55])
The dictionary items()
(or list(items())
in Py3) gives a list of (key, value) tuples. Newer numpy
versions don't accept floats as indices, so we have to add a few steps to preserve the integer key values.
This might be the simplest:
x[list(dd['distro'].keys())] = list(dd['distro'].values())
(keys
, values
and items
are guaranteed to iterate in the same key order for an unmodified dictionary, so this is safe).
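As one runnable snippet:

```python
import numpy as np

dd = {'distro': {0: 2.42, 3: 2.56}, 'constant': 4.55, 'size': 10}

x = np.full(dd['size'], dd['constant'])
# keys() and values() iterate in the same order, so these line up
x[list(dd['distro'].keys())] = list(dd['distro'].values())
print(x)
```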
For this small case I suspect the plain iterative approach is faster. But for something much larger, one of the latter approaches is probably better. I can't predict where the crossover occurs.
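If you want to find the crossover for your own sizes, a timeit sketch like this would do it (make_dd is a hypothetical helper for generating test dictionaries; the sizes are arbitrary):

```python
import timeit
import numpy as np

def make_dd(size, n_sparse, seed=0):
    # hypothetical helper: a dict with n_sparse random non-const entries
    rng = np.random.default_rng(seed)
    idx = rng.choice(size, n_sparse, replace=False)
    return {'distro': {int(i): float(rng.random()) for i in idx},
            'constant': 4.55, 'size': size}

def loop_fill(dd):
    x = np.full(dd['size'], dd['constant'])
    for i, v in dd['distro'].items():
        x[i] = v
    return x

def fancy_fill(dd):
    x = np.full(dd['size'], dd['constant'])
    x[list(dd['distro'].keys())] = list(dd['distro'].values())
    return x

for n in (10, 1000):
    dd = make_dd(10 * n, n)
    t_loop = timeit.timeit(lambda: loop_fill(dd), number=200)
    t_fancy = timeit.timeit(lambda: fancy_fill(dd), number=200)
    print(n, t_loop, t_fancy)
```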
scipy.sparse
makes 2d matrices. It does not implement any sort of const
fill (pandas sparse does have such a fill value). We could certainly construct a sparse
matrix from dd['size']
and dd['distro']
, but I don't know if it would offer any speed advantage.
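One way to build such a matrix directly from the dictionary, skipping the dense array entirely, is the (data, (row, col)) form of coo_matrix (the constant still isn't represented; only the non-const entries are stored):

```python
import numpy as np
from scipy import sparse

dd = {'distro': {0: 2.42, 3: 2.56}, 'constant': 4.55, 'size': 10}

cols = np.array(list(dd['distro'].keys()), dtype=int)
vals = np.array(list(dd['distro'].values()))
rows = np.zeros(len(cols), dtype=int)       # everything in row 0

M = sparse.coo_matrix((vals, (rows, cols)), shape=(1, dd['size']))
print(M.toarray())
```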
And if Tensorflow is your real target, then you may need to look more at its construction methods. Maybe you don't need to pass through numpy
or sparse
at all.
This x
, without the const
can be represented as a scipy
sparse matrix with:
In [247]: Xo = sparse.coo_matrix([x])
In [248]: Xo
Out[248]:
<1x10 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in COOrdinate format>
Its key attributes are:
In [249]: Xo.data
Out[249]: array([ 2.42, 2.56])
In [250]: Xo.row
Out[250]: array([0, 0], dtype=int32)
In [251]: Xo.col
Out[251]: array([0, 3], dtype=int32)
In [252]: Xo.shape
Out[252]: (1, 10)
Xr = Xo.tocsr()
The csr
format is similar, except the row
attribute is replaced with an indptr
array, which has one value per row (plus one), so it doesn't grow with the number of non-zero terms. csr is the format used for most sparse math.
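A quick look at those csr attributes for the same vector:

```python
import numpy as np
from scipy import sparse

x = np.zeros(10)
x[[0, 3]] = [2.42, 2.56]

Xr = sparse.coo_matrix([x]).tocsr()
print(Xr.indptr)    # one entry per row, plus one: [0 2]
print(Xr.indices)   # column indices, like coo's col
print(Xr.data)
```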
There is also a dok
format, which is actually a dictionary subclass:
In [258]: dict(Xo.todok())
Out[258]: {(0, 0): 2.4199999999999999, (0, 3): 2.5600000000000001}
If the input is valid json
, you will need to convert the index keys to integers.
In [281]: jstr
Out[281]: '{"distro": {"0": 2.42, "3": 2.56}, "constant": 4.55, "size": 10}'
In [282]: jdd = json.loads(jstr)
In [283]: jdd
Out[283]: {'constant': 4.55, 'distro': {'0': 2.42, '3': 2.56}, 'size': 10}
In [284]: list(jdd['distro'].keys())
Out[284]: ['0', '3']
In [285]: np.array(list(jdd['distro'].keys()),int)
Out[285]: array([0, 3])
In [286]: np.array(list(jdd['distro'].values()))
Out[286]: array([ 2.42, 2.56])
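The whole json-to-array pipeline in one piece:

```python
import json
import numpy as np

jstr = '{"distro": {"0": 2.42, "3": 2.56}, "constant": 4.55, "size": 10}'
jdd = json.loads(jstr)

x = np.full(jdd['size'], jdd['constant'])
keys = np.array(list(jdd['distro'].keys()), dtype=int)   # '0','3' -> 0,3
x[keys] = list(jdd['distro'].values())
print(x)
```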
My impression from SO searches is that json.loads
is as fast as, if not faster than, eval
. It has to parse a much simpler syntax.
python eval vs ast.literal_eval vs JSON decode
If you can process the json
strings and store them in some sort of intermediate data structure, there are several possibilities. How 'sparse' are these vectors? If the dictionary has values for nearly all of the 1000 'size' entries, it may be best to build the full numpy array and save that (e.g. with the np.save/np.load
pair).
If it is sparse (say 10% of the values being non-constant), then saving the two index and value arrays may make more sense (Out 285 and 284). Either keep them separate, or join them in the kind of structured array I produced earlier.
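For the separate-arrays route, np.savez bundles them into one .npz file and np.load gets them back; a sketch (the key names and the tempfile path are my own):

```python
import os
import tempfile
import numpy as np

dd = {'distro': {0: 2.42, 3: 2.56}, 'constant': 4.55, 'size': 10}

idx = np.array(list(dd['distro'].keys()), dtype=int)
vals = np.array(list(dd['distro'].values()))

# np.savez stores several arrays in a single .npz archive
path = os.path.join(tempfile.mkdtemp(), 'vec.npz')
np.savez(path, idx=idx, vals=vals,
         constant=dd['constant'], size=dd['size'])

# rebuild the dense vector from the saved pieces
npz = np.load(path)
x = np.full(int(npz['size']), float(npz['constant']))
x[npz['idx']] = npz['vals']
print(x)
```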