Convert a defaultdict to a numpy matrix or a CSV of a 2D matrix

Question

I have a defaultdict which stores the co-occurrence counts of every pair of words in a dataset. I did this to get a sparse representation and save memory, since not every pair is present in the dataset.

Is there a standard function which can convert this defaultdict to a numpy matrix, ideally in some sparse representation? I am aware of how to convert a dict to a numpy array, but I am looking for a more efficient way of converting the defaultdict to a matrix.

If that is not possible, is there a standard function that converts the defaultdict to a CSV, so that I can load the CSV using numpy?

Edit - I have found a workaround using pandas. I convert the defaultdict to a DataFrame and then the DataFrame to a numpy matrix. Is there any better method than this?

Sadly, this does not help with saving memory.

can you give an example of your default dict? and also the expected output? — Colonel Beauvel, May 30 '16 at 11:59
When it comes to accessing values that are already present, a `defaultdict` is the same as a regular `dict`. What are the keys and values of this `dict` like? Words or indexes? What kind of array layout do you want? If @Eric's answer does not fit, give us a small example - of the dictionary and desired array (possibly sparse). — hpaulj, May 30 '16 at 17:07
On building a sparse matrix from a dictionary of dictionaries. http://stackoverflow.com/questions/27770906/why-are-lil-matrix-and-dok-matrix-so-slow-compared-to-common-dict-of-dicts — hpaulj, May 30 '16 at 17:16

score 1 · Accepted Answer · answered May 30 '16 at 14:19

Assuming your data looks something like this:

from collections import defaultdict

data = defaultdict(int)  # keys are (row, column) index tuples
data[0,0] = 10
data[1,1] = 100

You want to use scipy.sparse.coo_matrix:

import scipy.sparse

items = list(data.items())  # list only needed for python3
vs = [v for (i,j), v in items]  # values
ii = [i for (i,j), v in items]  # row indices
jj = [j for (i,j), v in items]  # column indices
matrix = scipy.sparse.coo_matrix((vs, (ii, jj)))

Which gives slightly strange output:

>>> print(matrix)
  (0, 0)    10
  (1, 1)    100

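But you can use this object in most arithmetic as though it were a dense matrix. The COO format does not support element indexing or slicing, so convert it with matrix.tocsr() if you need that, or call matrix.toarray() to get an ordinary dense numpy array.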
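
Since the question mentions word pairs rather than integer indices, here is a minimal sketch of the same approach when the keys are (word, word) tuples. The sample counts and the names cooc, vocab and index are made up for illustration; the only step added to the code above is mapping each word to a row/column number first.

```python
from collections import defaultdict

import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical co-occurrence counts keyed by (word, word) pairs.
cooc = defaultdict(int)
cooc["cat", "dog"] += 2
cooc["dog", "cat"] += 2
cooc["cat", "fish"] += 1

# Give every distinct word an index (sorted so the layout is reproducible).
vocab = sorted({w for pair in cooc for w in pair})
index = {w: k for k, w in enumerate(vocab)}

# Build the three parallel arrays that coo_matrix expects.
rows = np.fromiter((index[a] for a, b in cooc), dtype=np.int64, count=len(cooc))
cols = np.fromiter((index[b] for a, b in cooc), dtype=np.int64, count=len(cooc))
vals = np.fromiter(cooc.values(), dtype=np.int64, count=len(cooc))

n = len(vocab)
matrix = coo_matrix((vals, (rows, cols)), shape=(n, n))

csr = matrix.tocsr()      # supports indexing and row slicing
dense = matrix.toarray()  # plain numpy array; only sensible for small vocabularies
```

Passing shape explicitly keeps the matrix square even if the last word in vocab never appears as a column (or row) key. Only the stored pairs take up memory until toarray() is called.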

Eric

There is also a `dok` sparse format, which is a `dict` subclass. The keys are `(i,j)` tuples. I found in other SO questions that the fastest way to add values to a `dok` is with an `update` from another dictionary. – hpaulj May 30 '16 at 16:58
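
A rough sketch of the `dok` route, reusing the sample data from the answer. It fills the matrix by plain item assignment instead of the bulk `update` mentioned in the comment, on the assumption that `update` may not be available or behave the same way in every SciPy version. Item assignment is slower but should work regardless.

```python
from collections import defaultdict

from scipy.sparse import dok_matrix

data = defaultdict(int)
data[0, 0] = 10
data[1, 1] = 100

# Size the matrix from the largest row and column index seen in the keys.
n_rows = 1 + max(i for i, j in data)
n_cols = 1 + max(j for i, j in data)

dok = dok_matrix((n_rows, n_cols), dtype=int)
for (i, j), v in data.items():
    dok[i, j] = v  # one entry at a time

csr = dok.tocsr()  # convert once filling is done, for fast arithmetic
```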
